Sunday, March 7, 2021

SRE - Concepts and Automation

As discussed before, implementing gradual changes reduces the cost of failure. 

SREs look for ways to roll out changes to a few users before releasing them to a wider audience, e.g., by implementing feature flags that turn features on or off per customer. Using these flags, SREs can enable a change for a small set of customers first, thereby reducing the cost of failure. The flag can then be gradually set to true for all customers.
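
To make the feature flag idea concrete, here is a minimal sketch of a per-customer flag with a gradual rollout percentage. The `FeatureFlag` class, its parameters and the customer IDs are hypothetical and only for illustration; a real flag system would typically store flags in a config service or database.

```python
import hashlib


class FeatureFlag:
    """Hypothetical per-customer feature flag with a gradual rollout."""

    def __init__(self, name, rollout_percent=0, allowlist=None):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 = off for all, 100 = on for all
        self.allowlist = set(allowlist or [])   # customers that always get the feature

    def _bucket(self, customer_id):
        # Stable hash, so a given customer always lands in the same bucket (0-99).
        digest = hashlib.sha256(f"{self.name}:{customer_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, customer_id):
        if customer_id in self.allowlist:
            return True
        return self._bucket(customer_id) < self.rollout_percent


# Start with a handful of named customers, then widen the rollout step by step.
flag = FeatureFlag("new_checkout", rollout_percent=5, allowlist=["acme"])
print(flag.is_enabled("acme"))
```

Because the bucket depends only on the flag name and the customer ID, raising `rollout_percent` only ever adds customers; nobody who already has the feature loses it. Setting it back to 0 acts as the quick "off switch" if the change turns out to be faulty.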

For an SRE, change is best when it is small and frequent.

SREs focus on CI/CD and canarying.

CI/CD (Continuous Integration/Continuous Delivery)

Continuous Integration refers to building, integrating and testing code within the development environment. The main goal is to enable engineers to integrate and test their code more often.

Continuous Delivery means that code can be pushed to production frequently. This stage builds on continuous integration and adds test automation and deployment automation.
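
As a rough illustration of how these stages chain together, the sketch below runs a list of stages in order and stops at the first failure, so a broken build never reaches deployment. The function name `run_pipeline` and the stage names are assumptions made for this example, not the API of any particular CI/CD tool.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order and stop at the first failing step.

    Each step is a callable returning True on success.
    Returns (completed_stage_names, failed_stage_name_or_None).
    """
    completed = []
    for name, step in stages:
        if not step():
            return completed, name
        completed.append(name)
    return completed, None


# Hypothetical stages; real ones would invoke the build, test and deploy tooling.
stages = [
    ("build", lambda: True),
    ("integration_tests", lambda: True),
    ("deploy", lambda: True),
]
completed, failed = run_pipeline(stages)
print(completed, failed)
```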

How does this help?
  • Minimizes code integration challenges
  • Promotes higher code quality
  • Makes recovery easier (e.g., turn off a feature flag)
  • Shortens time to market
  • Provides more metrics to work with
Canarying

There is a phrase "canary in a coal mine". Canaries are small birds that breathe faster than humans, so dangerous gases affect them sooner. Coal miners would take canaries with them to detect such gases: if the canaries died, the miners knew they were in danger.

To simplify:
  • We have something larger that we don't want to risk (human lives)
  • We can risk a smaller object (canaries)
  • Smaller object detects danger 
In SRE terms:
  • Canarying is deploying a change to a service for a small set of users, who are unaware of it
  • The impact on that group is evaluated
  • The decision on how to proceed is based on the metrics collected
  • If the release is buggy, only a small set of users is affected
  • A buggy release can be rolled back quickly
Note that the canary population should be small enough that a broken release does not endanger the whole service.
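
The "evaluate, then decide" step can be sketched as a simple comparison of error rates between the canary group and the rest of the fleet. Everything here is an assumption for illustration: the function name `evaluate_canary`, the thresholds `max_ratio`, `floor` and `min_requests`, and the use of error rate as the only metric. Real canary analysis usually looks at several metrics over time.

```python
def evaluate_canary(canary_errors, canary_requests,
                    baseline_errors, baseline_requests,
                    max_ratio=1.5, floor=0.001, min_requests=100):
    """Return 'wait', 'rollback' or 'promote' for a canary release.

    max_ratio:    how much worse than the baseline the canary may be (assumed value)
    floor:        minimum tolerated error rate, so a near-zero baseline
                  does not trigger a rollback on a single error (assumed value)
    min_requests: minimum canary traffic before deciding (assumed value)
    """
    if canary_requests < min_requests:
        return "wait"  # not enough data yet
    canary_rate = canary_errors / canary_requests
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    allowed = max(baseline_rate * max_ratio, floor)
    return "rollback" if canary_rate > allowed else "promote"


# Canary: 30 errors in 1,000 requests (3%). Baseline: 50 errors in 10,000 (0.5%).
print(evaluate_canary(30, 1000, 50, 10000))
```

With these made-up numbers the canary's error rate is well above the allowed 0.75%, so the function recommends a rollback, and only the small canary group was exposed to the bad release.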

TOIL Automation

As SREs put it: "If a human operator has to touch your system during normal operations, you have a bug."

What is TOIL?

TOIL is work directly related to a service that is:
  • Manual
  • Repetitive
  • Automatable
  • Without enduring value 
  • Scales linearly as the service grows
SREs look to reduce toil through automation so that services can scale without the operational work growing with them. This is the engineering aspect of SRE.

Automation provides the following values:
  • Consistency (manual human actions are prone to error)
  • Quick resolutions
  • Time saved
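
To see why "scales linearly" and "time saved" matter, here is a small back-of-the-envelope model. It assumes that manual toil grows with the number of services, while automation has a one-off build cost plus a fixed weekly upkeep. The function names and all the numbers are hypothetical.

```python
import math


def manual_hours(services, hours_per_service_per_week, weeks):
    # Manual toil grows linearly with the number of services.
    return services * hours_per_service_per_week * weeks


def automated_hours(build_hours, upkeep_hours_per_week, weeks):
    # Assumption: automation cost does not depend on the number of services.
    return build_hours + upkeep_hours_per_week * weeks


def break_even_weeks(services, hours_per_service_per_week,
                     build_hours, upkeep_hours_per_week):
    """Weeks until automation has paid for itself, or None if it never does."""
    saved_per_week = services * hours_per_service_per_week - upkeep_hours_per_week
    if saved_per_week <= 0:
        return None
    return math.ceil(build_hours / saved_per_week)


# 10 services at 0.5 h of manual work each per week,
# versus 40 h to build the automation and 1 h per week to maintain it.
print(break_even_weeks(10, 0.5, 40, 1))  # 10
```

Under these assumptions the automation pays for itself after 10 weeks, and doubling the number of services doubles the manual cost while leaving the automated cost unchanged.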
