March 2021

Monday, March 8, 2021

SRE - Measuring and reliability

Goals of measuring:

IT teams and business can understand the current status of the service.
Teams can analyze the data and take the necessary action to improve that status.
Teams can make better decisions based on the collected metrics.

Measuring everything makes your decisions data driven: you can make better decisions based on the operational data you collect.

SREs look for:

Measuring Reliability (make sure reliability percentage is high)
Measuring Toil (make sure toil percentage is decreasing)

Count tickets
Count alerts
Collect Statistics

Monitoring (What to monitor)

Alerting on every individual cause leads to a lot of noise and spam alerts.
Capacity alerts (scaling helps)
Latency
Traffic
Errors
Saturation
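
As a rough illustration of the last four items (latency, traffic, errors, saturation), the sketch below derives them from a window of request samples. The Request type, the golden_signals function, the choice of 5xx as "error", and the CPU-based saturation figure are all assumptions made for this example, not part of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code (assumed convention: 5xx means error)


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize one observation window into the four signals."""
    latencies = sorted(r.latency_ms for r in requests)
    if latencies:
        # Nearest-rank 95th percentile.
        p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
    else:
        p95 = 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": cpu_used / cpu_capacity,
    }


# Hypothetical sample: ten requests seen in a five-second window.
sample = [Request(latency_ms=10.0 * i, status=200) for i in range(1, 10)]
sample.append(Request(latency_ms=100.0, status=500))
print(golden_signals(sample, window_seconds=5, cpu_used=6, cpu_capacity=8))
```

Alerting on a handful of such aggregate signals, rather than on each underlying cause, is one way to keep alert noise down.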

SRE Skills and training:

What skills to train (for engineers moving to SRE or who are already working as an SRE)

Operations and Software Engineering
Monitoring systems
Production Automation
System Architecture
Troubleshooting
Culture of trust (shared ownership requires a culture of trust)
Incident Management

Types of SRE Team implementations:

Kitchen Sink (or Everything SRE)

Scope is unbounded.
Good place to start SRE in an Organization
Suits Orgs with few applications that need only one SRE team.
Since there is only one SRE team, there are no coverage gaps.

Infrastructure

Helps make other teams' jobs easier
Maintains shared services
For Orgs with multiple developer teams.
Disadvantage is that the improvements made by this team may not be directly linked with customer experience.

Tools

Builds tools for monitoring etc
Does not work for customer experience
Risk of increase of toil.

Product Application

Works to improve reliability of an application or product.
Provides clear focus and link between business priorities and team effort expenditure.

Embedded

SREs embedded with Developers.
Project or time bound
Less time for mentoring

Consulting

Less hands on
Not recommended until the org's complexity is large.
Can help to scale and add to an existing SRE team.
Risk is lack of sufficient context

Sunday, March 7, 2021

SRE - Concepts and Automation

As discussed before, implementing gradual changes reduces the cost of failure.

SREs look for ways to roll a change out to a few users before it is released to a wider audience, e.g. by implementing feature flags that turn features on or off per customer. Using these flags, SREs can expose a change to a small set of customers first, thereby reducing the cost of failure. Gradually, the feature flag can be set to true for all customers.
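
A per-customer flag with a gradual rollout can be sketched as below. This is a minimal illustration, not a real feature-flag library: the function name in_rollout, the flag name "new-checkout" and the customer IDs are all made up for the example. Each customer is hashed into a stable bucket, so raising the percentage only ever adds customers.

```python
import hashlib


def in_rollout(flag_name: str, customer_id: str, percent: float) -> bool:
    """Return True if this customer falls inside the rollout percentage."""
    # Hash flag + customer so every flag gets its own stable bucketing.
    key = f"{flag_name}:{customer_id}".encode()
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # bucket in 0..9999
    return bucket < percent * 100  # percent=5 -> buckets 0..499


# Hypothetical customer base; start by enabling the flag for about 5%.
customers = [f"customer-{i}" for i in range(1000)]
enabled = [c for c in customers if in_rollout("new-checkout", c, 5)]
print(f"{len(enabled)} of {len(customers)} customers see the new feature")
```

Setting percent to 0 turns the feature off for everyone, and 100 turns it on for everyone, which is the "set to true for all customers" end state.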

For an SRE, change is best when it is small and frequent.

SREs focus on CI/CD and canarying.

CI/CD (Continuous Integration/Continuous Delivery)

Continuous Integration refers to building, integrating and testing code within the development environment. Main goal is to enable engineers to work on code and test more often.

Continuous delivery means that one can frequently push code to production. This stage involves continuous integration, testing automation and deployment automation.

How does this help?

Minimize code integration challenges
Promotes higher code quality
Easier to recover (e.g. turn off a feature flag)
Time to market is shorter
Provides more metrics to work on

Canarying

There is a phrase "Canary in a coal mine". Canaries are birds which are small and breathe faster than humans. Coal miners would take canaries with them to detect dangerous gases. If the canaries died, it would imply the miners were in danger.

To simplify:

We have something larger that we don't want to risk (human lives)
We can risk a smaller object (canaries)
Smaller object detects danger

In SRE terms:

Canarying is deploying a change in service to a small set of users who are unaware of it.
Evaluate the impact to the group
Decide how to proceed based on the metrics collected
If it was a buggy release, the impact was to a small set of users.
Buggy release can be rolled back quickly

Note that the canary population should be small enough so it does not endanger the whole service if broken.
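
The evaluate-then-decide steps above can be sketched as a simple comparison between the canary group and the rest of the fleet. The Stats type, the canary_decision function and both thresholds are assumptions picked for illustration. A real rollout would usually compare more than one metric.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(baseline: Stats, canary: Stats,
                    max_increase: float = 0.01,
                    min_requests: int = 500) -> str:
    """Return 'wait', 'rollback' or 'promote' for a canary release."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    if canary.error_rate - baseline.error_rate > max_increase:
        return "rollback"  # canary is clearly worse than the baseline
    return "promote"  # no meaningful regression observed


baseline = Stats(requests=100_000, errors=100)  # users on the old version
canary = Stats(requests=1_000, errors=50)       # small canary group
print(canary_decision(baseline, canary))
```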

TOIL Automation

As per SREs "if a human operator has to touch your system during normal operations, you have a bug".

What is TOIL?

TOIL is work directly related to a service that is:

Manual
Repetitive
Automatable
Without enduring value
Scales linearly as the service grows

SREs look to reduce toil and scale services. This is the Engineering aspect of SRE.

SREs look to reduce TOIL by automation.

Automation provides the following values:

Consistency (manual human actions are prone to error)
Quick resolutions
Time saved

SRE Practices

Let's look at some of the SRE practices:

Accept failure as normal with blameless postmortems

Mistakes are inevitable when development and operations move at high velocity. Accepting failure as normal is one of the pillars of the DevOps philosophy.

But how does one implement this in their organization?

Experienced SREs:

are comfortable with failure, knowing that incidents and site outages will occur even when all the necessary precautions are taken.
build monitoring to eliminate ambiguity
build observability and document processes for incident response and handoff.

After an outage it is important to know/understand why an incident occurred and take steps to ensure it doesn't repeat.

This is done by conducting a blameless postmortem (also called retrospective meetings or retros)
Systematic approach is taken so that the team learns from every incident.
Components of a blameless postmortem:

Details of the incident and its timeline.
The actions taken to mitigate or resolve the incident
The incident's impact
Its trigger or root cause
The follow-up actions taken to prevent it from recurring.

Do note that no particular individual or team is blamed for the incident and hence the name blameless postmortem.
Intent of the postmortem:

Ensure the root causes are understood by the entire team.
Define or take effective action to ensure the incident doesn't occur again.
Reduce the likelihood of outages
Avoid multiplying complexities
Learn from your mistakes and those of others.

Blamelessness can have an immense positive effect on your organization and creates a culture of psychological safety for your teams.

In work environments with low psychological safety:

Team members keep their ideas/concerns to themselves
It can stifle learning and innovation

When an outage or incident occurs, good organizations do not focus on people; they focus on the systems and processes.

Service Level Objectives (SLOs) and Error Budgets

Developers generally aren't familiar with the systems on which their code runs, and operations teams sometimes have to deal with unstable code. As discussed before, agile methodology means code is churned out at high velocity, which makes life harder for operations teams, who prefer slow progress to ensure consistency and stability.

As we read before, SRE tries to break down the silos between development and operations teams and promotes shared ownership. These fundamentals help teams maintain the reliability of their services.

What is an Error budget?

We define service reliability as the fraction of Good Interactions out of Total Interactions.

[ (Good Interactions) / (Total Interactions) ] = Fraction of real users who experience a service that is working and available.

Error budget is the amount of unreliability that you are willing to tolerate

If we strive for 100% reliability, we slow down the release of new service features, which impacts the business. This is where the error budget comes in. As long as measured reliability stays above the target, that is, as long as the error budget has not been used up, new releases can be pushed out.

Essentially, error budget is an agreement that helps prioritize engineering work.

It helps to find the right balance between innovation and reliability.
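
To make the arithmetic concrete, here is a minimal sketch of an error budget check. It assumes the reliability target is given as a fraction (for example 0.999) and that good and total interactions are counted over the same window. The function name error_budget_report and the sample numbers are made up for illustration.

```python
def error_budget_report(target: float, total: int, good: int) -> dict:
    """Compare failed interactions against the budget implied by the target."""
    bad = total - good
    budget = (1 - target) * total  # failures we agreed to tolerate
    remaining = budget - bad
    return {
        "reliability": good / total,
        "budget": budget,
        "consumed": bad,
        "remaining": remaining,
        "can_release": remaining > 0,  # budget left: keep shipping
    }


# Hypothetical window: one million interactions, 600 of them failed.
report = error_budget_report(target=0.999, total=1_000_000, good=999_400)
print(report)
```

With a 99.9% target and one million interactions, about 1,000 failures are tolerated, so 600 failures still leave budget for new releases. Once the remaining budget reaches zero, the agreement is to prioritize reliability work over new features.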

SLOs

The performance of our system relative to SLOs should guide our business decisions.

SLOs are precise numerical targets for system reliability.

How to define SLOs?

A Service Level Indicator (SLI) tells you at any moment in time how well your service is doing. It's a quantifiable measure of service reliability.

SLI is expressed as a ratio of:

Good events divided by the number of valid events multiplied by 100%. This gives us a range between 0% and 100%. 0% indicates nothing works and 100% indicates nothing is broken.

An SLI should map to users' expectations, such as response time, latency and quality.
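
The ratio can be sketched in a few lines. What counts as a "good" event is a choice each team makes. Here it is assumed, purely for illustration, to be a request that did not fail with a 5xx status and was answered within 300 ms. The names sli_percent and is_good are likewise hypothetical.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI = good events / valid events * 100."""
    if valid_events == 0:
        raise ValueError("no valid events in this window")
    return good_events / valid_events * 100


def is_good(status: int, latency_ms: float, threshold_ms: float = 300) -> bool:
    # Assumed definition of "good": no server error and fast enough.
    return status < 500 and latency_ms <= threshold_ms


# Hypothetical request log as (status, latency in ms) pairs.
requests = [(200, 120.0), (200, 280.0), (500, 90.0), (200, 150.0)]
good = sum(1 for status, latency in requests if is_good(status, latency))
print(f"SLI: {sli_percent(good, len(requests)):.1f}%")
```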

An SLO is your target for SLI aggregated over time.

An SLO should generally be set just short of 100%, for example 99.999%.

Anything below your SLO results in unhappy customers whose expectations of a reliable service are not being met.
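
To see what "just short of 100%" means in practice, the sketch below converts an availability SLO into the downtime it tolerates. The 30-day window and the function name allowed_downtime_minutes are assumptions for this example. The SLO values are only samples.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime that still meet the SLO over the window."""
    window_minutes = window_days * 24 * 60
    return (100 - slo_percent) / 100 * window_minutes


for slo in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% over 30 days allows {minutes:.2f} minutes of downtime")
```

Each extra nine cuts the tolerated downtime by a factor of ten, which is why the target should reflect what users actually need rather than the highest number available.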

Saturday, March 6, 2021

Site Reliability Engineering (SRE)

What is Site Reliability Engineering (SRE)?

Site Reliability Engineering, or SRE, is both a practice and a job role, where engineering directly supports software operations. At Google, SRE is described by its founder as "what happens when you ask a software engineer to design an operations team."

SRE involves Ops work (tickets, on call and manual tasks) + development (internal tools, SRE tools and building automatic systems).

Teams need to track the time spent on Ops work versus Development work.

Over time, the Ops percentage should decrease.
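
Tracking that split can be as simple as the sketch below. The weekly hour figures and the function name ops_percentage are invented for illustration. In practice the hours would come from tickets or time tracking.

```python
def ops_percentage(ops_hours: float, dev_hours: float) -> float:
    """Share of total time spent on Ops work, as a percentage."""
    total = ops_hours + dev_hours
    return 100 * ops_hours / total if total else 0.0


# Hypothetical log: (week, ops hours, development hours).
weekly = [("W1", 30, 10), ("W2", 24, 16), ("W3", 18, 22)]
percentages = [ops_percentage(ops, dev) for _, ops, dev in weekly]

# The goal stated above: each period should show less Ops work.
decreasing = all(later < earlier
                 for earlier, later in zip(percentages, percentages[1:]))

for (week, _, _), pct in zip(weekly, percentages):
    print(f"{week}: {pct:.0f}% Ops")
print("Ops share is decreasing:", decreasing)
```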

Development and Operations teams have conflicting priorities and continue to work in silos.

Standardising practices to balance velocity (of feature development) against the risk to reliability: these practices, combined with a culture that supports them, form the core of SRE.

Since SRE principles are closely aligned with DevOps practices, let's understand what DevOps is and why we need it.

DevOps:

An IT team consists of Developers and Operations.
Developers are responsible for writing code for the system.
The Operations team is responsible for ensuring those systems operate reliably.
Ideally both the teams strive to make the end consumer/customer happy with the product.
Developers follow an agile model to deploy and push code as quickly as possible. So, developers want to work faster.
Operations teams prefer to work slower since they have to keep the system reliable. They focus on reliability and consistency.
Over time, this sometimes resulted in the Operations team inheriting code with little understanding of how it would run in production.
This would eventually hurt business needs since the priorities were not aligned.
How to close the gap between Developers and Operations? Enter "DevOps"

Reduce Organizational silos
Implement gradual change (reduce cost of failure)
Automate manual work
Measure everything

Remember that DevOps is a philosophy (it's not a development methodology or a technology). It highlights critical ways for IT teams to operate, but it doesn't define how an organization should implement practices to become successful. That's where SRE comes in.

SREs are engineers who are responsible for Operations.
SRE is a practice and a role.
SREs share ownership of production with Developers. This helps to reduce Organizational silos.
Together, SREs and Developers define Service Level Objectives (SLOs).
They together share responsibility to determine reliability and prioritize work.
This promotes shared vision and knowledge.
SREs aim to reduce the cost of failure by rolling out changes to a small percentage of users before rolling them out worldwide. This can be achieved with code changes such as feature flags, which can be turned on or off for a small percentage of users and then removed completely once the change is live worldwide. This also promotes more design thinking and prototyping.
SREs also focus on automation to reduce the amount of manual repetitive work.
SREs measure everything related to automation, reliability and the health of their systems. This leads to transparency and data driven decision making.
Do remember the GOAL of the SRE is to serve the business and the end user.

SRE Mission:

To protect, provide for and progress software and systems with consistent focus on availability, latency, performance and capacity.