Sunday, March 7, 2021

SRE Practices

Let's look at some of the SRE practices:

Accept failure as normal with blameless postmortems

  • Mistakes are inevitable when development and operations move at high velocity. Accepting failure as normal is one of the pillars of the DevOps philosophy.

But how does one implement this in their organization?

Experienced SREs: 

  • are comfortable with failure, knowing that incidents and site outages will occur even when all the necessary precautions have been taken.
  • build monitoring to eliminate ambiguity.
  • build observability and document processes for incident response and handoff.
After an outage, it is important to understand why the incident occurred and to take steps to ensure it doesn't repeat.
  • This is done by conducting a blameless postmortem (also called a retrospective meeting or retro).
  • A systematic approach is taken so that the team learns from every incident.
  • Components of a blameless postmortem:
    • Details of the incident and its timeline.
    • The actions taken to mitigate or resolve the incident.
    • The incident's impact.
    • Its trigger or root cause.
    • The follow-up actions taken to prevent it from recurring.
  • Note that no particular individual or team is blamed for the incident, hence the name blameless postmortem.
  • Intent of the postmortem:
    • Ensure the root causes are understood by the entire team.
    • Define or take effective action to ensure the incident doesn't occur again.
    • Reduce the likelihood of outages
    • Avoid multiplying complexities
    • Learn from your mistakes and those of others.
Blamelessness can have an immense positive effect on your organization and create a culture of psychological safety for your teams.

In work environments with low psychological safety:
  • Team members keep their ideas/concerns to themselves
  • It can stifle learning and innovation
When an outage or incident occurs, good organizations do not focus on people; they focus on the systems and processes.

Service Level Objectives (SLOs) and Error Budgets

Developers generally aren't familiar with the systems on which their code runs, and operations teams sometimes have to deal with unstable code. As we discussed before, agile methodology means code is churned out at high velocity, which makes life harder for operations teams, who prefer slower change to ensure consistency and stability.

As we read before, SRE tries to break down the silos between development and operations teams and promotes shared ownership. These fundamentals help teams maintain the reliability of their services.

What is an Error budget?

We define service reliability as the ratio of good interactions to total interactions.

[ (Good Interactions) / (Total Interactions) ] = Fraction of real users who experience a service that is working and available.

An error budget is the amount of unreliability that you are willing to tolerate.

If we strive for 100% reliability, we will slow down the release of new features, which will hurt the business. This is where the error budget comes in. As long as measured reliability stays above the reliability target, that is, as long as some error budget remains, new releases can be pushed out.

Essentially, an error budget is an agreement that helps prioritize engineering work.

It helps to find the right balance between innovation and reliability.
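As a rough illustration of the idea, here is a minimal Python sketch of an error budget used as a release gate. It assumes the reliability target is given as a fraction (for example 0.999) and that good and total interactions are counted over the same window. The function names `error_budget`, `budget_remaining` and `can_release` are made up for this example and do not come from any particular tool.

```python
def error_budget(target: float, total_interactions: int) -> float:
    """Bad interactions the reliability target tolerates over the window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    if total_interactions < 0:
        raise ValueError("total_interactions must not be negative")
    return (1.0 - target) * total_interactions


def budget_remaining(target: float, good: int, total: int) -> float:
    """Error budget left after subtracting the bad interactions seen so far."""
    if not 0 <= good <= total:
        raise ValueError("good must be between 0 and total")
    bad = total - good
    return error_budget(target, total) - bad


def can_release(target: float, good: int, total: int) -> bool:
    """Hypothetical policy: keep releasing only while some budget remains."""
    return budget_remaining(target, good, total) > 0
```

With a 99.9% target over 1,000,000 interactions, the budget is roughly 1,000 bad interactions. A service that has seen 500 failures can keep shipping, while one that has seen 2,000 has spent its budget and should prioritize reliability work.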

SLOs

The performance of our system relative to SLOs should guide our business decisions.

SLOs are precise numerical targets for system reliability.

How to define SLOs?

A Service Level Indicator (SLI) tells you at any moment in time how well your service is doing. It's a quantifiable measure of service reliability.

An SLI is expressed as a ratio: the number of good events divided by the number of valid events, multiplied by 100%. This gives us a range between 0% and 100%, where 0% indicates nothing works and 100% indicates nothing is broken.
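The ratio can be sketched in a few lines of Python. This is only an illustration: the function name `sli_percent` is invented here, and how "good" and "valid" events are counted (for example, which requests are excluded as invalid) is an assumption left to the caller.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI as a percentage: good events / valid events * 100."""
    if valid_events <= 0:
        # With no valid events the ratio is undefined, so refuse to guess.
        raise ValueError("valid_events must be positive")
    if not 0 <= good_events <= valid_events:
        raise ValueError("good_events must be between 0 and valid_events")
    return 100.0 * good_events / valid_events
```

For example, 9,990 good responses out of 10,000 valid requests gives an SLI of about 99.9%.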

An SLI should map to users' expectations, such as response time, latency and quality.

An SLO is your target for an SLI, aggregated over time.

An SLO should generally be just short of 100%, for example 99.999%.

Anything below your SLO will result in unhappy customers whose expectations of a reliable service are not being met.
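To get a feel for what "just short of 100%" means, the sketch below converts an SLO percentage into the downtime it would allow over a window, and checks a measured SLI against the target. It assumes a time-based view of availability (the definitions above are event-based) and a 30-day window chosen purely for illustration. The names `allowed_downtime` and `meets_slo` are hypothetical.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Downtime an availability SLO tolerates over the given window."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    # The unreliable fraction of the window is the time-based error budget.
    return window * (1.0 - slo_percent / 100.0)


def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed window length
    for slo in (99.9, 99.99, 99.999):
        print(f"{slo}% -> {allowed_downtime(slo, month)}")
```

Under these assumptions, a 99.9% SLO allows about 43 minutes of downtime in 30 days, while 99.999% allows only about 26 seconds, which is why each extra nine is expensive.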
