Saturday, March 6, 2021

Site Reliability Engineering (SRE)

What is Site Reliability Engineering (SRE)?

Site Reliability Engineering, or SRE, is both a practice and a job role in which engineering directly supports software operations. Ben Treynor Sloss, who founded the SRE team at Google, describes SRE as "what happens when you ask a software engineer to design an operations team."

SRE combines Ops work (tickets, on-call duty and manual tasks) with development work (internal tools, SRE tools and building automated systems).

Teams need to track how much time is spent on Ops work versus development work.

Over time, the Ops percentage should decrease.
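As a rough illustration of tracking that split, here is a minimal Python sketch. The `WorkEntry` type, the "ops"/"dev" categories and the hour figures are all made up for this example; a real team would pull these numbers from its ticketing or time-tracking system.

```python
from dataclasses import dataclass


@dataclass
class WorkEntry:
    category: str  # "ops" or "dev" (hypothetical labels for this sketch)
    hours: float


def ops_percentage(entries):
    """Return the share of logged hours spent on Ops work, as a percentage."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0  # nothing logged yet, so report zero rather than divide by zero
    ops = sum(e.hours for e in entries if e.category == "ops")
    return 100.0 * ops / total


# Illustrative data only: hours logged by a team in two consecutive quarters.
quarters = {
    "Q1": [WorkEntry("ops", 300), WorkEntry("dev", 200)],
    "Q2": [WorkEntry("ops", 220), WorkEntry("dev", 280)],
}

for name, entries in quarters.items():
    print(f"{name}: {ops_percentage(entries):.1f}% ops")
```

With this sample data the Ops share drops from 60.0% in Q1 to 44.0% in Q2, which is the kind of downward trend the metric is meant to show.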


Development and Operations teams have conflicting priorities and continue to work in silos.


Standardised practices that balance velocity (of feature development) against the risk to reliability, combined with a culture that supports them, form the core of SRE.


Since SRE principles are closely aligned with DevOps practices, let's first understand what DevOps is and why we need it.


DevOps:

  • An IT team consists of Developers and Operations.
  • Developers are responsible for writing code for the system.
  • The Operations team is responsible for ensuring those systems operate reliably.
  • Ideally, both teams strive to make the end consumer/customer happy with the product.
  • Developers follow an agile model to deploy and push code as quickly as possible. So, developers want to move faster.
  • The Operations team prefers to move slower since it has to keep the system reliable. It focuses on reliability and consistency.
  • Over time, this sometimes resulted in the Operations team inheriting code with little understanding of how it would run in production.
  • This would eventually hurt the business since the two teams' priorities were not aligned.
  • How to close the gap between Developers and Operations? Enter "DevOps"
    • Reduce Organizational silos
    • Implement gradual change (reduce cost of failure)
    • Automate manual work
    • Measure everything
  • Remember that DevOps is a philosophy (it's not a development methodology or a technology). It highlights critical ways for IT teams to operate, but it doesn't define how an organization should implement practices to become successful. That's where SRE comes in.
    • SREs are engineers who are responsible for Operations.
    • SRE is a practice and a role.
    • SREs share ownership of production with Developers. This helps to reduce Organizational silos.
    • Together, SREs and Developers define Service Level Objectives (SLOs).
    • Together, they share responsibility for determining reliability and prioritizing work.
    • This promotes shared vision and knowledge.
    • SREs aim to reduce the cost of failure by rolling out changes to a small percentage of users before rolling them out worldwide. This can be achieved with code-level mechanisms such as feature flags, which can be turned on/off for a small percentage of users and then removed completely once the change is live worldwide. This also promotes more design thinking and prototyping.
    • SREs also focus on automation to reduce the amount of manual repetitive work.
    • SREs measure everything related to automation, reliability and the health of their systems. This leads to transparency and data-driven decision making.
    • Do remember the GOAL of the SRE is to serve the business and the end user.
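The gradual rollout described in the list above can be sketched in a few lines of Python. This is only one possible approach, not a reference to any particular feature-flag product: the `in_rollout` function, the `new-checkout` flag name and the user IDs are hypothetical. The idea is to hash the flag name and user ID into a stable bucket so that the same user always gets the same answer, and raising the percentage only ever adds users.

```python
import hashlib


def in_rollout(flag_name: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the rollout percentage (0-100)."""
    # Hash flag + user so each flag gets its own stable, evenly spread buckets.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # bucket in the range 0..9999
    return bucket < percent * 100  # e.g. 5% -> buckets 0..499


# Hypothetical usage: expose a new code path to roughly 5% of users.
users = [f"user-{i}" for i in range(10000)]
enabled = sum(in_rollout("new-checkout", u, 5) for u in users)
print(f"{enabled} of {len(users)} users see the new code path")
```

Setting the percentage to 0 turns the flag off for everyone and 100 turns it on for everyone, at which point the flag check can be deleted from the code.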
SRE Mission:
To protect, provide for and progress software and systems with consistent focus on availability, latency, performance and capacity.
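To make the focus on availability a little more concrete, here is a small Python sketch that compares a measured availability against a Service Level Objective. The request counts and the 99.9% target are invented for illustration, and counting successful requests is just one of several ways a team might define availability.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests that succeeded (1.0 when there was no traffic)."""
    if total == 0:
        return 1.0
    return successful / total


def meets_slo(successful: int, total: int, slo_target: float) -> bool:
    """True if measured availability is at or above the SLO target."""
    return availability(successful, total) >= slo_target


# Hypothetical numbers for one measurement window.
total_requests = 1_000_000
failed_requests = 700
slo_target = 0.999  # assumed target of 99.9% successful requests

measured = availability(total_requests - failed_requests, total_requests)
print(f"availability: {measured:.2%}")
print(f"SLO met: {meets_slo(total_requests - failed_requests, total_requests, slo_target)}")
```

The same pattern could be applied to latency or capacity by swapping in a different measurement and target.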

