Monday, March 8, 2021

SRE - Measuring and reliability

Goals of measuring:

  • IT teams and the business can understand the current status of the service.
  • Teams can analyze the data and take the necessary action to improve that status.
  • Teams can make better decisions based on the collected metrics.
Measuring everything makes your decisions data-driven, so you can make better decisions based on the operational data you collect.

SREs look for:
  1. Measuring Reliability (make sure reliability percentage is high)
  2. Measuring Toil (make sure toil percentage is decreasing)
    1. Count tickets
    2. Count alerts
    3. Collect Statistics
  3. Monitoring (What to monitor)
    1. Alerting on every individual cause leads to a lot of noise and spam alerts.
    2. Capacity alerts (scaling helps)
    3. Latency
    4. Traffic
    5. Errors
    6. Saturation
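
The measurements in the list above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `Request` type, the function names, and the formulas (reliability as the share of successful requests, toil as the share of working hours spent on manual work, saturation as used capacity over total capacity) are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record, for illustration only)."""
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def reliability_pct(requests):
    """Share of successful requests, as a percentage."""
    if not requests:
        return 100.0  # assumption: with no traffic, nothing failed
    good = sum(1 for r in requests if r.ok)
    return 100.0 * good / len(requests)


def toil_pct(toil_hours, total_hours):
    """Share of working time spent on manual, repetitive work."""
    return 100.0 * toil_hours / total_hours


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p between 1 and 100)."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def golden_signals(requests, window_s, used_capacity, total_capacity):
    """Summarise latency, traffic, errors and saturation for one time window."""
    failed = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_s,
        "errors_pct": 100.0 * failed / len(requests),
        "saturation_pct": 100.0 * used_capacity / total_capacity,
    }


# Sample data: nine fast successes and one slow failure in a 10-second window.
sample = [Request(100.0, True)] * 9 + [Request(900.0, False)]
signals = golden_signals(sample, window_s=10, used_capacity=30, total_capacity=40)
print(reliability_pct(sample), toil_pct(12, 40), signals)
```

Tracking `toil_pct` week over week shows whether toil is actually decreasing. Ticket and alert counts could be tracked in the same way as simple per-week totals.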

SRE Skills and training:

Skills to train (for engineers moving into SRE or already working as SREs):
  • Operations and Software Engineering
  • Monitoring systems
  • Production Automation
  • System Architecture
  • Troubleshooting
  • Culture of trust (shared ownership requires a culture of trust)
  • Incident Management

Types of SRE team implementations:
  • Kitchen Sink (or Everything SRE)
    • Scope is unbounded. 
    • Good place to start SRE in an organization.
    • Suits orgs with few applications that need only one SRE team.
    • With a single SRE team, there are no coverage gaps.
  • Infrastructure
    • Helps make other teams' jobs easier.
    • Maintains shared services.
    • Suits orgs with multiple developer teams.
    • Disadvantage: improvements made by this team may not be directly linked to customer experience.
  • Tools
    • Builds tools for monitoring and similar needs.
    • Does not work directly on customer experience.
    • Risk of increased toil.
  • Product Application
    • Works to improve reliability of an application or product.
    • Provides clear focus and link between business priorities and team effort expenditure.
  • Embedded
    • SREs embedded with Developers.
    • Project or time bound
    • Less time for mentoring
  • Consulting
    • Less hands on
    • Not recommended until organizational complexity is large.
    • Can help to scale and add to an existing SRE team.
    • Risk is lack of sufficient context
