Understand

templeate to understand the service

service context

  1. what does the system do
  2. who is the customer, what is their primary use case?
  3. what is the user flow for the primary use case
  4. how is the customer impacted when the system is degraded
  5. what service level objectives have set in order to achieve the desired customer experience
  6. what service level indicators do we use to measure teh experience we want to deliver

pre-game checklist

before blueprint phase

  1. toolbox
    1. runbooks
    2. pagerduty service
    3. datadog dashboards
  2. complete the service context
  3. verify the test environment is healthy
  4. prepare and validate load generation test
  5. prepare failure injection with Gremlin

gameday

  • roles and responsibility
    • gameday coordinator
    • oncall / triage engineers
    • attendees (observe and validate the situation)