Understand
template to understand the service
service context
- what does the system do
- who is the customer and what is their primary use case
- what is the user flow for the primary use case
- how is the customer impacted when the system is degraded
- what service level objectives (SLOs) have we set in order to achieve the desired customer experience
- what service level indicators (SLIs) do we use to measure the experience we want to deliver
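
To make the last two questions concrete, here is a minimal sketch of how an availability SLI could be compared against an SLO. It assumes the SLI is defined as good requests divided by total requests; the names `SloReport` and `evaluate_availability` and the example numbers are hypothetical, not taken from any particular service.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float                     # measured fraction of good requests
    slo_target: float              # objective, e.g. 0.999 for "three nines"
    error_budget_remaining: float  # share of the allowed failures still unused
    met: bool                      # True when the SLI is at or above the target


def evaluate_availability(good: int, total: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target.

    Assumption: the SLI is good / total over one measurement window.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good / total
    allowed_bad = (1 - slo_target) * total  # failures the SLO tolerates
    actual_bad = total - good
    remaining = 1 - actual_bad / allowed_bad  # negative once the budget is overspent
    return SloReport(sli, slo_target, remaining, sli >= slo_target)


# Hypothetical window: 10,000 requests, 5 failures, 99.9% target.
report = evaluate_availability(good=9995, total=10000, slo_target=0.999)
print(report)
```

Filling in the service context with numbers like these before the gameday gives the team a shared definition of "degraded" to validate against during the exercise.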
pre-game checklist
before blueprint phase
- toolbox
- runbooks
- PagerDuty service
- Datadog dashboards
- complete the service context
- verify the test environment is healthy
- prepare and validate load generation test
- prepare failure injection with Gremlin
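
For the "prepare and validate load generation test" item, the sketch below shows one way a small load run could be validated before the gameday. It is only an illustration under stated assumptions: `run_load` and `send_request` are hypothetical names, `send_request` stands in for whatever call exercises the primary use case in the test environment, and a real gameday would more likely use a dedicated load tool.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_load(send_request: Callable[[], bool],
             total_requests: int,
             concurrency: int) -> Dict[str, float]:
    """Call send_request total_requests times across `concurrency` threads.

    send_request is a hypothetical callable that returns True on success.
    An exception raised by it is counted as a failed request.
    """
    if total_requests <= 0 or concurrency <= 0:
        raise ValueError("total_requests and concurrency must be positive")

    def timed_call(_: int):
        start = time.perf_counter()
        try:
            ok = bool(send_request())
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for ok, _ in results if not ok)
    # Nearest-rank 95th percentile of the observed latencies.
    p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
    return {
        "requests": float(len(results)),
        "error_rate": failures / len(results),
        "p95_seconds": p95,
    }


# Dry run against a stub that always succeeds, to validate the harness itself.
summary = run_load(lambda: True, total_requests=20, concurrency=4)
print(summary)
```

Running the harness against a stub first, then against the healthy test environment, gives a baseline error rate and latency to compare with once failure injection starts.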
gameday
- roles and responsibilities
- gameday coordinator
- oncall / triage engineers
- attendees (observe and validate the situation)