Maintainability

Maintainability

  • Operability Make it easy for operations teams to keep the system running smoothly.
  • Simplicity Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system. (Note this is not the same as simplicity of the user interface.)
  • Evolvability Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibil‐ity, modifiability, or plasticity.

Operability

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state
  • Tracking down the cause of problems, such as system failures or degraded per‐formance
  • Keeping software and platforms up to date, including security patches
  • Keeping tabs on how different systems affect each other, so that a problematic change can be avoided before it causes damage
  • Anticipating future problems and solving them before they occur (e.g., capacity planning)
  • Establishing good practices and tools for deployment, configuration manage‐ment, and more
  • Performing complex maintenance tasks, such as moving an application from one platform to another
  • Maintaining the security of the system as configuration changes are made
  • Defining processes that make operations predictable and help keep the produc‐tion environment stable
  • Preserving the organization’s knowledge about the system, even as individual people come and go
  • Providing visibility into the runtime behavior and internals of the system, with good monitoring
  • Providing good support for automation and integration with standard tools
  • Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninter‐rupted)
  • Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
  • Providing good default behavior, but also giving administrators the freedom to override defaults when needed
  • Self-healing where appropriate, but also giving administrators manual control over the system state when needed
  • Exhibiting predictable behavior, minimizing surprises