Reliability
Reliability
- continue to work correctly
- fault tolerate
Hardware faults
- add redundancy
- Disks may be set up in a RAID configuration
- servers may have dual power supplies and hot-swappable CPUs
- datacenters may have batteries and diesel generators for backup power.
- When one component dies, the redundant component can take its place while the broken component is replaced.
Software errors
bug, process uses up resources, service depends on slow down, cascading failures
- carefully thinking about assumptions and interactions in the system
- thorough testing, process isolation
- allowing processes to crash and restart
- measuring, monitoring and analyzing system behavior in production
human errors
humans are known to be unreliable.
Design systems in a way that minimizes opportunities for error.
Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
Allow quick and easy recovery from human errors, to minimize the impact in the case of a failure.