“A system is resilient if it can adjust its functioning prior to, during, or following events (changes, disturbances, and opportunities), and thereby sustain required operations under both expected and unexpected conditions.”—Erik Hollnagel
Software has bugs. Systems sometimes fail. People make mistakes. These are fundamental truths of technology, and hiring the best engineers in the world won’t change them. The best-performing teams and companies build reliable software despite bugs and mistakes. These “unicorn” companies are pushing the boundaries of software reliability through chaos engineering and by embracing resilience engineering. They hire the best and brightest systems engineers to work alongside their software developers to build more reliable systems.
But do companies that aren’t unicorns need to become experts in both human factors and their software stack in order to engineer reliable systems?
Jessica DeVita tells the story of how a team at Microsoft challenged themselves to retrospect their retrospectives and shares what they learned about applying human factors ideas to software development. You’ll learn how a nonexpert can contribute to software robustness and resilience, gain ideas on how to approach an unfamiliar software engineering system, and discover how to investigate the roles that language, accountability, error propagation, and hidden system resilience play in a software engineering system.