Getting Good At System Failure Analysis


Every failure is a mystery to be solved. Solving those mysteries is a skill that can be honed. Let’s talk about how to get better at figuring out what’s up when things go wrong! This is a talk full of both high-level advice and concrete tips from somebody who loves fixing weird production issues.

What does it mean to be good at debugging production issues? That’s the question we’ll explore in this talk! I’ll be sharing a grab bag of the postures, practices, tips, and tricks I’ve learned from years of hanging out near production.

Running production systems are not always designed for operability, and yet we still need to fix them. So my goal is to share techniques that apply across a range of operational maturity levels. The talk breaks down into a few sections:

  • Adopting a productive attitude towards failures
  • Learning to love logs, wherever you may find them
  • Guerrilla systems thinking and domain modeling
  • Code reading for failure analysis
  • Collaborating to remediate and solve production issues

Production failure analysis has been one of the most rewarding skills that I’ve built up in my career. I hope you’ll walk away from this talk with a few new tools and, more importantly, the inspiration to get better at responding to failures.

Speaker

Paul Hinze

In his career, Paul has been consistently drawn to Production: its affinity for chaos, its unforgiving nature, and ultimately its deep longing for attention. This has gotten him into trouble again and again. Once, he found himself in charge of production operations at a payments company. Then he stumbled on a globally scaled AWS application, where he worked on deployment and automation. Today, Paul has finally embraced his true nature at HashiCorp, where he works on tools that help others who feel the same call of Production.