There are a lot of great things about the cloud, but the “destroy and rebuild” philosophy which is really good for building a continuous delivery pipeline, really sucks when applied to troubleshooting production problems. When your application goes haywire, the most valuable engineering skill is not the the ability to bring up a copy of your system or even the knowledge of a your technology stack (although it doesn’t hurt). It is the skill of understanding and solving problems.
Finding the root cause of the issue and mitigating it with minimal disruption in production is a must-have skill for engineers responsible for managing and maintaining production systems, which nowadays includes ops, dbas and devs alike. In this talk I will discuss the skills required to troubleshoot complex systems, traits that prevent engineers from being successful at troubleshooting and discuss some techniques and tips and trick for troubleshooting complex systems in production.
Leon’s two decades of expertise were concentrated on architecting and operating complex, web-based systems to withstand crushing traffic (often unexpectedly). Over the years, he’s had a somewhat unique opportunity to design and build systems that run some of the most visited websites in the world. While his core expertise is in application development, he works his way around the whole technology stack from system architecture to databases design and optimization to front/back-end programming. He’s considered a professional naysayer by peers and has the opinion that nothing really works until it works for at least a million people.