Improved Resilience by Testing in Production

Abstract:

You know the story. You wrote some code, diligently wrote your unit and integration tests, babysat the deploy, performed a smoke test, and your new feature is shipped and looking good. Five minutes later though and the metrics in some unrelated part of the system have all tanked, everyone's getting paged, and it looks like it was your deploy that did it. You sheepishly roll back to make the system stable again. What happened? You did everything you were supposed to.

Reliability is paramount. Downtime creates irate customers and lost business opportunity. In this talk, we'll discuss how to improve the overall resilience of your systems, and prevent situations like the above, by testing in production. We will review four different tools and processes implemented at PagerDuty that test various dimensions of its production systems: Failure Friday: Intentionally injecting failure into production systems Watchdog: Constantly validating the functional specification Canary Deploys: Verifying forward progress is made during deploys End to End Provider Testing: Ensuring availability of our third party telephony providers For each of these, we will discuss their origin story, their tooling and functionality, pitfalls to watch out for, learnings we've gathered, and how they've measurably improved reliability.

Speaker:

Kenneth Rose is a Principal Software Engineer at PagerDuty in Toronto. Ken has worked on everything from the C# compiler at Microsoft, to 3D graphics tools at Autodesk, to most recently writing highly available event processing systems at PagerDuty.

Improved Resilience by Testing in Production

Past

Future