As teams grow and systems become more complicated, what used to be a simple set of commands quickly becomes a chain of branching logic that requires insight and expertise. Runbooks are a decent structure for communicating this to teams around the globe, but they may miss crucial information that only an expert would know. Runbooks are also only as good as their last update: in regular use they sit outside typical workflows and are forgotten until they’re needed most.
This talk describes Slack’s approach to good runbook hygiene, our process for moving to automated tools, and how that has helped us scale our teams and infrastructure. We’ll dive into the tradeoffs of automation and into how to keep the process accessible to all members of the team, so they can gain familiarity and skill with the tooling. Attendees will be able not only to improve the content and format of their runbooks but also to learn how to make code speak for operational procedures and ensure those procedures work best when they are needed most.
I’m a software engineer on the Operations Team at Slack. I love working on tough distributed systems problems and learning from my team and the community! I have been a fond user of and contributor to open source for over 10 years, with no plans to stop anytime soon.
Previously, I was the Mesos/Aurora SRE Tech Lead at Twitter, and an Internal Technology Resident at Google. I received a Computer Science BS in May 2010 from Chapman University in Orange, California.