Black Friday: Lessons in Resiliency and Incident Response at Shopify


Black Friday is the scariest and most exciting day of the year at Shopify. We see nearly a year’s worth of growth in one day. In 2017 we had a very smooth Black Friday, but behind the scenes we had several large-scale infrastructure failures. Although many things went wrong, there was almost no impact on our merchants. This talk describes our approach to incident response and follow-up, which focuses on reducing the impact of failures rather than on preventing them. We will go behind the scenes on the major failures we had during Black Friday 2017 and describe in detail the layers of resiliency built into the architecture that saved us from disaster. We’ll also cover how we practice incident response, both in responding to a crisis with a level head and in driving follow-up from incidents to make systems stronger. These basic principles of resiliency and incident response can be applied within any team to help reduce the impact of your next inevitable infrastructure failure.

Speaker

John Arthorne

 
John leads a developer team within the Shopify Production Engineering group, with a focus on building tools to improve the quality of production systems, and on engineering incident response. John is ...