Xero originally had a single operations team that managed all production incidents. As Xero has grown, it has become necessary to empower product teams to support their own services. To do this, Xero’s Site Reliability Engineering (SRE) team has developed a set of best practices around incident management. The challenge was to make it easy for other teams to adopt these practices, which is how Xero’s incident management chat-bot was born.
“Multivac” is our automated guide through Xero SRE’s incident management framework. It helps users define roles and responsibilities for an incident, communicate with a wider audience, and track down other teams to help, all with the aim of reducing the time to service restoration. In this talk, I’ll discuss why we built Multivac and how it has become an indispensable aid in managing our production environment.
I’m Anthony Angell, the team lead for Site Reliability Engineering at Xero in Auckland!
I’ve been doing incident management for a little over 7 years. I’ve been a part of the team that designed the Incident Management Framework here at Xero, and I’ve helped develop the training for on-call engineers and incident management.
One of the amusing questions: who is my favourite member of One Direction? Really! I can’t choose!