Prioritizing Explosions

How do I prioritize spending limited cycles on unlimited potential problems?

Failure Mode and Effects Analysis (FMEA) is a technique stolen from “real world” reliability engineering that allows us to create objectivity within a subjective domain.

By brainstorming a list of all the failure modes of a system and then subjectively ranking and quantifying them, we can generate a relative priority for each case, allowing for clear agreement between teams on which cases (if any) warrant action or automation.

A roadmap and example walkthrough of the process will cover factors such as:

  • Severity: How much $ or reputation do we lose?
    • (e.g., airlines and music companies can both use the process, even though the human consequences of an outage can be vastly different)
    • One goes down and we can’t listen to music; the other goes down and we die, but BOTH have to decide how to prioritize their limited resources
  • Occurrence: How often does it happen?
    • Burrs in deployment cause a great deal more friction at 100/wk than at 1/mo. At $COMPANY, test failures inside a single giant deploy script were painful, but QUANTIFYING them is what drove $COMPANY to actually implement a pipeline solution
  • Detection: How hard is ${this failure case} to catch?
    • Misconfigurations occur with varying frequency, but when they only become apparent $NEXT_RELEASE, they are MUCH harder to catch
    • It’s very easy to say “we can monitor that” – but the question is “What if it happened RIGHT NOW?”
  • Prioritization: Actually doing the math
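
The math in that last step can be sketched roughly as below. This is a minimal illustration, assuming the conventional FMEA Risk Priority Number (severity × occurrence × detection, each scored from 1 to 10, where a higher detection score means the failure is HARDER to catch). The `FailureMode` class, the `prioritize` helper, and every score shown are hypothetical and made up for the example, not taken from $COMPANY.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One brainstormed failure mode with its three subjective scores."""
    name: str
    severity: int    # 1 (trivial) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (happens constantly)
    detection: int   # 1 (caught immediately) to 10 (practically invisible)

    def __post_init__(self):
        # Reject scores outside the assumed 1-10 scale.
        for field in ("severity", "occurrence", "detection"):
            score = getattr(self, field)
            if not 1 <= score <= 10:
                raise ValueError(f"{field} must be between 1 and 10, got {score}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: the product of the three scores."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return the failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical scores, for illustration only.
modes = [
    FailureMode("test failure inside deploy script", severity=4, occurrence=9, detection=3),
    FailureMode("misconfiguration surfaces next release", severity=6, occurrence=4, detection=9),
    FailureMode("full datacenter outage", severity=10, occurrence=1, detection=1),
]

for mode in prioritize(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

With these made-up numbers the quiet misconfiguration (216) outranks both the frequent test failure (108) and the dramatic but rare, obvious outage (10), which is the kind of counterintuitive ordering the exercise is meant to surface.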


Quantifying things lets us set “objective” thresholds for where and when we need metrics, monitoring and automation – and just as important, it lets us PRIORITIZE those action items in a way that keeps both mgmt and engineering happy, on the same page, and focused on the money we’re making (or at least, not losing).
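
As a sketch of what such thresholds might look like, assume each failure mode has already been reduced to a single numeric priority score (for example, the conventional FMEA product of severity, occurrence and detection on 1 to 10 scales). The cut-off values and the `recommended_action` helper below are hypothetical; the point is only that the numbers are agreed on once, up front, rather than argued case by case.

```python
# Hypothetical cut-offs, highest first; each team would agree on its own numbers.
THRESHOLDS = [
    (200, "automate a fix"),
    (100, "add monitoring and alerting"),
    (40, "add a metric"),
]


def recommended_action(score: int) -> str:
    """Return the action for the first threshold the score meets."""
    for minimum, action in THRESHOLDS:
        if score >= minimum:
            return action
    # Below every threshold: consciously decide to do nothing.
    return "accept the risk"


for score in (216, 108, 10):
    print(score, "->", recommended_action(score))
```

Anything that falls below the lowest threshold is an explicit "no action" decision, which covers the "(if any)" case: some failure modes simply do not warrant the cycles.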



Jody Lent

I write Python as an SRE at GrubHub. I enjoy cracking open black boxes, distilling complex topics to simplicity, and single malt whisky.