Over the past five years of being on-call, seeing hundreds of production outages, and working with a growing SRE team, I've started to see patterns in what makes teams work effectively.
In this talk, I share an actual walkthrough of a production incident. The tooling we used. What we thought was going on. How we diagnosed it over time. How we eventually got to the root cause. And how we postmortemed it afterward.
I think there are a ton of great 'general' guides out there, like the Google SRE book, or posts on a culture of blameless postmortems. But few actually tell the story and walk through a 'play-by-play' of how an outage was handled.
Here, I share the exact details of what we did and why, along with some tips on the tools and processes that have made the biggest difference to reliability.