This track, "The stories behind the incidents," takes you behind the curtain and into the heart of system meltdowns at some of the world's leading software companies. Learn directly from SREs about real-world, high-impact production failures at scale, including the immediate challenges of triage, diagnosis, and mitigation in complex distributed systems. From these stories, you’ll gain insights into the nature of real incidents and how skilled SREs recover from them.
You’ll learn how ambiguous, confusing, and uncertain incidents feel when you’re in the middle of them, and hear how engineers improvised innovative solutions to restore service. You’ll also learn how fundamentally unpredictable incidents are and, consequently, why it is important to prepare to be surprised.
From this track
The Human Toll of Incidents & Ways To Mitigate It
Wednesday Nov 19 / 10:35AM PST
Have you ever wondered what it's like to respond to a significant incident? Walk through an hour-by-hour reconstruction of an incident response or two, focusing on what it was like to be "in the room" and on the human response to the incidents.

Kyle Lexmond
Production Engineer @Meta, Previously @AWS and @Twitter
When Incidents Refuse to End
Wednesday Nov 19 / 11:45AM PST
As engineers, we’re used to managing failure, but long-running outages hit differently. They stretch teams, systems, and assumptions about how incidents “should” play out.

Vanessa Huerta Granda
Resiliency Manager @Enova, Co-Author of the Howie Guide on Post Incident Analysis
The Ironies of AAII
Wednesday Nov 19 / 01:35PM PST
Details coming soon.

Paul Reed
Staff Incident Operations Manager @Chime
Week-Long Outage: Lifelong Lessons
Wednesday Nov 19 / 02:45PM PST
Routine database upgrades should be straightforward, especially with familiar, well-established technology. We were confident heading into our Elasticsearch upgrade, equipped with a solid plan and excited to see performance gains like we had seen from past upgrades.

Molly Struve
Staff Site Reliability Engineer @Netflix
Rebuilding A System After a Security Breach
Wednesday Nov 19 / 03:55PM PST
Details coming soon.

Sean Klein
Principal Technical Program Manager - Modern Incident Analysis @Microsoft Azure