You are viewing content from a past/completed QCon - February 2021.

Managing the Risk of Cascading Failure

Cascading failures are vicious cycles, where an initial problem triggers failures elsewhere, which continue to spread. We sometimes see these kinds of failures in our distributed systems, often as a result of client retries. They are difficult to scale out of: your new serving capacity just becomes overwhelmed.

In this talk, we’ll cover some of the mechanisms that cause cascading failures, what we can do to reduce the risk, and what to do if you find yourself in a cascading failure situation.
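
As a rough illustration of the retry mechanism mentioned above (not taken from the talk itself), the sketch below shows one widely used way for a client to limit retry-driven load: capped exponential backoff with random jitter, combined with a retry budget. The names `RetryBudget`, `backoff_delay`, and `call_with_retries`, and every default value, are hypothetical and chosen only for this example.

```python
import random
import time


class RetryBudget:
    """Hypothetical helper: permits retries only while they stay below a
    fixed fraction of the requests attempted so far."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio      # assumed limit: retries per original request
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        # Allow a retry only if it keeps total retries within the budget.
        if self.retries + 1 <= self.ratio * self.requests:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, budget, max_attempts=3, sleep=time.sleep):
    """Call operation(); retry on exception while attempts and budget allow."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # fail fast rather than add load to a struggling server
            sleep(backoff_delay(attempt))
```

With a budget ratio of 0.1, a client that has sent 100 requests may issue at most 10 retries in total, so a struggling backend sees roughly 1.1 times normal traffic rather than several times it. The jitter spreads retries over time so that clients that failed together do not all retry at the same instant.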


Laura Nolan

Site Reliability Engineer @Slack, Contributor to Seeking SRE, & SRECon Steering Committee

Laura Nolan's background is in Site Reliability Engineering, software engineering, distributed systems, and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly 'Site Reliability Engineering' book and contributed to the more recent 'Seeking SRE'. She is a member of the USENIX SREcon steering committee.

