Database-Backed Program Analysis for Finding Cascading Outage Bugs in Distributed Systems

Riza Suminto; Shan Lu; Cindy Rubio-González; Haryadi S. Gunawi. 12 January, 2021.
Communicated by Haryadi Gunawi.


Modern distributed systems (“cloud systems”) have emerged as a dominant backbone for many of today’s applications. As these systems collectively become the “cloud operating system”, users expect high dependability, including performance stability and availability. Small jitters in system performance or minutes of service downtime can have a huge impact on companies and on user satisfaction.

We aim to improve cloud system availability by detecting and eliminating cascading outage bugs (CO bugs). A CO bug is a bug that can cause simultaneous or cascading failures across the individual nodes in the system, eventually leading to a major outage. While hardware arguably is no longer a single point of failure, our large-scale studies of cloud bugs and outages reveal that CO bugs have emerged as a new class of outage-causing bugs and a single point of failure in software. We address the CO bug problem with the Cascading Outage Bugs Elimination (COBE) project. In this project, we (1) study the anatomy of CO bugs and (2) develop CO-bug detection tools to unearth them.

Original Document

The original document is available in PDF (uploaded 12 January, 2021 by Haryadi Gunawi).