Causal and Interpretable Learning for Datacenter Latency Prediction

Yi Ding; Avinash Rao; Henry Hoffmann. 20 December, 2020.
Communicated by Henry Hoffmann.


Stragglers, computations that exhibit extreme tail latencies, present a major challenge to delivering predictable performance in datacenters. Accurately predicting stragglers would enable efficient, proactive intervention. While a number of approaches use machine learning to predict computer system performance, they routinely rely on carefully curated training sets. Assembling the right training set requires prior knowledge, i.e., sufficient examples of all behaviors to be identified, which is challenging when a new workload may be unlike any workload seen in training. To make accurate predictions with neither prior knowledge nor carefully curated datasets, this paper presents Agatha, a straggler prediction framework that augments existing machine learning approaches with causal analysis. Agatha weights predictions using a statistical method called propensity scoring to account for a lack of stragglers in the training data. Additionally, Agatha exploits model-agnostic interpretation to gain insight into the reasons for straggling behavior. We evaluate Agatha on datacenter traces from Google and Alibaba and find that, compared to prior work that relies heavily on the training set, Agatha (1) produces much more accurate predictions, (2) does so much earlier in job execution, (3) provides interpretable results, and (4) reduces completion time. Agatha achieves these improved predictions even with no examples of stragglers in its training set, making it better suited for practical deployments.
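
As a rough illustration of the general idea behind propensity scoring, the sketch below shows one common form, self-normalized inverse propensity weighting, on toy data. It is not Agatha's formulation, which this abstract does not specify. The function names, the "small"/"large" task groups, and the latency values are all hypothetical. The point is only that samples from groups that are rarely observed receive larger weights, so an estimate built from a skewed sample moves back toward the full population.

```python
from collections import defaultdict

def estimate_propensity(groups, labeled):
    """Estimate P(labeled | group) as the per-group fraction of labeled samples."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for g, is_labeled in zip(groups, labeled):
        total[g] += 1
        hits[g] += int(is_labeled)
    return {g: hits[g] / total[g] for g in total}

def ipw_mean(values, groups, labeled, eps=1e-6):
    """Self-normalized inverse-propensity-weighted mean over labeled samples."""
    propensity = estimate_propensity(groups, labeled)
    num = 0.0
    den = 0.0
    for v, g, is_labeled in zip(values, groups, labeled):
        if not is_labeled:
            continue  # unlabeled values are unknown and never read
        w = 1.0 / max(propensity[g], eps)  # rarely labeled groups weigh more
        num += w * v
        den += w
    return num / den

# Hypothetical toy data: every "small" task has a known latency, but only
# one of four "large" tasks does (None marks an unknown latency).
groups  = ["small"] * 4 + ["large"] * 4
labeled = [True] * 4 + [True, False, False, False]
values  = [1.0] * 4 + [5.0, None, None, None]

observed = [v for v, is_labeled in zip(values, labeled) if is_labeled]
naive = sum(observed) / len(observed)        # skewed toward "small" tasks
weighted = ipw_mean(values, groups, labeled)  # reweighted toward all tasks
print(naive, weighted)
```

In this toy case the unweighted mean over observed tasks underestimates latency because large tasks are underrepresented among the labeled samples; the weighted mean compensates by counting the single observed large task four times.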

Original Document

The original document is available in PDF (uploaded 20 December, 2020 by Henry Hoffmann).