TR-2020-14
Causal and Interpretable Learning for Datacenter Latency Prediction
Yi Ding; Avinash Rao; Henry Hoffmann. 20 December, 2020.
Communicated by Henry Hoffmann.
Abstract
Stragglers---computations that exhibit extreme tail
latencies---present a major challenge to delivering predictable
performance in datacenters. Accurately predicting
stragglers would enable efficient, proactive intervention.
While a number of approaches use machine learning to
predict computer system performance, they routinely rely on
carefully curated training sets. Assembling the right training
set requires prior knowledge, i.e., sufficient examples of all
possible behaviors to be identified, which is challenging
because new workloads may be unlike any workload
seen in training. To make accurate predictions with
neither prior knowledge nor carefully curated datasets, this
paper presents Agatha, a straggler prediction framework that
augments existing machine learning approaches with causal
analysis. Agatha weights predictions using a statistical
method---called propensity scoring---to account for a lack
of stragglers in training data. Additionally, Agatha exploits
model-agnostic interpretation to gain insights into the
reasons for straggling behavior. We evaluate Agatha on
datacenter traces from Google and Alibaba and find that,
compared to prior work that is heavily reliant on the training
set, Agatha (1) produces much more accurate predictions,
(2) does so much earlier in job execution, (3) provides
interpretable results, and (4) reduces completion time.
Agatha achieves these improved predictions even with no
examples of stragglers in its training set, making it better
suited for practical deployments.
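The abstract does not give Agatha's exact formulation, but the general idea behind propensity-based reweighting can be sketched as follows. This is a minimal illustrative example, not the paper's method: it estimates how common each feature value is in the training set (its propensity) and weights each example by the inverse, so under-represented behaviors, such as straggler-like conditions, count more during training. All names and the toy data are assumptions.

```python
# Illustrative sketch of inverse-propensity weighting; the paper's actual
# scoring model is not reproduced here.

def propensity(feature, examples):
    """Estimate P(example has this feature value) from the training set."""
    matches = sum(1 for x, _ in examples if x == feature)
    return matches / len(examples)

def inverse_propensity_weights(examples):
    """Weight each example by 1/propensity so rare behaviors
    (e.g., straggler-like conditions) get larger weights."""
    return [1.0 / propensity(x, examples) for x, _ in examples]

# Toy training set: (feature, latency) pairs, where feature 1 stands for a
# heavily loaded node. Such examples are rare, mimicking a training set
# with few or no stragglers.
examples = [(0, 10), (0, 12), (0, 11), (1, 95)]
weights = inverse_propensity_weights(examples)
print(weights)  # the rare feature value gets weight 4.0, common ones 4/3
```

In a full pipeline, these weights would be passed to the learner (most libraries accept per-example sample weights), counteracting the scarcity of straggler examples without curating a new training set.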
Original Document
The original document is available in PDF (uploaded 20 December, 2020 by
Henry Hoffmann).