TR-2023-02

ELIMINATING THE CAPACITY VARIATION PENALTY FOR CLOUD RESOURCE MANAGEMENT

Chaojie Zhang. 15 January, 2023.
Communicated by Andrew Chien.

Abstract

Growing power grid challenges from rapid decarbonization, together with pressure to reduce carbon emissions and power cost, compel data centers to operate with capacity that varies over periods of hours or days, perhaps dynamically in concert with the use of renewable generation. With data centers exceeding 10% of load in many grids, the implied capacity variation may approach 50%. For today’s computing, variable resource capacity is problematic, causing severe loss in throughput and corresponding resource efficiency.

Our approach is to create intelligent resource management for variable capacity resources. Traditional resource managers were built on the assumption of constant capacity; they schedule jobs that fail abruptly when capacity decreases, wasting the resources those jobs have already consumed. To understand scheduling performance under variable capacity, we define three key dimensions of variation that lead to performance loss. We use cloud and HPC production workloads and explore the multi-dimensional capacity change space, characterizing scheduler performance in resource efficiency, job failures, and waiting time. Moreover, to improve performance, we consider scheduling techniques to cope with capacity loss. We propose intelligent termination policies to minimize job failures and lost resource efficiency. Then, we take a broader view to prepare for capacity variation altogether. We consider two dimensions of uncertainty, capacity and workload, and explore the corresponding information space that reduces uncertainty. We propose new scheduling techniques that exploit this information to prevent job failures and increase resource efficiency.

We evaluate traditional schedulers under varying resource capacities using a diverse set of workloads, including one HPC and three cloud workloads. Results show that capacity variation can decrease goodput by up to 60%, incurring 15-40% job failures. Among the variability dimensions, the results show that dynamic range, structure, and change frequency are all important, each in some cases producing 10-40% goodput losses. A drill-down with Google cloud workloads shows that variable capacity can cause serious problems, including up to 70% goodput loss, 20% job failures, and a 15X increase in job wait time. Careful study of performance versus variability shows that avoiding major harm, such as goodput loss, requires limiting variation to <10% dynamic range. Such a limit prevents the cloud from performing significant temporal load shifting to reduce carbon emissions or power costs.

We designed and compared the performance of intelligent termination policies to cope with capacity loss, considering a variety of workloads and variation traces. Our experimental results demonstrate that these new scheduling techniques achieve significant performance improvements under resource variability, with a 10-66% goodput increase and a 1.6-3X reduction in job failures. Using job attributes and progress to minimize wasted computation produces a 44% goodput increase on average and nearly eliminates job failures. Realistic examples show that with these scheduling techniques, a typical data center can achieve up to 15% carbon emission reduction and 14% power cost savings by exploiting resource capacity variations. Then, we take a broader view and design new scheduling schemes that seek to prepare for variation, using Google cloud workloads, which represent a hard case. These new schedulers exploit a variety of potential information about workload and capacity variation to reduce uncertainty, increasing goodput by up to 180%, decreasing job failure rate by 5-15X, and decreasing job waiting time by 1.4-4X. Within the information space, runtime classification is critical. Exploiting this information, the LongShort algorithm can drastically improve the ability to support variation in capacity, from <10% to 50%, while maintaining performance. These results demonstrate the promising benefits of new scheduling schemes for capacity variations but require future validation with complex workload constraints.
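
As an illustration of how job attributes and progress might guide termination, the following is a minimal sketch and not the policy evaluated in the report. It assumes a hypothetical `RunningJob` record holding each job's core count and elapsed runtime, and a greedy heuristic that, when capacity drops, terminates the jobs with the least progress first so that the fewest core-hours of completed work are discarded. The names `RunningJob` and `select_jobs_to_terminate` are invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunningJob:
    """Hypothetical record of a running job (not taken from the report)."""
    job_id: str
    cores: int
    elapsed_hours: float  # progress so far; lost if the job is terminated

    @property
    def wasted_core_hours(self) -> float:
        # Work thrown away if this job is killed without checkpointing.
        return self.cores * self.elapsed_hours


def select_jobs_to_terminate(running: List[RunningJob],
                             cores_to_free: int) -> List[RunningJob]:
    """Greedy heuristic: free at least `cores_to_free` cores while
    discarding little completed work.

    Jobs are ordered by wasted work per core freed, which is simply
    their elapsed time; ties prefer larger jobs so fewer jobs fail.
    This is a heuristic sketch, not an optimal selection.
    """
    victims: List[RunningJob] = []
    freed = 0
    for job in sorted(running, key=lambda j: (j.elapsed_hours, -j.cores)):
        if freed >= cores_to_free:
            break
        victims.append(job)
        freed += job.cores
    return victims


if __name__ == "__main__":
    jobs = [
        RunningJob("a", cores=4, elapsed_hours=10.0),
        RunningJob("b", cores=8, elapsed_hours=1.0),
        RunningJob("c", cores=2, elapsed_hours=0.5),
    ]
    for victim in select_jobs_to_terminate(jobs, cores_to_free=6):
        print(victim.job_id, victim.wasted_core_hours)
```

In this toy case, freeing 6 cores terminates the two youngest jobs and spares the job with 10 hours of progress. A policy that ignored progress could instead kill the long-running job and discard far more completed work.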

While capacity variation poses serious challenges to conventional resource managers, our intelligent resource management shows significant improvement, eliminating the variation penalty and demonstrating promising benefits of future variable capacity data centers.

Original Document

The original document is available in PDF (uploaded 15 January, 2023 by Andrew Chien).