Handling performance tickets is expensive in data centers, where physical boxes host multiple virtual machines (VMs). A large body of tickets arises from resource usage warnings, e.g., CPU and RAM usage exceeding predefined thresholds. The transient nature of CPU and RAM usage, as well as its strong correlation across time among co-located VMs within boxes, drastically increases the complexity of ticket management. Based on large-scale resource usage data collected from production data centers, with 6K physical boxes and more than 80K VMs, we first discover patterns of spatial and temporal dependencies among and within the usage series of co-located resources. Leveraging our key findings, we develop an active ticket managing (ATM) system that aims to drastically reduce usage tickets. ATM consists of: 1) a spatial-temporal dependency-based time series prediction methodology and 2) a proactive capacity planning policy for CPU and RAM resources, applied to VMs co-located within a box and to boxes within a single data center client. ATM exploits the spatial-temporal dependency across and within multiple resources of co-located VMs and single-client boxes for usage prediction, and then actuates proactive capacity planning. Evaluation results on traces of 6K physical boxes from operating data centers show that ATM provides accurate prediction of usage series in cloud data centers with low computational overhead. At the same time, ATM achieves significant ticket reduction, up to 60%, for both VM and box usage series.
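To make the notion of a usage ticket concrete, the following is a minimal Python sketch, not the ATM implementation. It assumes that a ticket is issued whenever utilization (usage divided by allocated capacity) exceeds a threshold, and that a predicted usage series is already available. The function names `count_tickets` and `min_capacity`, the threshold value, and the sample numbers are all hypothetical, chosen only for illustration.

```python
from typing import Sequence


def count_tickets(usage: Sequence[float], capacity: float, threshold: float) -> int:
    """Count samples whose utilization (usage / capacity) exceeds the threshold.

    Assumption: one ticket per violating sample; the paper's ticketing rule may differ.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return sum(1 for u in usage if u / capacity > threshold)


def min_capacity(predicted: Sequence[float], threshold: float) -> float:
    """Smallest capacity that keeps every predicted sample at or below the threshold.

    Assumption: a simple peak-based sizing rule, used here only as a stand-in
    for a proactive capacity planning policy.
    """
    if not predicted:
        raise ValueError("predicted series must not be empty")
    return max(predicted) / threshold


# Hypothetical CPU usage samples (in cores) for one VM.
predicted_usage = [2.0, 3.0, 5.0, 4.0]

tickets_before = count_tickets(predicted_usage, capacity=6.0, threshold=0.5)
new_capacity = min_capacity(predicted_usage, threshold=0.5)
tickets_after = count_tickets(predicted_usage, capacity=new_capacity, threshold=0.5)
```

In this sketch, resizing ahead of time removes the predicted violations; in practice the benefit depends on prediction accuracy and on the capacity actually available within the box.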
|Number of pages|14|
|Journal|IEEE Transactions on Network and Service Management|
|Publication status|Published - 1 Mar 2018|
- capacity planning
- Cloud data center
- reliability analysis
- spatial-temporal prediction