Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
10-2024
Abstract
To ensure the reliability of cloud systems, their run-time status, which reflects service quality, is periodically monitored with monitoring metrics, i.e., KPIs (key performance indicators). When performance issues occur, root cause localization pinpoints the specific KPIs responsible for the degradation of overall service quality, facilitating prompt problem diagnosis and resolution. To this end, existing methods generally locate root-cause KPIs by identifying the KPIs that exhibit an anomalous trend similar to that of the overall service performance. While straightforward, relying solely on similarity calculation may be ineffective for cloud systems with complicated interdependent services. Recent deep learning-based methods offer improved performance by modeling these intricate dependencies. However, their high computational demand often prevents them from meeting the efficiency requirements of industrial applications, and their lack of interpretability further restricts their practicality. To overcome these limitations, we propose KPIRoot, an effective and efficient method for root cause localization that integrates the advantages of both similarity analysis and causality analysis, where similarity measures the trend alignment of KPIs and causality measures the sequential order in which KPIs vary. Furthermore, we leverage symbolic aggregate approximation to produce a more compact representation of each KPI, improving the overall efficiency of the analysis. The experimental results show that KPIRoot outperforms seven state-of-the-art baselines by 7.9% to 28.3%, while reducing time cost by 56.9%. Moreover, we share our experience of deploying KPIRoot in the production environment of a large-scale cloud provider, Cloud ${\mathcal{H}^{\ast}}$.
Discipline
Software Engineering
Research Areas
Intelligent Systems and Optimization
Areas of Excellence
Digital transformation
Publication
Proceedings of the 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), Tsukuba, Japan, October 28-31
First Page
403
Last Page
414
Identifier
10.1109/ISSRE62328.2024.00046
Publisher
IEEE
City or Country
Piscataway
Citation
GU, Wenwei; SUN, Xinying; LIU, Jinyang; HUO, Yintong; CHEN, Zhuangbin; ZHANG, Jianping; GU, Jiazhen; YANG, Yongqiang; and LYU, Michael R.
KPIRoot: Efficient monitoring metric-based root cause localization in large-scale cloud systems. (2024). Proceedings of the 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), Tsukuba, Japan, October 28-31. 403-414.
Available at: https://ink.library.smu.edu.sg/sis_research/10730
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.