Research Collection School Of Computing and Information Systems

KPIRoot+: An efficient integrated framework for anomaly detection and root cause analysis in large-scale cloud systems

Publication Type

Journal Article

Version

acceptedVersion

Publication Date

12-2025

Abstract

To ensure the reliability of cloud systems, their runtime status, which reflects service quality, is periodically monitored through metrics known as key performance indicators (KPIs). When performance issues occur, root cause localization pinpoints the specific KPIs responsible for the degradation of overall service quality, facilitating prompt problem diagnosis and resolution. To this end, existing methods generally locate root-cause KPIs by identifying the KPIs that exhibit an anomalous trend similar to that of the overall service performance. While straightforward, relying solely on similarity calculation may be ineffective for cloud systems with complicated interdependent services. Recent deep learning-based methods offer improved performance by modeling these intricate dependencies. However, their high computational demand often prevents them from meeting the efficiency requirements of industrial applications, and their lack of interpretability further restricts their practicality. To overcome these limitations, an effective and efficient root cause localization method, KPIRoot, is proposed. It integrates the advantages of both similarity analysis and causality analysis, where similarity measures the trend alignment of KPIs and causality measures the sequential order of their variation. Furthermore, it leverages symbolic aggregate approximation (SAX) to produce a more compact representation of each KPI, enhancing the overall analysis efficiency of the approach. However, during the deployment of KPIRoot in the cloud systems of a large-scale cloud system vendor, we identified two additional drawbacks of KPIRoot: (1) the threshold-based anomaly detection method is insufficient for capturing all types of performance anomalies; (2) the SAX representation cannot capture intricate variation trends, which causes suboptimal root cause localization results. We propose KPIRoot+ to address these drawbacks. The experimental results show that KPIRoot+ outperforms eight state-of-the-art baselines by 2.9% to 35.7%, while time cost is reduced by 34.7%. Moreover, we share our experience of deploying KPIRoot in the production environment of a large-scale cloud provider.
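For readers unfamiliar with symbolic aggregate approximation, the technique named in the abstract, the following is a minimal generic sketch in Python. It is not the authors' implementation: the function name `sax`, its parameters, and the simplifying requirement that the series length be divisible by the number of segments are all assumptions made for illustration. It z-normalizes a KPI series, averages it over equal-width segments (piecewise aggregate approximation), and maps each segment mean to a letter using equiprobable breakpoints of the standard normal distribution.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet_size):
    """Return a SAX word for a numeric series (illustrative helper, hypothetical name)."""
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26")
    if n_segments <= 0 or len(series) % n_segments != 0:
        # Simplifying assumption: segments must have equal integer length.
        raise ValueError("series length must be a positive multiple of n_segments")

    # Step 1: z-normalize so that the Gaussian breakpoints are meaningful.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        z = [0.0] * len(series)  # a constant series carries no trend
    else:
        z = [(x - mu) / sigma for x in series]

    # Step 2: piecewise aggregate approximation (mean of each segment).
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # Step 3: breakpoints that split N(0, 1) into equiprobable regions.
    normal = NormalDist()
    breakpoints = [normal.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

    # Step 4: map each segment mean to the letter of the region it falls in.
    return "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in paa)


# Example: an 8-point series with a level shift, 2 segments, 4 letters.
print(sax([1, 1, 1, 1, 9, 9, 9, 9], 2, 4))  # prints "ad"
```

The resulting word is much shorter than the raw series, which is the source of the efficiency gain the abstract describes. The abstract also notes the cost: coarse symbols cannot capture intricate variation trends.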

Keywords

root cause localization, cloud system reliability, cloud monitoring, cloud metrics, cloud service systems

Discipline

Artificial Intelligence and Robotics

Research Areas

Intelligent Systems and Optimization

Areas of Excellence

Digital transformation

Publication

Empirical Software Engineering

Volume

Issue

First Page

3188

Last Page

3207

ISSN

1382-3256

Identifier

10.1109/TSE.2024.3475375

Publisher

Springer

Citation

GU, Wenwei; ZHONG, Renyi; YU, Guangba; SUN, Xinying; LIU, Jinyang; HUO, Yintong; CHEN, Zhuangbin; ZHANG, Jianping; GU, Jiazhen; YANG, Yongqiang; and LYU, Michael R.. KPIRoot+: An efficient integrated framework for anomaly detection and root cause analysis in large-scale cloud systems. (2025). Empirical Software Engineering. 50, (12), 3188-3207.
Available at: https://ink.library.smu.edu.sg/sis_research/11010

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/TSE.2024.3475375

Included in

Artificial Intelligence and Robotics Commons
