Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
9-2024
Abstract
Slow task detection is a critical problem in cloud operation and maintenance, since slow tasks directly affect user experience and can incur substantial liquidated damages. Most anomaly detection methods approach the problem from a single-task perspective. However, with millions of concurrent tasks in large-scale cloud computing clusters, this approach becomes impractical and inefficient. Moreover, single-task slowdowns are very common and do not necessarily indicate a cluster malfunction, because task durations fluctuate violently in a virtual environment. We therefore shift our attention to cluster-wide task slowdowns, using the distribution of task durations across a cluster so that the computational complexity does not depend on the number of tasks. The task duration distribution often exhibits compound periodicity and local exceptional fluctuations over time. Although transformer-based methods are among the most powerful for capturing such normal variation patterns in time series, we empirically find and theoretically explain a flaw of the standard attention mechanism in reconstructing low-amplitude subperiods when dealing with compound periodicity. To tackle these challenges, we propose SORN (i.e., Skimming Off subperiods in descending amplitude order and Reconstructing Non-slowing fluctuation), which consists of a Skimming Attention mechanism to reconstruct the compound periodicity and a Neural Optimal Transport module to distinguish cluster-wide slowdowns from other exceptional fluctuations. Furthermore, since anomalies in the training set are inevitable in practice, we propose a picky loss function, which adaptively assigns higher weights to reliable time slots in the training set. Extensive experiments demonstrate that SORN outperforms state-of-the-art methods on multiple real-world industrial datasets.
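To illustrate why working on the duration distribution keeps the model input independent of the number of tasks, the following is a minimal sketch, not the authors' implementation. It assumes tasks are grouped into fixed-length time slots by start time and that durations are binned with hand-chosen edges; the function name `duration_histograms` and the parameters `slot_seconds` and `bin_edges` are hypothetical and do not come from the paper.

```python
import numpy as np

def duration_histograms(start_times, durations, slot_seconds, bin_edges):
    """Return one normalized task-duration histogram per time slot.

    Illustrative assumption: the output has shape (n_slots, n_bins),
    so its size is fixed by the slot and bin choices, not by how many
    tasks ran in the cluster.
    """
    start_times = np.asarray(start_times, dtype=float)
    durations = np.asarray(durations, dtype=float)

    # Assign each task to a time slot by its start time.
    slots = np.floor(start_times / slot_seconds).astype(int)
    n_slots = int(slots.max()) + 1
    n_bins = len(bin_edges) - 1

    # Map each duration to a bin; out-of-range values go to the edge bins.
    bins = np.clip(np.digitize(durations, bin_edges) - 1, 0, n_bins - 1)

    # Count tasks per (slot, bin) pair.
    hist = np.zeros((n_slots, n_bins))
    np.add.at(hist, (slots, bins), 1.0)

    # Normalize each non-empty slot to a probability distribution.
    totals = hist.sum(axis=1, keepdims=True)
    return np.divide(hist, totals, out=np.zeros_like(hist), where=totals > 0)
```

Each row of the result is a point in a multivariate time series whose width depends only on the number of bins, which is the kind of input a cluster-wide detector could consume regardless of how many tasks were running.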
Keywords
Task slowdown detection, Time series, Unsupervised anomaly detection, AIOps, Anomaly detection, Cloud computing, Slow task detection
Discipline
Databases and Information Systems
Research Areas
Information Systems and Management; Intelligent Systems and Optimization
Publication
Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024) : Barcelona, Spain, August 25-29
First Page
266
Last Page
277
Identifier
10.1145/3637528.3671936
Publisher
Association for Computing Machinery
City or Country
Barcelona, Spain
Citation
CHEN, Feiyi; ZHANG, Yingying; FAN, Lunting; LIANG, Yuxuan; PANG, Guansong; WEN, Qingsong; and DENG, Shuiguang.
Cluster-wide task slowdown detection in cloud system. (2024). Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2024) : Barcelona, Spain, August 25-29. 266-277.
Available at: https://ink.library.smu.edu.sg/sis_research/9755
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1145/3637528.3671936