Publication Type

PhD Dissertation

Version

publishedVersion

Publication Date

9-2025

Abstract

The increasing scale of real-world graphs in domains such as fraud detection, community detection, and biological analysis demands high-throughput, memory-efficient graph processing solutions. GPUs offer massive parallelism for accelerating such workloads, and numerous frameworks have been developed to leverage their computational power. These frameworks primarily focus on optimizing scheduling to better align graph processing with GPU architectures. Such optimizations perform well for algorithms with low memory demands, such as BFS, SSSP, and PageRank. However, for algorithms that require substantial memory, such as label propagation and subgraph counting, the limited memory capacity of GPUs often becomes a significant bottleneck.

This dissertation addresses the challenge of reducing intermediate memory overhead in GPU-based graph processing through sketching, compression, and sampling techniques. These techniques are well-suited to a wide range of algorithms. In this dissertation, we select two representative examples, label propagation and subgraph counting, to demonstrate our contributions. Both algorithms are fundamental in their respective domains: label propagation is one of the most basic techniques for community detection, while subgraph counting is a core task in subgraph matching. Although both methods produce large intermediate results, the nature of these results differs. Therefore, we adopt sketching for label propagation and sampling for subgraph counting.

Our first contribution is GLP (GPU-accelerated Label Propagation), a GPU-accelerated label propagation framework that targets the problem of frequent label tracking under tight memory constraints. GLP incorporates Count-Min Sketch and compact GPU hash tables to track the most frequent labels, significantly reducing the required memory. We propose novel GPU-centric optimizations by leveraging the community structure and power-law properties of large graphs. GLP is evaluated in a real-world fraud detection system and demonstrates high accuracy and runtime efficiency, making it suitable for time-critical commercial applications.
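GLP's actual GPU kernels are described in the dissertation; purely as a language-agnostic illustration of the Count-Min Sketch idea it builds on, the following Python sketch shows how label frequencies can be tracked in memory sublinear in the number of distinct labels (the `width`, `depth`, and hash choices here are illustrative assumptions, not parameters from the dissertation):

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: a depth x width grid of counters.

    Each item is hashed to one counter per row; estimates take the
    minimum across rows, so collisions can only inflate counts
    (never deflate them).
    """

    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent hash per row, derived by salting with the row id.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Upper bound on the true frequency; tight when collisions are rare.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

In a label propagation setting, a vertex would feed its neighbors' labels into such a sketch and adopt the label with the highest estimated frequency, avoiding a per-vertex hash map over all distinct labels.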

Building upon this foundation, the second contribution introduces CGLP+ (Compressed GPU-accelerated Label Propagation), a memory-optimized framework for label propagation over compressed graphs. Traditional GPU frameworks assume full decompression of graph data prior to computation, which is often infeasible for large graphs. CGLP+ addresses this by enabling partial decompression and in-place computation on compressed adjacency lists, thereby eliminating the need to load the entire graph into memory. We propose a novel decoding strategy that achieves significant compression ratios while maintaining low decompression overhead. Experimental results show that CGLP+ achieves an average reduction of 88.4% in CPU-GPU transfer cost, enabling efficient processing of large graphs.
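CGLP+'s decoding strategy is GPU-specific and detailed in the dissertation; as a simplified CPU-side illustration of the general scheme such frameworks rely on, the sketch below compresses a sorted adjacency list with delta encoding plus variable-length byte codes, and streams neighbors back one at a time rather than materializing the full decompressed list (the function names and encoding details are illustrative assumptions):

```python
def varint_encode(nums):
    """Pack non-negative integers into 7-bit groups, high bit = continue."""
    out = bytearray()
    for n in nums:
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                break
    return bytes(out)

def encode_adjacency(neighbors):
    """Sort, delta-encode (small gaps), then varint-pack one adjacency list."""
    neighbors = sorted(neighbors)
    deltas = [neighbors[0]] + [b - a for a, b in zip(neighbors, neighbors[1:])]
    return varint_encode(deltas)

def decode_adjacency(buf):
    """Stream neighbors back without materializing an intermediate array."""
    prev, n, shift, first = 0, 0, 0, True
    for byte in buf:
        n |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:          # end of one varint
            prev = n if first else prev + n
            first = False
            yield prev
            n, shift = 0, 0
```

Because decoding is a generator, a consumer can process each neighbor and discard it, which is the spirit of computing in place on compressed adjacency lists instead of fully decompressing the graph first.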

The final contribution is gSWORD, a GPU-based subgraph counting framework. While GLP and CGLP+ leverage sketching and compression techniques to reduce memory consumption for the label propagation task, where the intermediate results are frequency counts, gSWORD targets the subgraph counting task, where the intermediate results consist of partial instances rather than numeric values. Thus, sampling methods are developed to estimate the number of instances. This framework leverages the massive parallelism of GPUs to accelerate iterative sampling algorithms for subgraph counting. Despite the embarrassingly parallel nature of the samples, there are unique challenges in accelerating subgraph counting due to its irregular computation logic.
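The sampling algorithms gSWORD accelerates are iterative and GPU-resident; as a minimal, CPU-side illustration of how sampling turns partial instances into a count estimate, the following Python sketch estimates triangle counts by wedge sampling (sample a length-2 path, test whether it closes, and rescale). This classic estimator stands in for the dissertation's algorithms; it is not gSWORD's method:

```python
import random

def estimate_triangles(adj, num_samples=1000, seed=0):
    """Wedge-sampling triangle estimator.

    adj: dict mapping each vertex to a set of its neighbors.
    Samples wedges (paths a-v-b) proportionally to their count per
    center vertex, checks closure, and rescales: each triangle
    contains exactly 3 wedges.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    # Number of wedges centered at v: C(deg(v), 2).
    wedge_counts = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in nodes]
    total_wedges = sum(wedge_counts)
    if total_wedges == 0:
        return 0.0
    closed = 0
    for _ in range(num_samples):
        v = rng.choices(nodes, weights=wedge_counts)[0]
        a, b = rng.sample(sorted(adj[v]), 2)  # uniform wedge at v
        if b in adj[a]:                        # does the wedge close?
            closed += 1
    return closed / num_samples * total_wedges / 3
```

Each sample is independent, which is why such estimators look embarrassingly parallel; the irregularity the dissertation addresses comes from the highly variable per-sample work on real graphs.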

To address these challenges, we introduce two GPU-centric optimizations: (1) sample inheritance, enabling threads to inherit samples from neighboring threads to avoid idling, and (2) warp streaming, effectively distributing workloads among threads through a streaming process. Moreover, we propose a CPU-GPU co-processing pipeline that overlaps the sampling and enumeration processes to mitigate the underestimation issue. Experimental results demonstrate that state-of-the-art sampling algorithms deployed on gSWORD can draw millions of samples per second. With negligible overhead, the co-processing pipeline substantially improves estimation accuracy in cases where existing methods suffer severe underestimation.

Together, these contributions advance memory-efficient graph processing on GPUs, enabling more scalable, real-time processing with strong performance in both memory use and speed across diverse datasets.

Keywords

Graph processing, GPU computing

Degree Awarded

PhD (Computer Science)

Discipline

Databases and Information Systems

Supervisor(s)

LI, Yuchen

First Page

1

Last Page

133

Publisher

Singapore Management University

City or Country

Singapore

Copyright Owner and License

Author
