How to better utilize code graphs in semantic code search?
Publication Type
Conference Proceeding Article
Publication Date
11-2022
Abstract
Semantic code search greatly facilitates software reuse, which enables users to find code snippets highly matching user-specified natural language queries. Due to the rich expressive power of code graphs (e.g., control-flow graph and program dependency graph), both of the two mainstream research works (i.e., multi-modal models and pre-trained models) have attempted to incorporate code graphs for code modelling. However, they still have some limitations: First, there is still much room for improvement in terms of search effectiveness. Second, they have not fully considered the unique features of code graphs.In this paper, we propose a Graph-to-Sequence Converter, namely G2SC. Through converting the code graphs into lossless sequences, G2SC enables to address the problem of small graph learning using sequence feature learning and capture both the edges and nodes attribute information of code graphs. Thus, the effectiveness of code search can be greatly improved. In particular, G2SC first converts the code graph into a unique corresponding node sequence by a specific graph traversal strategy. Then, it gets a statement sequence by replacing each node with its corresponding statement. A set of carefully designed graph traversal strategies guarantee that the process is one-to-one and reversible. G2SC enables capturing rich semantic relationships (i.e., control flow, data flow, node/relationship properties) and provides learning model-friendly data transformation. It can be flexibly integrated with existing models to better utilize the code graphs. As a proof-of-concept application, we present two G2SC enabled models: GSMM (G2SC enabled multi-modal model) and GSCodeBERT (G2SC enabled CodeBERT model). Extensive experiment results on two real large-scale datasets demonstrate that GSMM and GSCodeBERT can greatly improve the state-of-the-art models MMAN and GraphCodeBERT by 92% and 22% on R@1, and 63% and 11.5% on MRR, respectively.
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, Singapore, 2022 November 14-18
First Page
722
Last Page
733
Identifier
10.1145/3540250.3549087
Publisher
Association for Computing Machinery
City or Country
New York
Citation
SHI, Yucen; YIN, Ying; WANG, Zhengkui; LO, David; ZHANG, Tao; XIA, Xin; ZHAO, Yuhai; and XU, Bowen.
How to better utilize code graphs in semantic code search?. (2022). Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Singapore, Singapore, 2022 November 14-18. 722-733.
Available at: https://ink.library.smu.edu.sg/sis_research/7734
Additional URL
https://doi.org/10.1145/3540250.3549087