Publication Type
Journal Article
Version
publishedVersion
Publication Date
3-2024
Abstract
The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to grow, it highlights the need for a powerful Stack Overflow post representation model and drives researchers' interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural networks and the transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill this research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) on a wide range of related tasks (i.e., tag recommendation, relatedness prediction, and API recommendation). The results show that Post2Vec cannot further improve the SOTA techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general-domain language models (RoBERTa, Longformer, and GPT2) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the "No Silver Bullet" concept, as none of the models consistently wins against all the others.
Inspired by these findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with textual artifacts from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and significantly increase the SOTA performance on all the downstream tasks.
Keywords
Computing methodologies, Knowledge representation and reasoning, Software and its engineering, Software development process management
Discipline
Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
ACM Transactions on Software Engineering and Methodology
Volume
33
Issue
3
First Page
1
Last Page
24
ISSN
1049-331X
Identifier
10.1145/3635711
Publisher
Association for Computing Machinery (ACM)
Citation
HE, Junda; ZHOU, Xin; XU, Bowen; ZHANG, Ting; KIM, Kisub; YANG, Zhou; THUNG, Ferdian; IRSAN, Ivana Clairine; and LO, David.
Representation learning for Stack Overflow posts: How far are we?. (2024). ACM Transactions on Software Engineering and Methodology. 33, (3), 1-24.
Available at: https://ink.library.smu.edu.sg/sis_research/9232
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3635711