Publication Type
Journal Article
Version
acceptedVersion
Publication Date
1-2022
Abstract
Automatic code documentation generation has been a crucial task in the field of software engineering. It not only relieves developers of writing code documentation but also helps them understand programs better. Specifically, deep-learning-based techniques that leverage large-scale source code corpora have been widely used for code documentation generation. These works tend to use automatic metrics (such as BLEU, METEOR, ROUGE, CIDEr, and SPICE) to evaluate different models. These metrics compare generated documentation to reference texts by measuring word overlap. Unfortunately, there is no evidence demonstrating that these metrics correlate with human judgment. We conduct experiments on two popular code documentation generation tasks, code comment generation and commit message generation, to investigate the presence or absence of correlations between these metrics and human judgments. For each task, we replicate three state-of-the-art approaches and evaluate the generated documentation automatically in terms of BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. We also ask 24 participants to rate the generated documentation with respect to three aspects (i.e., language, content, and effectiveness). Each participant is given Java methods or commit diffs along with the target documentation to be rated. The results show that the ranking of generated documentation produced by automatic metrics differs from the ranking given by human annotators. Thus, these automatic metrics are not reliable enough to replace human evaluation for code documentation generation tasks. In addition, METEOR shows the strongest correlation with the human evaluation metrics (a moderate Pearson correlation, r of about 0.7). However, this is still much lower than the correlation observed between different annotators (a high Pearson correlation, r of about 0.8) and the correlations reported in the literature for other tasks (e.g., Neural Machine Translation [39]). Our study points to the need to develop specialized automated evaluation metrics that correlate more closely with human evaluation for code documentation generation tasks.
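As a minimal sketch (not the authors' actual evaluation pipeline), the kind of analysis described in the abstract can be illustrated by scoring generated documentation with a sentence-level automatic metric and computing the Pearson correlation against human ratings; the example below assumes NLTK's BLEU and SciPy's pearsonr, and the tokenized references, outputs, and ratings are hypothetical.

# Minimal sketch: correlate an automatic metric (sentence-level BLEU) with human ratings.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import pearsonr

# Hypothetical tokenized reference comments, model outputs, and 1-5 human ratings.
refs = [["returns", "the", "maximum", "of", "two", "values"],
        ["opens", "the", "given", "file"],
        ["checks", "whether", "the", "list", "is", "empty"]]
hyps = [["return", "the", "max", "of", "two", "numbers"],
        ["open", "a", "file"],
        ["checks", "if", "list", "is", "empty"]]
human = [4.0, 2.5, 4.5]

smooth = SmoothingFunction().method1
bleu = [sentence_bleu([r], h, smoothing_function=smooth) for r, h in zip(refs, hyps)]

# Pearson correlation between the automatic metric and the human judgments.
r, p = pearsonr(bleu, human)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")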
Keywords
Code Documentation Generation, Evaluation Metrics, Empirical Study
Discipline
Databases and Information Systems | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
ACM Transactions on Software Engineering and Methodology
Volume
31
Issue
4
First Page
2
Last Page
28
ISSN
1049-331X
Identifier
10.1145/3502853
Publisher
Association for Computing Machinery (ACM)
Citation
HU, Xing; CHEN, Qiuyuan; WANG, Haoye; XIA, Xin; LO, David; and ZIMMERMANN, Thomas.
Correlating automated and human evaluation of code documentation generation quality. (2022). ACM Transactions on Software Engineering and Methodology. 31, (4), 2-28.
Available at: https://ink.library.smu.edu.sg/sis_research/7664
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1145/3502853