Publication Type

Conference Proceeding Article

Version

Accepted version

Publication Date

5-2017

Abstract

To uncover interesting and actionable information from natural language documents authored by software developers, many researchers rely on "out-of-the-box" NLP libraries. However, software artifacts written in natural language are different from other textual documents due to the technical language used. In this paper, we first analyze the state of the art through a systematic literature review in which we find that only a small minority of papers justify their choice of an NLP library. We then report on a series of experiments in which we applied four state-of-the-art NLP libraries to publicly available software artifacts from three different sources. Our results show low agreement between different libraries (only between 60% and 71% of tokens were assigned the same part-of-speech tag by all four libraries) as well as differences in accuracy depending on the source: For example, spaCy achieved the best accuracy on Stack Overflow data with nearly 90% of tokens tagged correctly, while it was clearly outperformed by Google's SyntaxNet when parsing GitHub ReadMe files. Our work implies that researchers should make an informed decision about the particular NLP library they choose and that customizations to libraries might be necessary to achieve good results when analyzing software artifacts written in natural language.

Keywords

Natural language processing, NLP libraries, Part-of-Speech tagging, Software documentation

Discipline

Software Engineering

Research Areas

Software and Cyber-Physical Systems

Publication

Proceedings of the 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR): Buenos Aires, Argentina, May 20-21, 2017

First Page

187

Last Page

197

ISBN

978-1-5386-1544-7

Identifier

10.1109/MSR.2017.42

Publisher

IEEE Computer Society

City or Country

Los Alamitos, CA

Additional URL

https://doi.org/10.1109/MSR.2017.42
