Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
5-2023
Abstract
Stop words, which are considered non-predictive, are often eliminated in natural language processing tasks. However, the definition of uninformative vocabulary is vague, so most algorithms use general knowledge-based stop lists to remove stop words. There is an ongoing debate among academics about the usefulness of stop word elimination, especially in domain-specific settings. In this work, we investigate the usefulness of stop word removal in a software engineering context. To do this, we replicate and experiment with three software engineering research tools from related work. Additionally, we construct a corpus of software engineering domain-related text from 10,000 Stack Overflow questions and identify 200 domain-specific stop words using traditional information-theoretic methods. Our results show that the use of domain-specific stop words significantly improved the performance of research tools compared to the use of a general stop list and that 17 out of 19 evaluation measures showed better performance.
Keywords
Natural Language Processing (NLP), Software Engineering Documents, Stop Words
Discipline
Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
Proceedings of the 2nd Workshop on Natural Language-based Software Engineering, 2023 May 20
First Page
40
Last Page
47
ISBN
9798350301786
Identifier
10.1109/NLBSE59153.2023.00016
Publisher
IEEE
City or Country
Los Alamitos, CA
Citation
FAN, Yaohou; ARORA, Chetan; and TREUDE, Christoph.
Stop words for processing software engineering documents: Do they matter? (2023). Proceedings of the 2nd Workshop on Natural Language-based Software Engineering, 2023 May 20. 40-47.
Available at: https://ink.library.smu.edu.sg/sis_research/8912
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/NLBSE59153.2023.00016