Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
5-2020
Abstract
Software developers use a mix of source code and natural language text to communicate with each other: Stack Overflow and developer mailing lists abound with this mixed text. Tagging this mixed text is essential for making progress on two seminal software engineering problems: traceability, and reuse via precise extraction of code snippets from mixed text. In this paper, we borrow code-switching techniques from Natural Language Processing and adapt them to mixed text to solve two problems: language identification and token tagging. Our technique, POSIT, simultaneously provides abstract syntax tree tags for source code tokens and part-of-speech tags for natural language words, and predicts the source language of each token in mixed text. To realize POSIT, we trained a biLSTM network with a Conditional Random Field output layer using abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger. POSIT improves the state-of-the-art on language identification by 10.6% and PoS/AST tagging by 23.7% in accuracy.
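The abstract mentions a Conditional Random Field output layer on top of a biLSTM. As a rough illustration of what such a layer does at prediction time, the sketch below runs Viterbi decoding over per-token tag scores. It is not POSIT's implementation: the function name `viterbi_decode`, the three-tag set (`VB`, `RB`, `CODE`), and every score are hypothetical values chosen by hand, where a trained model would supply emission scores from the biLSTM and learned transition scores.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi_decode(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: Sequence[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for a linear-chain CRF.

    emissions[i][t] is the score of tag t at position i (hypothetical
    stand-ins for biLSTM outputs); transitions[(a, b)] is the score of
    moving from tag a to tag b. Missing entries score 0.0.
    """
    if not emissions:
        return []
    # best[t]: score of the best path that ends in tag t at the current position
    best = {t: emissions[0].get(t, 0.0) for t in tags}
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Pick the previous tag that maximises path score plus transition score
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = (
                best[prev] + transitions.get((prev, cur), 0.0) + scores.get(cur, 0.0)
            )
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy mixed-text input: an English verb, a code token, an English adverb.
tokens = ["call", "foo()", "first"]
tags = ["VB", "RB", "CODE"]  # hypothetical, heavily reduced tag set
emissions = [  # hand-picked scores, not model output
    {"VB": 2.0, "RB": 0.5, "CODE": 1.0},
    {"VB": 0.2, "RB": 0.1, "CODE": 3.0},
    {"VB": 0.5, "RB": 1.5, "CODE": 0.8},
]
transitions = {("VB", "CODE"): 0.3, ("CODE", "CODE"): 0.5, ("CODE", "RB"): 0.2}

print(list(zip(tokens, viterbi_decode(emissions, transitions, tags))))
# → [('call', 'VB'), ('foo()', 'CODE'), ('first', 'RB')]
```

Decoding the whole sequence at once, instead of picking the best tag per token, lets transition scores influence the result, which is the usual motivation for placing a CRF layer over a recurrent encoder.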
Keywords
Code-switching, Language identification, Mixed-code, Part-of-speech tagging
Discipline
Programming Languages and Compilers | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
Proceedings of the 42nd International Conference on Software Engineering, Seoul, South Korea, 2020, May 23-29
First Page
1348
Last Page
1358
ISBN
9781450371216
Identifier
10.1145/3377811.3380440
Publisher
ACM
City or Country
New York
Citation
PÂRȚACHI, Profir-Petru; DASH, Santanu; TREUDE, Christoph; and BARR, Earl T.
POSIT: Simultaneously tagging natural and programming languages. (2020). Proceedings of the 42nd International Conference on Software Engineering, Seoul, South Korea, 2020, May 23-29. 1348-1358.
Available at: https://ink.library.smu.edu.sg/sis_research/8907
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1145/3377811.3380440