Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
5-2019
Abstract
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Like other software artifacts, code on SO evolves over time, for example when bugs are fixed or APIs are updated to the most recent version. To be able to analyze how code and the surrounding text on SO evolve, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text and code blocks. It connects code snippets from SO posts to other platforms by aggregating URLs from surrounding text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution and maintenance of code on SO and its relation to other platforms such as GitHub.
Keywords
Code snippets, GitHub, Open dataset, Software evolution, Stack Overflow
Discipline
Programming Languages and Compilers | Software Engineering
Research Areas
Software and Cyber-Physical Systems
Publication
Proceedings of the 16th International Conference on Mining Software Repositories, Montreal, Canada, 2019 May 26-27
First Page
191
Last Page
194
ISBN
9781728134123
Identifier
10.1109/MSR.2019.00038
Publisher
IEEE Computer Society
City or Country
Piscataway, NJ
Citation
BALTES, Sebastian; TREUDE, Christoph; and DIEHL, Stephan.
SOTorrent: Studying the origin, evolution, and usage of Stack Overflow code snippets. (2019). Proceedings of the 16th International Conference on Mining Software Repositories, Montreal, Canada, 2019 May 26-27. 191-194.
Available at: https://ink.library.smu.edu.sg/sis_research/8837
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
https://doi.org/10.1109/MSR.2019.00038