Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
6-2017
Abstract
GitHub is one of the largest and most popular repository hosting services today, with about 14 million users and more than 54 million repositories as of March 2017. This makes it an excellent platform for finding projects that developers are interested in exploring. GitHub showcases its most popular projects by cataloging them manually into categories such as DevOps tools, web application frameworks, and game engines. We propose that such cataloging should not be limited to popular projects. We explore the possibility of developing such a cataloging system by automatically extracting functionality-descriptive text segments from the readme files of GitHub repositories. These descriptions are then input to LDA-GA, a state-of-the-art topic modeling algorithm, to identify categories. Our preliminary experiments demonstrate that additional meaningful categories, which complement existing GitHub categories, can be inferred. Moreover, for inferred categories that match GitHub categories, our approach can identify additional projects belonging to them. Our experimental results establish a promising direction toward realizing an automatic cataloging system for GitHub.
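The two-step pipeline described in the abstract (extract a descriptive segment from each readme, then infer categories with a topic model) can be illustrated with a minimal sketch. It is not the authors' implementation. The readme texts, the first-prose-paragraph heuristic in `extract_description`, the stopword list, and all parameter values are assumptions made for illustration, and plain LDA fitted by collapsed Gibbs sampling stands in for LDA-GA; the genetic-algorithm component that the LDA-GA name and the keywords indicate is omitted.

```python
import random
import re
from collections import Counter

STOPWORDS = {"and", "for", "the", "with", "that", "this", "are", "can"}

def extract_description(readme: str) -> str:
    """Hypothetical heuristic: return the first prose paragraph of a readme."""
    for block in re.split(r"\n\s*\n", readme):
        if "```" in block:  # skip fenced code
            continue
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Drop headings, setext underlines, images, and badges.
        prose = [ln for ln in lines if not ln.startswith(("#", "=", "![", "[!["))]
        text = " ".join(prose)
        if len(text.split()) >= 4:
            return text
    return ""

def tokenize(text: str) -> list:
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) > 2 and w not in STOPWORDS]

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Plain LDA via collapsed Gibbs sampling (a stand-in, not LDA-GA)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                       # tokens in topic k
    assign = []
    for d, doc in enumerate(docs):                     # random initial assignment
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]                       # remove the current assignment
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k                       # record the resampled topic
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = [[w for w, c in tw.most_common(5) if c > 0] for tw in topic_word]
    return theta, top_words

# Hypothetical readme files, invented for this example.
READMES = {
    "repo-a": "# FastServe\n\n[![build](https://example.invalid/b.svg)](https://example.invalid)\n\n"
              "FastServe is a lightweight web application framework for building REST services.",
    "repo-b": "# Routely\n\nRoutely is a web framework with routing and templates for web applications.",
    "repo-c": "# Pixel2D\n\nPixel2D is a game engine with sprite rendering and physics for games.",
    "repo-d": "# Voxelo\n\nVoxelo is a game engine offering voxel rendering, physics, and audio.",
}

descriptions = {name: extract_description(text) for name, text in READMES.items()}
docs = [tokenize(text) for text in descriptions.values()]
theta, top_words = lda_gibbs(docs, n_topics=2)

for name, dist in zip(descriptions, theta):
    category = max(range(len(dist)), key=dist.__getitem__)
    print(name, "-> category", category, top_words[category])
```

Each repository is assigned to the topic with the highest weight in its document-topic distribution, and the top words of that topic serve as a rough category label. On a corpus this small the inferred topics are only illustrative.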
Keywords
GitHub, Latent Dirichlet Allocation, Genetic Algorithm
Discipline
Programming Languages and Compilers | Theory and Algorithms
Research Areas
Cybersecurity
Publication
EASE'17 Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, New York, 2017 June 15-16
First Page
314
Last Page
319
ISBN
9781450348041
Identifier
10.1145/3084226.3084287
Publisher
Association for Computing Machinery
City or Country
Karlskrona
Citation
SHARMA, Abhishek; THUNG, Ferdian; KOCHHAR, Pavneet Singh; SULISTYA, Agus; and LO, David.
Cataloging GitHub repositories. (2017). EASE'17 Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, New York, 2017 June 15-16. 314-319.
Available at: https://ink.library.smu.edu.sg/sis_research/3716
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
http://doi.org/10.1145/3084226.3084287