Conference Proceeding Article
GitHub is one of the largest and most popular repository hosting services today, with about 14 million users and more than 54 million repositories as of March 2017. This makes it an excellent platform for finding projects that developers are interested in exploring. GitHub showcases its most popular projects by manually cataloging them into categories such as DevOps tools, web application frameworks, and game engines. We propose that such cataloging should not be limited to popular projects. We explore the possibility of developing such a cataloging system by automatically extracting text segments that describe functionality from the readme files of GitHub repositories. These descriptions are then input to LDA-GA, a state-of-the-art topic modeling algorithm, to identify categories. Our preliminary experiments demonstrate that additional meaningful categories, which complement the existing GitHub categories, can be inferred. Moreover, for inferred categories that match GitHub categories, our approach can identify additional projects belonging to them. Our experimental results establish a promising direction toward realizing an automatic cataloging system for GitHub.
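The extraction step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how descriptive segments are located, so the heuristic below (take the first prose paragraph of a readme, skipping headings and badge lines) is purely an assumption for illustration and not the authors' method. The function name `extract_description` and the five-word threshold are hypothetical.

```python
import re


def extract_description(readme_text):
    """Return the first prose paragraph of a readme, or '' if none is found.

    Assumed heuristic for illustration only; not the paper's extraction method.
    """
    # Split the readme into paragraphs on blank lines.
    paragraphs = re.split(r"\n\s*\n", readme_text.strip())
    for para in paragraphs:
        lines = [ln.strip() for ln in para.splitlines() if ln.strip()]
        # Drop Markdown headings, badge/image lines, and setext underlines.
        lines = [
            ln for ln in lines
            if not ln.startswith("#")
            and not ln.startswith("![")
            and not ln.startswith("[![")
            and not re.fullmatch(r"[=\-]{3,}", ln)
        ]
        text = " ".join(lines)
        # Hypothetical threshold: require at least five words of prose.
        if len(text.split()) >= 5:
            return text
    return ""


sample = (
    "# FastServe\n\n"
    "[![build](https://example.com/badge.svg)](https://example.com)\n\n"
    "FastServe is a lightweight web application framework for building REST APIs.\n\n"
    "## Install\n\n"
    "pip install fastserve\n"
)
description = extract_description(sample)
```

In a pipeline like the one the abstract outlines, descriptions obtained this way would be collected across repositories and passed to a topic model to infer categories; the topic modeling step itself is not sketched here.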
GitHub, Latent Dirichlet Allocation, Genetic Algorithm
Programming Languages and Compilers | Theory and Algorithms
EASE'17 Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, New York, 2017 June 15-16
Association for Computing Machinery
SHARMA, Abhishek; THUNG, Ferdian; KOCHHAR, Pavneet Singh; SULISTYA, Agus; and LO, David. Cataloging GitHub repositories. (2017). EASE'17 Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, New York, 2017 June 15-16. 314-319. Research Collection School Of Information Systems.
Available at: http://ink.library.smu.edu.sg/sis_research/3716
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.