Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

6-2017

Abstract

GitHub is one of the largest and most popular repository hosting services today, with about 14 million users and more than 54 million repositories as of March 2017. This makes it an excellent platform for finding projects that developers are interested in exploring. GitHub showcases its most popular projects by manually cataloging them into categories such as DevOps tools, web application frameworks, and game engines. We propose that such cataloging should not be limited to popular projects. We explore the possibility of developing such a cataloging system by automatically extracting functionality-descriptive text segments from the readme files of GitHub repositories. These descriptions are then input to LDA-GA, a state-of-the-art topic modeling algorithm, to identify categories. Our preliminary experiments demonstrate that additional meaningful categories that complement existing GitHub categories can be inferred. Moreover, for inferred categories that match GitHub categories, our approach can identify additional projects belonging to them. Our experimental results establish a promising direction toward realizing an automatic cataloging system for GitHub.

Keywords

GitHub, Latent Dirichlet Allocation, Genetic Algorithm

Discipline

Programming Languages and Compilers | Theory and Algorithms

Research Areas

Cybersecurity

Publication

EASE'17 Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering, New York, 2017 June 15-16

First Page

314

Last Page

319

ISBN

9781450348041

Identifier

10.1145/3084226.3084287

Publisher

Association for Computing Machinery

City or Country

Karlskrona

Additional URL

http://doi.org/10.1145/3084226.3084287
