Publication Type
Conference Proceeding Article
Version
acceptedVersion
Publication Date
3-2023
Abstract
GitHub is one of the most popular platforms for version control and collaboration. In GitHub, developers are able to assign related topics to their repositories, which is helpful for finding similar repositories. The topics that are assigned to repositories are varied and provide salient descriptions of the repository; some topics describe the technology employed in a project, while others describe functionality of the project, its goals, and its features. Topics are part of the metadata of a repository and are useful for the organization and discoverability of the repository. However, the number of topics is large and this makes it challenging to assign a relevant set of topics to a repository. While prior studies filter out infrequently occurring topics before their experiments, we find that these topics form the majority of the data. In this study, we try to address the problem of identifying the topics from a GitHub repository by treating it as an extreme multi-label learning (XML) problem. We collect data of 21K GitHub repositories containing 37K labels of topics. The main challenge for XML is a large number of possible labels and severe data sparsity which fit the challenge of identification of topics from the GitHub repository. We evaluate multiple XML techniques, such as Parabel, Bonsai, LightXML, and ZestXML. We then perform an analysis of the different models proposed for XML classification. The best results on all the metrics from XML models are from ZestXML which is a combination of zero-shot and XML. We also compare the performance of ZestXML with a baseline from a recent study. The results show that ZestXML improves the baseline in terms of the average F1-score by 17.35%. We also find that for the repositories that have topics that rarely appear in the repositories used during training, ZestXML improves the performance greatly. The average of F1-score is 3 times higher as compared to the baseline for the topics with 20 or less occurrences in training data.
Keywords
Multi-label classification, Extreme multi-label learning, Topic recommendation, GitHub repositories
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Data Science and Engineering; Information Systems and Management; Intelligent Systems and Optimization
Publication
IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2023
ISBN
9781665452786
Identifier
10.1109/SANER56733.2023.00025
City or Country
Taipa, Macao
Citation
WIDYASARI, Ratnadira; ZHAO, Zhipeng; CONG, Thanh Le; KANG, Hong Jin; and LO, David.
Topic recommendation for GitHub repositories: How far can extreme multi-label learning go? (2023). IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2023.
Available at: https://ink.library.smu.edu.sg/sis_research/8576
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.