Publication Type

Conference Proceeding Article

Version

acceptedVersion

Publication Date

3-2023

Abstract

GitHub is one of the most popular platforms for version control and collaboration. In GitHub, developers are able to assign related topics to their repositories, which is helpful for finding similar repositories. The topics that are assigned to repositories are varied and provide salient descriptions of the repository; some topics describe the technology employed in a project, while others describe functionality of the project, its goals, and its features. Topics are part of the metadata of a repository and are useful for the organization and discoverability of the repository. However, the number of topics is large and this makes it challenging to assign a relevant set of topics to a repository. While prior studies filter out infrequently occurring topics before their experiments, we find that these topics form the majority of the data. In this study, we try to address the problem of identifying the topics from a GitHub repository by treating it as an extreme multi-label learning (XML) problem. We collect data of 21K GitHub repositories containing 37K labels of topics. The main challenge for XML is a large number of possible labels and severe data sparsity which fit the challenge of identification of topics from the GitHub repository. We evaluate multiple XML techniques, such as Parabel, Bonsai, LightXML, and ZestXML. We then perform an analysis of the different models proposed for XML classification. The best results on all the metrics from XML models are from ZestXML which is a combination of zero-shot and XML. We also compare the performance of ZestXML with a baseline from a recent study. The results show that ZestXML improves the baseline in terms of the average F1-score by 17.35%. We also find that for the repositories that have topics that rarely appear in the repositories used during training, ZestXML improves the performance greatly. The average of F1-score is 3 times higher as compared to the baseline for the topics with 20 or less occurrences in training data.

Keywords

Multi-label classification, Extreme multi-label learning, Topic recommendation, GitHub repositories

Discipline

Artificial Intelligence and Robotics | Databases and Information Systems

Research Areas

Data Science and Engineering; Information Systems and Management; Intelligent Systems and Optimization

Publication

IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2023

ISBN

9781665452786

Identifier

10.1109/SANER56733.2023.00025

City or Country

Taipa, Macao
