Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
5-2023
Abstract
Machine learning (ML) has gained much attention and has been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been few attempts to filter those projects and curate ML projects of high quality. The limited availability of such a high-quality dataset poses an obstacle to understanding ML projects. To help clear this obstacle, we present NICHE, a manually labelled dataset consisting of 572 ML projects. Based on evidence of good software engineering practices, we label 441 of these projects as engineered and 131 as non-engineered. This dataset can help researchers understand the practices that are adopted in high-quality ML projects. It can also be used as a benchmark for classifiers designed to identify engineered ML projects.
Keywords
Daily lives, Engineered software project, High quality, Labeled dataset, Learning projects, Machine-learning, Open source platforms, Open source projects, Software engineering practices, Software project
Discipline
Computer and Systems Architecture | Databases and Information Systems | Software Engineering
Research Areas
Data Science and Engineering
Publication
Proceedings of the 20th IEEE/ACM International Conference on Mining Software Repositories, Melbourne, Australia, May 15-16
First Page
62
Last Page
66
ISBN
9798350311846
Identifier
10.1109/MSR59073.2023.00022
Publisher
IEEE
City or Country
New Jersey
Citation
WIDYASARI, Ratnadira; YANG, Zhou; THUNG, Ferdian; SIM, Sheng Qin; WEE, Fiona; LOK, Camellia; PHAN, Jack; QI, Haodi; TAN, Constance; and David LO.
NICHE: A curated dataset of engineered machine learning projects in Python. (2023). Proceedings of the 20th IEEE/ACM International Conference on Mining Software Repositories, Melbourne, Australia, May 15-16. 62-66.
Available at: https://ink.library.smu.edu.sg/sis_research/8570
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/MSR59073.2023.00022
Included in
Computer and Systems Architecture Commons, Databases and Information Systems Commons, Software Engineering Commons