Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
12-2025
Abstract
The nexus between data characteristics and parametric models is fundamental for developing effective and reliable artificial intelligence (AI) systems. In machine learning practice, mismatches in the properties of the data used for model development can degrade AI model performance. This paper proposes a Reliable Data Split (RDS) procedure that learns to select data points that generalise adequately to the target domain by employing prior knowledge of the data generative process. We introduce a reinforced selection strategy that uses deep reinforcement learning with diverse black-box predictors to maximise ensemble rewards as a proxy for model performance potential, while maintaining an appropriate proportionate allocation and the independent and identically distributed (i.i.d.) assumption. A comprehensive evaluation of the RDS procedure is conducted on four real-world datasets (Madelon, Drug Reviews, MNIST, and Kalapa Credit Scoring Challenge), covering machine learning tasks such as binary classification, multi-class classification, and regression on multivariate, textual, and visual data. The experimental results demonstrate consistent performance improvements of trainable data samples over classical or prior data selection. Hence, we advocate the use of RDS for data splitting in the early stage of machine learning tasks for parameter tuning, model selection, and overfitting prevention, as well as for sampling in large-scale AI competitions to search for the best possible and shift-stable solutions.
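The idea in the abstract of scoring a candidate split by an ensemble reward while keeping class proportions fixed can be illustrated with a minimal sketch. This is not the authors' implementation: plain random search over stratified candidates stands in for the deep reinforcement learning policy, two toy predictors (1-nearest-neighbour and nearest-centroid) stand in for the diverse black-box predictors, and every name below (`stratified_split`, `ensemble_reward`, `select_split`, the synthetic data) is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac, rng):
    """Split indices so each class keeps roughly train_frac of its points in train."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, held_out = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_frac)
        train.extend(shuffled[:cut])
        held_out.extend(shuffled[cut:])
    return sorted(train), sorted(held_out)


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def one_nn(train_X, train_y, x):
    """Toy predictor 1: label of the closest training point."""
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]


def nearest_centroid(train_X, train_y, x):
    """Toy predictor 2: label of the closest class mean."""
    groups = defaultdict(list)
    for point, label in zip(train_X, train_y):
        groups[label].append(point)
    centroids = {
        label: tuple(sum(coord) / len(points) for coord in zip(*points))
        for label, points in groups.items()
    }
    return min(centroids, key=lambda label: sq_dist(centroids[label], x))


PREDICTORS = (one_nn, nearest_centroid)  # assumed stand-ins for black-box models


def ensemble_reward(X, y, train, held_out):
    """Mean held-out accuracy across all predictors fitted on the train part."""
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    accuracies = []
    for predict in PREDICTORS:
        hits = sum(predict(train_X, train_y, X[i]) == y[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / len(accuracies)


def select_split(X, y, train_frac=0.75, n_candidates=20, seed=0):
    """Random search (an assumption, not RL): keep the highest-reward candidate."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_candidates):
        train, held_out = stratified_split(y, train_frac, rng)
        reward = ensemble_reward(X, y, train, held_out)
        if best is None or reward > best[0]:
            best = (reward, train, held_out)
    return best


# Synthetic two-class data with a 2:1 class imbalance (illustrative only).
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
X += [(data_rng.gauss(2, 1), data_rng.gauss(2, 1)) for _ in range(20)]
y = [0] * 40 + [1] * 20

reward, train_idx, held_out_idx = select_split(X, y)
```

In this sketch the proportionate allocation is enforced by construction, since every candidate is stratified, so the search only chooses among splits that already respect the class ratio. A learned selection policy, as described in the abstract, would replace the random candidate generator.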
Keywords
Artificial intelligence systems; Characteristic model; Data characteristics; Data properties; Learning tasks; Machine-learning; Model development; Model potential; Modeling performance; Parametric models
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Publication
Proceedings of Machine Learning Research: Reliable and Trustworthy Artificial Intelligence Workshop at 17th Asian Conference on Machine Learning, ACML 2025, Taipei, December 12
Volume
310
First Page
73
Last Page
89
Publisher
ML Research Press
City or Country
Taipei
Citation
Nguyen, Hoang D.; Vu, Xuan-Son; TRUONG, Quoc Tuan; and Le, Duc-Trong.
Reliable-Data-Split (RDS): Maximizing model potential with reinforced selection strategy. (2025). Proceedings of Machine Learning Research: Reliable and Trustworthy Artificial Intelligence Workshop at 17th Asian Conference on Machine Learning, ACML 2025, Taipei, December 12. 310, 73-89.
Available at: https://ink.library.smu.edu.sg/sis_research/11029
Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://proceedings.mlr.press/v310/nguyen25c.html