Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

12-2025

Abstract

The nexus between data characteristics and parametric models is fundamental to developing effective and reliable artificial intelligence (AI) systems. Mismatches in the properties of the data used for model development can degrade AI model performance in machine learning practice. This paper proposes a Reliable Data Split (RDS) procedure that learns to select data points that generalise adequately to the target domain by employing prior knowledge of the data generative process. We introduce a reinforced selection strategy that uses deep reinforcement learning with diverse black-box predictors to maximise ensemble rewards, a proxy for model performance potential, while maintaining an appropriate proportionate allocation and the independent and identically distributed (i.i.d.) assumption. The RDS procedure is evaluated comprehensively on four real-world datasets (Madelon, Drug Reviews, MNIST, and the Kalapa Credit Scoring Challenge), covering machine learning tasks such as binary classification, multi-class classification, and regression on multivariate, textual, and visual data. The experimental results demonstrate consistent performance improvements of trainable data samples over classical or prior data selection. Hence, we advocate the use of RDS for data splitting in the early stage of machine learning tasks for parameter tuning, model selection, and overfitting prevention, as well as for sampling in large-scale AI competitions when searching for the best possible and shift-stable solutions.
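The reinforced selection idea summarised in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it swaps deep reinforcement learning for a plain REINFORCE update over per-sample inclusion logits, uses two toy predictors (nearest centroid and 1-nearest-neighbour) in place of the diverse black-box predictors, and enforces the proportionate allocation only softly through a penalty term. All names here (`reinforce_split`, `ensemble_reward`, `make_toy_data`, `penalty`) are hypothetical and chosen for illustration.

```python
import math
import random

def centroid_predictor(train):
    """Fit a nearest-centroid classifier on (x, y) pairs with scalar x."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    cents = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(cents, key=lambda y: abs(x - cents[y]))

def nearest_neighbour_predictor(train):
    """Fit a 1-nearest-neighbour classifier on (x, y) pairs with scalar x."""
    return lambda x: min(train, key=lambda p: abs(x - p[0]))[1]

# Stand-ins for the "diverse black box predictors" (an assumption of this sketch).
PREDICTORS = [centroid_predictor, nearest_neighbour_predictor]

def ensemble_reward(train, test):
    """Mean held-out accuracy across all predictors, used as the proxy reward."""
    if not train or not test:
        return 0.0
    accs = []
    for fit in PREDICTORS:
        model = fit(train)
        accs.append(sum(model(x) == y for x, y in test) / len(test))
    return sum(accs) / len(accs)

def reinforce_split(data, train_ratio=0.7, episodes=200, lr=0.5, penalty=1.0, seed=0):
    """Learn per-sample inclusion logits with REINFORCE; return the best split seen."""
    rng = random.Random(seed)
    # Start every sample at the target inclusion probability.
    logits = [math.log(train_ratio / (1 - train_ratio))] * len(data)
    baseline = 0.0
    best_mask, best_reward = None, -math.inf
    for _ in range(episodes):
        probs = [1 / (1 + math.exp(-z)) for z in logits]
        mask = [rng.random() < p for p in probs]  # True means "goes to train"
        train = [d for d, m in zip(data, mask) if m]
        test = [d for d, m in zip(data, mask) if not m]
        # Soft constraint: penalise deviation from the requested train proportion.
        ratio_gap = abs(len(train) / len(data) - train_ratio)
        reward = ensemble_reward(train, test) - penalty * ratio_gap
        if reward > best_reward:
            best_mask, best_reward = mask, reward
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)  # moving-average baseline
        # Gradient of the Bernoulli log-probability w.r.t. the logit is (action - p).
        logits = [z + lr * advantage * (m - p) for z, m, p in zip(logits, mask, probs)]
    return best_mask, best_reward

def make_toy_data(n, seed=0):
    """Two overlapping Gaussian classes on a line (synthetic, for illustration only)."""
    rng = random.Random(seed)
    return [(rng.gauss(3.0 * (i % 2), 1.0), i % 2) for i in range(n)]

if __name__ == "__main__":
    data = make_toy_data(40, seed=1)
    mask, reward = reinforce_split(data, episodes=50)
    print(f"train size: {sum(mask)} of {len(data)}, reward: {reward:.3f}")
```

The penalty term is one simple way to keep the split near the requested proportion. The paper may enforce the allocation and the i.i.d. assumption differently, so treat this only as a reading aid for the abstract.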

Keywords

Artificial intelligence systems; Characteristic model; Data characteristics; Data properties; Learning tasks; Machine-learning; Model development; Model potential; Modeling performance; Parametric models

Discipline

Artificial Intelligence and Robotics | Databases and Information Systems

Publication

Proceedings of Machine Learning Research: Reliable and Trustworthy Artificial Intelligence Workshop at 17th Asian Conference on Machine Learning, ACML 2025, Taipei, December 12

Volume

310

First Page

73

Last Page

89

Publisher

ML Research Press

City or Country

Taipei

Additional URL

https://proceedings.mlr.press/v310/nguyen25c.html
