Publication Type
Journal Article
Version
submittedVersion
Publication Date
11-2023
Abstract
With the rising awareness of data assets, data governance, which is to understand where data comes from, how it is collected, and how it is used, has been assuming ever-growing importance. One critical component of data governance gaining increasing attention is auditing machine learning models to determine whether specific data has been used for training. Existing auditing techniques, such as shadow auditing methods, have shown feasibility under specific conditions such as having access to label information and knowledge of training protocols. However, these conditions are often not met in real-world applications. In this paper, we introduce a practical framework for auditing data provenance based on a differential mechanism: after a carefully designed transformation, perturbed input data from the target model's training set results in much more drastic changes in the output than perturbed input data from outside the training set. Our framework is data-dependent and does not require distinguishing training data from non-training data or training additional shadow models with labeled output data. Furthermore, our framework extends beyond point-based data auditing to group-based data auditing, aligning with the needs of real-world applications. Our theoretical analysis of the differential mechanism and experimental results on real-world data sets verify the proposal's effectiveness. The code has been uploaded to an anonymous link.
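The differential idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify the transformation, the distance measure, or the decision rule, so everything below is an assumption made for illustration. `toy_model` is a hypothetical stand-in for a model that is sharply peaked around memorized points, `shift_transform` is an assumed fixed-offset perturbation, and `output_shift` and `group_score` are hypothetical helper names. The sketch only shows the comparison itself: measure how far the model's output moves when an input is perturbed, then average that movement over a group of records.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]

# Hypothetical points the toy model has "memorized" (stand-in for training data).
MEMORIZED = [(0.0, 0.0), (1.0, 1.0)]


def toy_model(x: Vector) -> List[float]:
    """Hypothetical two-class model whose confidence peaks at memorized points."""
    nearest_sq = min(sum((a - b) ** 2 for a, b in zip(x, m)) for m in MEMORIZED)
    peak = math.exp(-nearest_sq / 0.04)  # 1.0 on a memorized point, ~0 far away
    return [0.5 + 0.5 * peak, 0.5 - 0.5 * peak]


def shift_transform(x: Vector) -> List[float]:
    """Assumed perturbation: a small fixed offset added to every feature."""
    return [v + 0.1 for v in x]


def output_shift(
    model: Callable[[Vector], List[float]],
    x: Vector,
    transform: Callable[[Vector], List[float]],
) -> float:
    """Euclidean distance between the model output on x and on transform(x)."""
    before, after = model(x), model(transform(x))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(before, after)))


def group_score(
    model: Callable[[Vector], List[float]],
    group: Sequence[Vector],
    transform: Callable[[Vector], List[float]],
) -> float:
    """Mean output shift over a group of records (group-based view)."""
    return sum(output_shift(model, x, transform) for x in group) / len(group)


members = [(0.0, 0.0), (1.0, 1.0)]       # records the toy model memorized
non_members = [(5.0, 5.0), (-3.0, 4.0)]  # records far from anything memorized

print("member group score:    ", group_score(toy_model, members, shift_transform))
print("non-member group score:", group_score(toy_model, non_members, shift_transform))
```

In this toy setting the member group receives the larger score, which mirrors the behavior the abstract describes. A real model need not separate the two groups this cleanly, and how a score is turned into an audit decision is left open here because the abstract does not state it.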
Keywords
Data models, training data, biological system modeling, computational modeling, predictive models, machine learning
Discipline
Databases and Information Systems | Data Storage Systems | Numerical Analysis and Scientific Computing
Research Areas
Data Science and Engineering
Publication
IEEE Transactions on Knowledge and Data Engineering
First Page
1
Last Page
12
ISSN
1041-4347
Identifier
10.1109/TKDE.2023.3334821
Publisher
IEEE
Citation
MU, Xin; PANG, Ming; and ZHU, Feida. Data provenance via differential auditing. (2023). IEEE Transactions on Knowledge and Data Engineering. 1-12.
Available at: https://ink.library.smu.edu.sg/sis_research/7808
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Additional URL
https://doi.org/10.1109/TKDE.2023.3334821
Included in
Databases and Information Systems Commons, Data Storage Systems Commons, Numerical Analysis and Scientific Computing Commons