Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
9-2007
Abstract
Data quality is a serious concern in every data management application, and a variety of quality measures, including accuracy, freshness and completeness, have been proposed to capture the common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and discuss a promising direction of research to quantify database column heterogeneity based on a novel combination of cluster entropy and soft clustering. Finally, we present a few preliminary experimental results, using diverse data sets of semantically different types, to demonstrate that this approach appears to provide a robust mechanism for identifying and quantifying database column heterogeneity.
Discipline
Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
Proceedings of the first international VLDB workshop on Clean Databases, Seoul, Korea, 2006 September 11
First Page
1
Last Page
4
Publisher
VLDB Endowment
City or Country
Stanford, CA
Citation
DAI, Bing Tian; KOUDAS, Nick; OOI, Beng Chin; SRIVASTAVA, Divesh; and VENKATASUBRAMANIAN, Suresh.
Column heterogeneity as a measure of data quality. (2007). Proceedings of the first international VLDB workshop on Clean Databases, Seoul, Korea, 2006 September 11. 1-4.
Available at: https://ink.library.smu.edu.sg/sis_research/4165
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.