CrowdLink: An Error-Tolerant Model for Linking Complex Records
Publication Type
Conference Proceeding Article
Publication Date
5-2015
Abstract
Record linkage (RL) is the task of finding records that refer to the same entity across different data sources (e.g., data files, books, websites, databases); it is a long-standing challenge in database management. Algorithmic approaches have been proposed to improve RL quality, but they remain far from perfect. Crowdsourcing offers a more accurate but expensive (and slow) way to bring human insight into the process. In this paper, we propose a new probabilistic model, CrowdLink, to tackle the above limitations. In particular, our model gracefully handles crowd error and the correlation among different pairs, and it enables us to decompose records into small pieces (i.e., attributes) that crowdsourcing workers can easily verify. Further, we develop efficient and effective algorithms to select the most valuable questions, in order to reduce the monetary cost of crowdsourcing. We conducted extensive experiments on both synthetic and real-world datasets. The experimental results verified the effectiveness and applicability of our model.
Discipline
Databases and Information Systems
Publication
ExploreDB '15 Proceedings of the Second International Workshop on Exploratory Search in Databases and the Web
First Page
15
Last Page
20
ISBN
9781450337403
Identifier
10.1145/2795218.2795222
Publisher
ACM
City or Country
New York, NY, USA
Citation
ZHANG, Chen Jason; MENG, Rui; CHEN, Lei; and ZHU, Feida.
CrowdLink: An Error-Tolerant Model for Linking Complex Records. (2015). ExploreDB '15 Proceedings of the Second International Workshop on Exploratory Search in Databases and the Web. 15-20.
Available at: https://ink.library.smu.edu.sg/sis_research/3136