CollaborEM: A self-supervised entity matching framework using multi-features collaboration
Publication Type
Journal Article
Publication Date
12-2023
Abstract
Entity Matching (EM) aims to identify whether two tuples refer to the same real-world entity and is well known to be labor-intensive. It is a prerequisite to anomaly detection, as comparing the attribute values of two matched tuples from two different datasets provides one effective way to detect anomalies. Existing EM approaches, due to insufficient feature discovery or error-prone inherent characteristics, are not able to achieve stable performance. In this paper, we present CollaborEM, a self-supervised entity matching framework via multi-features collaboration. It is capable of (i) obtaining reliable EM results with zero human annotations and (ii) discovering adequate features of tuples in a fault-tolerant manner. CollaborEM consists of two phases, i.e., automatic label generation (ALG) and collaborative EM training (CEMT). In the first phase, ALG is proposed to generate a set of positive tuple pairs and a set of negative tuple pairs. ALG guarantees the high quality of the generated tuple pairs, and hence ensures the training quality of the subsequent CEMT. In the second phase, CEMT is introduced to learn the matching signals by discovering graph features and sentence features of tuples collaboratively. Extensive experimental results over eight real-world EM benchmarks show that CollaborEM outperforms all the existing unsupervised EM approaches and is comparable or even superior to the state-of-the-art supervised EM methods.
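The first phase described in the abstract, producing positive and negative tuple pairs without human annotation, can be illustrated with a minimal sketch. The abstract does not specify how ALG selects pairs, so the code below is not the authors' method: it swaps in a simple token-level Jaccard similarity with two hypothetical thresholds (`pos_threshold`, `neg_threshold`), and all function names and sample records are assumptions made for illustration. Pairs that fall between the thresholds are treated as ambiguous and left unlabeled.

```python
def tokens(record):
    """Lowercased word set over all attribute values of a tuple."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def generate_pseudo_labels(left, right, pos_threshold=0.8, neg_threshold=0.2):
    """Return (positives, negatives) as lists of (left_index, right_index).

    Assumption: a stand-in heuristic, not the ALG procedure from the paper.
    Pairs at or above pos_threshold become pseudo-positives, pairs at or
    below neg_threshold become pseudo-negatives, and the rest are skipped.
    """
    positives, negatives = [], []
    right_tokens = [tokens(r) for r in right]
    for i, l_rec in enumerate(left):
        l_tok = tokens(l_rec)
        for j, r_tok in enumerate(right_tokens):
            score = jaccard(l_tok, r_tok)
            if score >= pos_threshold:
                positives.append((i, j))
            elif score <= neg_threshold:
                negatives.append((i, j))
    return positives, negatives


# Hypothetical sample tables, invented for this example.
left = [
    {"title": "iphone 12 pro 128gb", "brand": "apple"},
    {"title": "galaxy s21 ultra", "brand": "samsung"},
]
right = [
    {"title": "iphone 12 pro 128gb", "brand": "apple"},
    {"title": "pixel 5", "brand": "google"},
    {"title": "galaxy s21", "brand": "samsung"},
]

positives, negatives = generate_pseudo_labels(left, right)
print("positives:", positives)
print("negatives:", negatives)
```

In this sketch the pair (1, 2) scores 0.75 and is deliberately left unlabeled, which reflects the general idea that only high-confidence pairs should feed a downstream matcher. The second phase (CEMT), which learns from graph and sentence features, is not sketched here because the abstract gives no detail on its architecture.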
Keywords
Entity matching, sentence feature, graph feature, self-supervised, anomaly detection
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Data Science and Engineering
Publication
IEEE Transactions on Knowledge and Data Engineering
Volume
35
Issue
12
First Page
12139
Last Page
12152
ISSN
1041-4347
Identifier
10.1109/TKDE.2021.3134806
Publisher
Institute of Electrical and Electronics Engineers
Citation
GE, Congcong; WANG, Pengfei; CHEN, Lu; LIU, Xiaoze; ZHENG, Baihua; and GAO, Yunjun.
CollaborEM: A self-supervised entity matching framework using multi-features collaboration. (2023). IEEE Transactions on Knowledge and Data Engineering. 35, (12), 12139-12152.
Available at: https://ink.library.smu.edu.sg/sis_research/8341
Additional URL
https://doi.org/10.1109/TKDE.2021.3134806