Publication Type
Journal Article
Version
acceptedVersion
Publication Date
3-2017
Abstract
Given two object sets Q and O , a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications in data cleaning, data mining, to name but a few. However, the rapidly growing volume of data nowadays challenges traditional metric similarity join methods, and thus, a distributed method is required. In this paper, we adopt a popular distributed framework, namely, MapReduce, to support scalable metric similarity joins. To ensure the load balancing, we present two sampling based partition methods. One utilizes the pivot and the space-filling curve mappings to cluster the data into one-dimensional space, and then selects high quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to equally divide the data after the pivot mapping. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where the range-object filtering, the double-pivot filtering, the pivot filtering, and the plane sweeping techniques are utilized for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions outperform significantly existing state-of-the-art competitors.
Keywords
Algorithm, Similarity Joins, Metric Space, MapReduce
Discipline
Computer Sciences | Databases and Information Systems | Theory and Algorithms
Research Areas
Data Science and Engineering
Publication
IEEE Transactions on Knowledge and Data Engineering
Volume
29
Issue
3
First Page
656
Last Page
669
ISSN
1041-4347
Identifier
10.1109/TKDE.2016.2631599
Publisher
Institute of Electrical and Electronics Engineers (IEEE)
Citation
GAO, Yunjun; YANG, Keyu; CHEN, Lu; ZHENG, Baihua; CHEN, Gang; and CHEN, Chun.
Metric similarity joins using MapReduce. (2017). IEEE Transactions on Knowledge and Data Engineering. 29, (3), 656-669.
Available at: https://ink.library.smu.edu.sg/sis_research/3323
Copyright Owner and License
Authors
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1109/TKDE.2016.2631599