Research Collection School Of Computing and Information Systems

Metric similarity joins using MapReduce

Publication Type

Journal Article

Version

acceptedVersion

Publication Date

3-2017

Abstract

Given two object sets Q and O , a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications in data cleaning, data mining, to name but a few. However, the rapidly growing volume of data nowadays challenges traditional metric similarity join methods, and thus, a distributed method is required. In this paper, we adopt a popular distributed framework, namely, MapReduce, to support scalable metric similarity joins. To ensure the load balancing, we present two sampling based partition methods. One utilizes the pivot and the space-filling curve mappings to cluster the data into one-dimensional space, and then selects high quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to equally divide the data after the pivot mapping. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where the range-object filtering, the double-pivot filtering, the pivot filtering, and the plane sweeping techniques are utilized for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions outperform significantly existing state-of-the-art competitors.

Keywords

Algorithm, Similarity Joins, Metric Space, MapReduce

Discipline

Computer Sciences | Databases and Information Systems | Theory and Algorithms

Research Areas

Data Science and Engineering

Publication

IEEE Transactions on Knowledge and Data Engineering

Volume

Issue

First Page

656

Last Page

669

ISSN

1041-4347

Identifier

10.1109/TKDE.2016.2631599

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Citation

GAO, Yunjun; YANG, Keyu; CHEN, Lu; ZHENG, Baihua; CHEN, Gang; and CHEN, Chun. Metric similarity joins using MapReduce. (2017). IEEE Transactions on Knowledge and Data Engineering. 29, (3), 656-669.
Available at: https://ink.library.smu.edu.sg/sis_research/3323

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

https://doi.org/10.1109/TKDE.2016.2631599

Download

Find it in your library

Included in

Databases and Information Systems Commons, Theory and Algorithms Commons

COinS

Research Collection School Of Computing and Information Systems

Metric similarity joins using MapReduce

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Copyright Owner and License

Creative Commons License

Additional URL

Included in

Search

Links

Browse

Links

Research Collection School Of Computing and Information Systems

Metric similarity joins using MapReduce

Author

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Copyright Owner and License

Creative Commons License

Additional URL

Included in

Share

Search

Links

Browse

Links