Research Collection School Of Computing and Information Systems

Accelerating Sequence Searching: Dimensionality Reduction Method

Publication Type

Journal Article

Version

submittedVersion

Publication Date

9-2009

Abstract

Similarity search over long sequence dataset becomes increasingly popular in many emerging applications, such as text retrieval, genetic sequences exploring, etc. In this paper, a novel index structure, namely Sequence Embedding Multiset tree (SEM − tree), has been proposed to speed up the searching process over long sequences. The SEM-tree is a multi-level structure where each level represents the sequence data with different compression level of multiset, and the length of multiset increases towards the leaf level which contains original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order in the sequence, i.e. shorter representation, but can reserve the majority of distance information of sequences. Each level of the tree serves to prune the search space more efficiently as the multisets utilize the predicability to finish the searching process beforehand and reduce the computational cost greatly. A set of comprehensive experiments are conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods.

Keywords

Sequence similarity search, Sequence embedding, Index, Dimension reduction

Discipline

Computer Sciences | Databases and Information Systems

Publication

Knowledge and Information Systems

Volume

Issue

First Page

301

Last Page

322

ISSN

0219-1377

Identifier

10.1007/s10115-008-0180-0

Publisher

Springer Verlag

Citation

SONG, Guojie; CUI, Bin; ZHENG, Baihua; XIE, Kunqing; and YANG, Dongqing. Accelerating Sequence Searching: Dimensionality Reduction Method. (2009). Knowledge and Information Systems. 20, (3), 301-322.
Available at: https://ink.library.smu.edu.sg/sis_research/750

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

http://dx.doi.org/10.1007/s10115-008-0180-0

Download

Find it in your library

Included in

Databases and Information Systems Commons

COinS

Research Collection School Of Computing and Information Systems

Accelerating Sequence Searching: Dimensionality Reduction Method

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Creative Commons License

Additional URL

Included in

Search

Links

Browse

Links

Research Collection School Of Computing and Information Systems

Accelerating Sequence Searching: Dimensionality Reduction Method

Author

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Creative Commons License

Additional URL

Included in

Share

Search

Links

Browse

Links