Squeezing Long Sequence Data for Efficient Similarity Search
Publication Type
Conference Proceeding Article
Publication Date
3-2008
Abstract
Similarity search over long sequence datasets is becoming increasingly popular in many emerging applications. In this paper, a novel index structure, the Sequence Embedding Multiset tree (SEM-tree), is proposed to speed up the search process over long sequences. The SEM-tree is a multi-level structure in which each level represents the sequence data as multisets at a different compression level, and the length of the multisets increases towards the leaf level, which contains the original sequences. The multisets, obtained using sequence embedding algorithms, have the desirable property that they do not need to keep the character order of the sequence, which yields a shorter representation, yet they preserve the majority of the distance information between sequences. Each level of the tree serves to prune the search space more efficiently, as the multisets utilize this predictability to finish the search process early and greatly reduce the computational cost. A comprehensive set of experiments is conducted to evaluate the performance of the SEM-tree, and the experimental results show that the proposed method is much more efficient than existing representative methods.
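The filter-and-refine idea described in the abstract can be illustrated with a minimal Python sketch. It is not the SEM-tree itself, nor the paper's embedding algorithm: it assumes a single level instead of a tree, uses plain character-count multisets as a stand-in embedding, and all function names (`edit_distance`, `multiset_lower_bound`, `range_search`) are hypothetical. The one property it relies on is that the character-count difference between two strings never exceeds their edit distance, so a candidate whose bound is already above the search radius can be discarded without the expensive comparison.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def multiset_lower_bound(ma: Counter, mb: Counter) -> int:
    """Order-free lower bound on edit distance (assumed stand-in embedding).

    One edit changes each surplus by at most one, so the larger surplus
    can never exceed the true edit distance.
    """
    surplus_a = sum((ma - mb).values())  # characters a has in excess of b
    surplus_b = sum((mb - ma).values())  # characters b has in excess of a
    return max(surplus_a, surplus_b)


def range_search(query: str, sequences: list[str], radius: int):
    """Return sequences within `radius` edits of `query`, plus the number
    of candidates that needed the full edit-distance computation."""
    mq = Counter(query)
    # In a real index the multisets would be built once, not per query.
    index = [(Counter(s), s) for s in sequences]
    hits, verified = [], 0
    for ms, s in index:
        if multiset_lower_bound(mq, ms) > radius:
            continue  # pruned using the cheap multiset bound only
        verified += 1
        if edit_distance(query, s) <= radius:
            hits.append(s)
    return hits, verified


if __name__ == "__main__":
    data = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGTTCGT", "GGGGCCCC"]
    hits, verified = range_search("ACGTACGT", data, radius=1)
    print(hits, verified)
```

In this toy run, two of the five candidates are rejected by the multiset bound alone, so only three reach the dynamic-programming step. A multi-level structure such as the one the abstract describes would presumably apply coarser bounds first and finer ones deeper in the tree, but that layering is not modelled here.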
Discipline
Computer Sciences
Publication
10th Asia Pacific Web Conference (APWeb'08)
First Page
438
Last Page
449
Identifier
10.1007/978-3-540-78849-2_44
Publisher
Springer Verlag
Citation
SONG, Guojie; Cui, Bin; ZHENG, Baihua; Xie, Kunqing; and YANG, Dongqing.
Squeezing Long Sequence Data for Efficient Similarity Search. (2008). 10th Asia Pacific Web Conference (APWeb'08). 438-449.
Available at: https://ink.library.smu.edu.sg/sis_research/405
Additional URL
http://dx.doi.org/10.1007/978-3-540-78849-2_44
Comments
4976/2008