Research Collection School Of Computing and Information Systems

Negative Factor: Improving Regular-Expression Matching in Strings

Xiaochun YANG, Northeastern University
Tao QIU, Northeastern University
Bin WANG, Northeastern University
Baihua ZHENG, Singapore Management UniversityFollow
Yaoshu WANG, Northeastern University
Chen LI, University of California, Irvine

Publication Type

Journal Article

Version

acceptedVersion

Publication Date

2-2016

Abstract

The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for text documents.

Keywords

long sequence, performance, regular expression, algorithms, automata, patterns, search

Discipline

Databases and Information Systems | Theory and Algorithms

Research Areas

Data Science and Engineering

Publication

ACM Transactions on Database Systems

Volume

Issue

First Page

Last Page

ISSN

0362-5915

Identifier

10.1145/2847525

Publisher

Association for Computing Machinery (ACM)

Citation

YANG, Xiaochun; QIU, Tao; WANG, Bin; ZHENG, Baihua; WANG, Yaoshu; and LI, Chen. Negative Factor: Improving Regular-Expression Matching in Strings. (2016). ACM Transactions on Database Systems. 40, (4), 1-46.
Available at: https://ink.library.smu.edu.sg/sis_research/3157

Copyright Owner and License

Authors

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Comments

Conference paper version 2013, SIGMOD '13 Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, http://dx.doi.org/10.1145/2463676.2465289

Additional URL

https://doi.org/10.1145/2847525

Download

Find it in your library

Included in

Databases and Information Systems Commons, Theory and Algorithms Commons

COinS

Research Collection School Of Computing and Information Systems

Negative Factor: Improving Regular-Expression Matching in Strings

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Copyright Owner and License

Creative Commons License

Comments

Additional URL

Included in

Search

Links

Browse

Links

Research Collection School Of Computing and Information Systems

Negative Factor: Improving Regular-Expression Matching in Strings

Author

Publication Type

Version

Publication Date

Abstract

Keywords

Discipline

Research Areas

Publication

Volume

Issue

First Page

Last Page

ISSN

Identifier

Publisher

Citation

Copyright Owner and License

Creative Commons License

Comments

Additional URL

Included in

Share

Search

Links

Browse

Links