Research Collection School Of Computing and Information Systems

An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval

Jing JIANG, Singapore Management University
ChengXiang ZHAI, University of Illinois at Urbana-Champaign

Publication Type

Journal Article

Publication Date

10-2007

Abstract

Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experimental results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, we show that different query types require different tokenization heuristics, that stemming is effective only for certain queries, and that stop word removal in general does not improve the retrieval performance on biomedical text.

Discipline

Databases and Information Systems | Numerical Analysis and Scientific Computing

Publication

Information Retrieval

Volume

10

Issue

4/5

First Page

341

Last Page

363

ISSN

1386-4564

Identifier

10.1007/s10791-007-9027-7

Publisher

Springer

Citation

JIANG, Jing and ZHAI, ChengXiang. An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval. (2007). Information Retrieval. 10, (4/5), 341-363.
Available at: https://ink.library.smu.edu.sg/sis_research/23

Additional URL

http://dx.doi.org/10.1007/s10791-007-9027-7

