Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
6-2006
Abstract
Named Entity Recognition (NER) is a fundamental task in text mining and natural language understanding. Current approaches to NER (mostly based on supervised learning) perform well on domains similar to the training domain, but they tend to adapt poorly to slightly different domains. We present several strategies for exploiting the domain structure in the training data to learn a more robust named entity recognizer that can perform well on a new domain. First, we propose a simple yet effective way to automatically rank features based on their generalizability across domains. We then train a classifier with strong emphasis on the most generalizable features. This emphasis is imposed by putting a rank-based prior on a logistic regression model. We further propose a domain-aware cross-validation strategy to help choose an appropriate parameter for the rank-based prior. We evaluated the proposed method on the task of recognizing named entities (genes) in biology text involving three species. The experimental results show that the new domain-aware approach outperforms a state-of-the-art baseline method in adapting to new domains, especially when there is a great difference between the new domain and the training domain.
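The three steps named in the abstract (ranking features by cross-domain generalizability, a rank-based prior on logistic regression, and domain-aware cross-validation) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the worst-rank combination rule, the variance schedule base_var / (1 + decay * rank), the plain gradient-descent fit, and all function and parameter names are hypothetical and are not taken from the paper.

```python
import numpy as np

def rank_features_across_domains(per_domain_scores):
    """Order features from most to least generalizable.

    per_domain_scores: (n_domains, n_features) usefulness scores.
    Assumption: a feature generalizes if it ranks well in EVERY domain,
    so per-domain ranks are combined by taking the worst one.
    """
    scores = np.asarray(per_domain_scores, dtype=float)
    # Rank 0 is the best-scoring feature within each domain.
    within = np.argsort(np.argsort(-scores, axis=1), axis=1)
    worst = within.max(axis=0)
    return np.argsort(worst, kind="stable")

def rank_based_variances(order, base_var=1.0, decay=0.5):
    """Gaussian prior variance per feature, shrinking with rank (hypothetical schedule)."""
    var = np.empty(len(order))
    var[order] = base_var / (1.0 + decay * np.arange(len(order)))
    return var

def fit_logreg(X, y, prior_var, lr=0.1, steps=500):
    """MAP logistic regression with a zero-mean Gaussian prior per weight."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        # Gradient of the mean negative log-likelihood plus the prior term;
        # a small variance pulls the corresponding weight towards zero.
        grad = X.T @ (p - y) / n + w / (prior_var * n)
        w -= lr * grad
    return w

def domain_aware_cv(X, y, domains, order, decays):
    """Choose the prior parameter by holding out one whole domain at a time."""
    best_decay, best_score = None, -1.0
    for decay in decays:
        accs = []
        for d in np.unique(domains):
            train, test = domains != d, domains == d
            var = rank_based_variances(order, decay=decay)
            w = fit_logreg(X[train], y[train], var)
            accs.append(np.mean((X[test] @ w > 0) == (y[test] == 1)))
        score = float(np.mean(accs))
        if score > best_score:
            best_decay, best_score = decay, score
    return best_decay, best_score
```

Holding out an entire domain, rather than a random subset of examples, makes the validation score reflect performance on an unseen domain, which is the setting the abstract targets. For simplicity this sketch takes the feature ordering as a fixed input; a more careful version would recompute it from the training domains of each fold.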
Discipline
Databases and Information Systems | Numerical Analysis and Scientific Computing
Publication
HLT-NAACL '06: Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
First Page
74
Last Page
81
Identifier
10.3115/1220835.1220845
Publisher
ACL
City or Country
New York City, NY, USA
Citation
JIANG, Jing and ZHAI, ChengXiang.
Exploiting Domain Structure for Named Entity Recognition. (2006). HLT-NAACL '06: Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. 74-81.
Available at: https://ink.library.smu.edu.sg/sis_research/1255
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
http://dx.doi.org/10.3115/1220835.1220845