Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
4-2006
Abstract
Support Vector Machine (SVM) classifiers are widely used in text classification tasks, and these tasks often involve imbalanced training data. In this paper, we specifically address the cases where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm) is proposed to select only a subset of negative training documents for training an SVM classifier. With a smaller, carefully selected training set, an SVM classifier can be trained more efficiently while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% of the negative training examples and 60% of the learning time, methods based on FISA delivered much better classification accuracy than methods using all negative training documents.
Keywords
Support vector machine, Statistical analysis, Electronic discussion group, Classification, Natural language, Text, Information retrieval, Content analysis, Data analysis, Knowledge discovery, Data mining
Discipline
Databases and Information Systems | Numerical Analysis and Scientific Computing
Publication
Advances in Knowledge Discovery and Data Mining: 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12: Proceedings
Volume
3918
First Page
250
Last Page
254
ISBN
9783540332077
Identifier
10.1007/11731139_30
Publisher
Springer Verlag
City or Country
Singapore
Citation
SUN, Aixin; LIM, Ee Peng; Benatallah, Boualem; and Hassan, Mahbub.
FISA: Feature-based instance selection for imbalanced text classification. (2006). Advances in Knowledge Discovery and Data Mining: 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12: Proceedings. 3918, 250-254.
Available at: https://ink.library.smu.edu.sg/sis_research/894
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
http://doi.org/10.1007/11731139_30
Included in
Databases and Information Systems Commons, Numerical Analysis and Scientific Computing Commons