Publication Type

Conference Proceeding Article

Version

publishedVersion

Publication Date

4-2006

Abstract

Support Vector Machine (SVM) classifiers are widely used in text classification tasks, and these tasks often involve imbalanced training data. In this paper, we specifically address the cases where negative training documents significantly outnumber the positive ones. A generic algorithm known as FISA (Feature-based Instance Selection Algorithm) is proposed to select only a subset of negative training documents for training an SVM classifier. With a smaller, carefully selected training set, an SVM classifier can be trained more efficiently while delivering comparable or better classification accuracy. In our experiments on the 20-Newsgroups dataset, using only 35% of the negative training examples and 60% of the learning time, methods based on FISA delivered much better classification accuracy than methods using all negative training documents.
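
The abstract does not spell out FISA's selection rule, so the following is only a minimal sketch of the general idea of feature-based selection of negative documents, not the authors' algorithm. It assumes a hypothetical rule: treat the terms occurring in the most positive documents as the features of interest, and keep only the negative documents that contain at least one of them. The function name `select_negatives`, the parameter `top_k`, and the toy documents are all illustrative assumptions.

```python
from collections import Counter


def select_negatives(positives, negatives, top_k=5):
    """Hypothetical stand-in for feature-based instance selection.

    This is NOT FISA's actual rule (the abstract does not give it); it only
    illustrates picking a subset of negatives based on features.
    """
    # Document frequency of each term across the positive documents.
    doc_freq = Counter()
    for doc in positives:
        doc_freq.update(set(doc.lower().split()))

    # Assumption: the top_k most frequent positive terms act as the features.
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    features = {term for term, _ in ranked[:top_k]}

    # Keep only the negatives that share at least one feature with the positives.
    return [doc for doc in negatives if features & set(doc.lower().split())]


# Toy data (invented for illustration, not taken from 20-Newsgroups).
positives = ["graphics card driver", "graphics rendering driver"]
negatives = [
    "hockey game tonight",
    "new driver for printer",
    "space shuttle launch",
    "graphics in games",
]

selected = select_negatives(positives, negatives, top_k=2)
print(selected)
```

The selected negatives, together with all positives, would then form the reduced training set passed to an SVM. Whether such a rule matches the gains reported in the abstract is not something this sketch can show.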

Keywords

Support vector machine, Statistical analysis, Electronic discussion group, Classification, Natural language, Text, Information retrieval, Content analysis, Data analysis, Knowledge discovery, Data mining

Discipline

Databases and Information Systems | Numerical Analysis and Scientific Computing

Publication

Advances in Knowledge Discovery and Data Mining: 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12: Proceedings

Volume

3918

First Page

250

Last Page

254

ISBN

9783540332077

Identifier

10.1007/11731139_30

Publisher

Springer Verlag

City or Country

Singapore

Citation

SUN, Aixin; LIM, Ee Peng; Benatallah, Boualem; and Hassan, Mahbub. FISA: Feature-based instance selection for imbalanced text classification. (2006). Advances in Knowledge Discovery and Data Mining: 10th Pacific-Asia Conference, PAKDD 2006, Singapore, April 9-12: Proceedings. 3918, 250-254.
Available at: https://ink.library.smu.edu.sg/sis_research/894

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.

Additional URL

http://doi.org/10.1007/11731139_30

Included in

Databases and Information Systems Commons, Numerical Analysis and Scientific Computing Commons

Research Collection School Of Computing and Information Systems

FISA: Feature-based instance selection for imbalanced text classification
