Publication Type
Journal Article
Version
publishedVersion
Publication Date
7-2014
Abstract
A big challenge in text classification is to perform classification on a large-scale and high-dimensional text corpus in the presence of imbalanced class distributions and a large number of irrelevant or noisy term features. A number of techniques have been proposed to handle this challenge with varying degrees of success. In this paper, by combining the strengths of two widely used text classification techniques, K-Nearest-Neighbor (KNN) and centroid based (Centroid) classifiers, we propose a scalable and effective flat classifier, called CenKNN, to cope with this challenge. CenKNN projects high-dimensional (often hundreds of thousands) documents into a low-dimensional (normally a few dozen) space spanned by class centroids, and then uses the \(k\)-d tree structure to find \(K\) nearest neighbors efficiently. Due to the strong representation power of class centroids, CenKNN overcomes two issues related to existing KNN text classifiers, i.e., sensitivity to imbalanced class distributions and irrelevant or noisy term features. By working on projected low-dimensional data, CenKNN substantially reduces the expensive computation time in KNN. CenKNN also works better than Centroid since it uses all the class centroids to define similarity and works well on complex data, i.e., non-linearly separable data and data with local patterns within each class. A series of experiments on both English and Chinese, benchmark and synthetic corpora demonstrates that although CenKNN works on a significantly lower-dimensional space, it performs substantially better than KNN and its five variants, and existing scalable classifiers, including Centroid and Rocchio. CenKNN is also empirically preferable to another well-known classifier, support vector machines, on highly imbalanced corpora with a small number of classes.
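The two-stage idea described in the abstract (project each document onto the class centroids, then run KNN in the resulting low-dimensional space) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes cosine similarity to each class centroid as the projection, an unweighted majority vote, and a brute-force neighbor search standing in for the \(k\)-d tree to keep the example dependency-free. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict

def normalize(v):
    """Scale v to unit L2 length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def class_centroids(docs, labels):
    """Return (sorted class labels, one unit-length mean vector per class)."""
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(normalize(doc))
    classes = sorted(groups)
    centroids = []
    for c in classes:
        mean = [sum(col) / len(groups[c]) for col in zip(*groups[c])]
        centroids.append(normalize(mean))
    return classes, centroids

def project(doc, centroids):
    """Map a term vector to its cosine similarities with every class centroid."""
    unit = normalize(doc)
    return [sum(a * b for a, b in zip(unit, c)) for c in centroids]

def knn_predict(query, points, labels, k=3):
    """Majority vote among the k nearest projected points.

    Assumption: brute-force Euclidean search replaces the k-d tree here.
    """
    ranked = sorted(zip(points, labels), key=lambda pl: math.dist(query, pl[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus: 4 term features, 2 classes.
train_docs = [[3, 1, 0, 0], [2, 2, 0, 0], [4, 0, 1, 0],
              [0, 0, 3, 1], [0, 1, 2, 2], [0, 0, 1, 4]]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]

classes, centroids = class_centroids(train_docs, train_labels)
projected = [project(d, centroids) for d in train_docs]  # 4-D -> 2-D
prediction = knn_predict(project([1, 0, 0, 0], centroids), projected, train_labels)
print(prediction)
```

The projected space has one dimension per class, which is why neighbor search becomes cheap even when the original term space has hundreds of thousands of features. In a larger setting, a k-d tree index over the projected training points would replace the sorted scan.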
Keywords
Text classification, KNN, Centroid, Dimension reduction, Imbalanced classification
Discipline
Artificial Intelligence and Robotics | Databases and Information Systems
Research Areas
Intelligent Systems and Optimization
Publication
Data Mining and Knowledge Discovery
Volume
29
Issue
3
First Page
593
Last Page
625
ISSN
1384-5810
Identifier
10.1007/s10618-014-0358-x
Publisher
Springer Verlag (Germany)
Citation
PANG, Guansong; JIN, Huidong; and JIANG, Shengyi.
CenKNN: A scalable and effective text classifier. (2014). Data Mining and Knowledge Discovery. 29, (3), 593-625.
Available at: https://ink.library.smu.edu.sg/sis_research/7027
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://www.researchgate.net/profile/Guansong-Pang/publication/263890481_CenKNN_a_scalable_and_effective_text_classifier/links/0deec53c481734b43a000000/CenKNN-a-scalable-and-effective-text-classifier.pdf