Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
10-2011
Abstract
We study the problem of online classification of user generated content, with the goal of efficiently learning to categorize content generated by individual users. This problem is challenging for several reasons. First, the huge amount of user generated content demands a highly efficient and scalable classification solution. Second, the categories are typically highly imbalanced, i.e., samples from a particular useful class (the minority class) can be few and far between compared with those of other classes (the majority class). In some applications, such as spam detection, identifying the minority class often has significantly greater value than identifying the majority class. Last but not least, learning a classification model from a group of users poses a dilemma: a single classification model trained on the entire corpus may fail to capture personalized characteristics, such as the language and writing style unique to each user. On the other hand, a personalized model dedicated to each user may be inaccurate due to the scarcity of training data, especially at the very beginning, when users have written just a few articles. To overcome these challenges, we propose learning a global model over all users' data, which is then leveraged to continuously refine the individual models through a collaborative online learning approach. The class imbalance problem is addressed via a cost-sensitive learning approach. Experimental results show that our method is effective and scalable for timely classification of user generated content.
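The interplay the abstract describes (a global model shared by all users, per-user models refined online, and class-dependent costs) can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the update rule, so the passive-aggressive style step, the fixed `mix` weight between global and user scores, and the `pos_cost`/`neg_cost` caps are all assumptions chosen for illustration, as are the class and method names.

```python
from collections import defaultdict


def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in x.items())


class CollaborativeOnlineClassifier:
    """Illustrative sketch only; names and update rule are hypothetical."""

    def __init__(self, pos_cost=5.0, neg_cost=1.0, mix=0.5):
        self.global_w = {}                # one model over all users' data
        self.user_w = defaultdict(dict)   # one model per user
        self.pos_cost = pos_cost          # step-size cap for the minority class (+1)
        self.neg_cost = neg_cost          # step-size cap for the majority class (-1)
        self.mix = mix                    # assumed fixed weight of the global score

    def score(self, user, x):
        g = dot(self.global_w, x)
        u = dot(self.user_w[user], x)
        return self.mix * g + (1.0 - self.mix) * u

    def predict(self, user, x):
        return 1 if self.score(user, x) >= 0 else -1

    def _step(self, w, x, y, current_score):
        # Hinge loss with a class-dependent cap on the step size, so that
        # mistakes on the minority class may trigger larger corrections.
        loss = max(0.0, 1.0 - y * current_score)
        sq_norm = sum(v * v for v in x.values())
        if loss == 0.0 or sq_norm == 0.0:
            return
        cap = self.pos_cost if y == 1 else self.neg_cost
        tau = min(cap, loss / sq_norm)
        for k, v in x.items():
            w[k] = w.get(k, 0.0) + tau * y * v

    def update(self, user, x, y):
        # 1) Update the global model on every labeled example.
        self._step(self.global_w, x, y, dot(self.global_w, x))
        # 2) Refine the user's model against the combined score, so it only
        #    has to correct what the global model does not already capture.
        self._step(self.user_w[user], x, y, self.score(user, x))


# Tiny stream of (user, features, label) examples; labels are +1 / -1.
clf = CollaborativeOnlineClassifier()
clf.update("alice", {"spam": 1.0}, 1)
clf.update("alice", {"hello": 1.0}, -1)
```

In this sketch a user with no history, such as a hypothetical `"bob"`, is still classified using the global model alone, which mirrors the cold-start motivation given in the abstract.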
Keywords
online learning, classification, imbalanced class distribution
Discipline
Computer Sciences | Databases and Information Systems | Numerical Analysis and Scientific Computing
Research Areas
Data Science and Engineering
Publication
CIKM '11: Proceedings of the 20th ACM International Conference on Information and Knowledge Management: Glasgow, Scotland, October 24-28
First Page
285
Last Page
290
ISBN
9781450307178
Identifier
10.1145/2063576.2063622
Publisher
ACM
City or Country
New York
Citation
LI, Guangxia; CHANG, Kuiyu; HOI, Steven C. H.; LIU, Wenting; and JAIN, Ramesh.
Collaborative online learning of user generated content. (2011). CIKM '11: Proceedings of the 20th ACM International Conference on Information and Knowledge Management: Glasgow, Scotland, October 24-28. 285-290.
Available at: https://ink.library.smu.edu.sg/sis_research/2349
Copyright Owner and License
Publisher
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 International License.
Additional URL
https://doi.org/10.1145/2063576.2063622
Included in
Databases and Information Systems Commons, Numerical Analysis and Scientific Computing Commons