Bursty feature representation for clustering text streams
Publication Type
Conference Proceeding Article
Publication Date
4-2007
Abstract
Text representation plays a crucial role in classical text mining, where the primary focus was on static text. Nevertheless, well-studied static text representations including TF-IDF are not optimized for non-stationary streams of information such as news, discussion board messages, and blogs. We therefore introduce a new temporal representation for text streams based on bursty features. Our bursty text representation differs significantly from traditional schemes in that it 1) dynamically represents documents over time, 2) amplifies a feature in proportion to its burstiness at any point in time, and 3) is topic independent. Our bursty text representation model was evaluated against a classical bag-of-words text representation on the task of clustering TDT3 topical text streams. It was shown to consistently yield more cohesive clusters in terms of cluster purity and cluster/class entropies. This new temporal bursty text representation can be extended to most text mining tasks involving a temporal dimension, such as modeling of online blog pages.
Discipline
Databases and Information Systems | Numerical Analysis and Scientific Computing
Publication
Proceedings of the 2007 SIAM International Conference on Data Mining: April 26-28, Minneapolis
First Page
491
Last Page
496
ISBN
9780898716306
Identifier
10.1137/1.9781611972771.50
Publisher
SIAM
City or Country
Philadelphia, PA
Citation
HE, Qi; CHANG, Kuiyu; LIM, Ee Peng; and ZHANG, Jun.
Bursty feature representation for clustering text streams. (2007). Proceedings of the 2007 SIAM International Conference on Data Mining: April 26-28, Minneapolis. 491-496.
Available at: https://ink.library.smu.edu.sg/sis_research/1273
Additional URL
http://doi.org/10.1137/1.9781611972771.50