Bursty feature representation for clustering text streams
Conference Proceeding Article
Text representation plays a crucial role in classical text mining, where the primary focus has been on static text. Nevertheless, well-studied static text representations such as TF-IDF are not optimized for non-stationary streams of information such as news, discussion board messages, and blogs. We therefore introduce a new temporal representation for text streams based on bursty features. Our bursty text representation differs significantly from traditional schemes in that it 1) dynamically represents documents over time, 2) amplifies a feature in proportion to its burstiness at any point in time, and 3) is topic independent. Our bursty text representation model was evaluated against a classical bag-of-words text representation on the task of clustering TDT3 topical text streams. It was shown to consistently yield more cohesive clusters in terms of cluster purity and cluster/class entropies. This new temporal bursty text representation can be extended to most text mining tasks involving a temporal dimension, such as modeling of online blog pages.
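The idea of amplifying a feature in proportion to its burstiness can be illustrated with a minimal Python sketch. This is not the procedure from the paper: the burstiness score used here (a feature's document rate at one time step divided by its rate over the whole stream) is an assumption chosen for simplicity, and the names `burst_scores` and `bursty_vector` are hypothetical.

```python
from collections import Counter, defaultdict


def burst_scores(stream):
    """Return {(time, feature): score} for features that are bursty at a time step.

    `stream` is a list of (time, tokens) pairs. The score is an assumed,
    simplified measure: the feature's document rate at that time step
    divided by its document rate over the whole stream.
    """
    docs_at = Counter()           # number of documents per time step
    df_at = defaultdict(Counter)  # per-time-step document frequency
    df_all = Counter()            # document frequency over the whole stream
    for t, tokens in stream:
        docs_at[t] += 1
        for f in set(tokens):
            df_at[t][f] += 1
            df_all[f] += 1

    n_docs = len(stream)
    scores = {}
    for t, df in df_at.items():
        for f, count in df.items():
            ratio = (count / docs_at[t]) / (df_all[f] / n_docs)
            if ratio > 1.0:  # more frequent than usual at time t
                scores[(t, f)] = ratio
    return scores


def bursty_vector(time, tokens, scores):
    """Binary bag-of-words weights, amplified where the feature is bursty.

    Non-bursty features keep weight 1.0 (an assumption of this sketch).
    """
    return {f: scores.get((time, f), 1.0) for f in set(tokens)}


# Toy stream: two time steps, two documents each.
stream = [
    (0, ["quake", "news"]),
    (0, ["news", "sports"]),
    (1, ["quake", "relief"]),
    (1, ["quake", "news"]),
]
scores = burst_scores(stream)
vec = bursty_vector(1, ["quake", "news"], scores)
```

In this toy stream, the same word can receive different weights depending on when the document appears, which is the sense in which the representation is dynamic over time. The weights depend only on feature frequencies, not on topic labels.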
Databases and Information Systems | Numerical Analysis and Scientific Computing
Proceedings of the 2007 SIAM International Conference on Data Mining: April 26-28, Minneapolis
HE, Qi; CHANG, Kuiyu; LIM, Ee Peng; and ZHANG, Jun.
Bursty feature representation for clustering text streams. (2007). Proceedings of the 2007 SIAM International Conference on Data Mining: April 26-28, Minneapolis. 491-496. Research Collection School Of Information Systems.
Available at: https://ink.library.smu.edu.sg/sis_research/1273