Bursty feature representation for clustering text streams

Publication Type

Conference Proceeding Article

Publication Date

4-2007

Abstract

Text representation plays a crucial role in classical text mining, where the primary focus has been on static text. Nevertheless, well-studied static text representations, including TFIDF, are not optimized for non-stationary streams of information such as news, discussion board messages, and blogs. We therefore introduce a new temporal representation for text streams based on bursty features. Our bursty text representation differs significantly from traditional schemes in that it 1) dynamically represents documents over time, 2) amplifies a feature in proportion to its burstiness at any point in time, and 3) is topic independent. Our bursty text representation model was evaluated against a classical bag-of-words text representation on the task of clustering TDT3 topical text streams. It was shown to consistently yield more cohesive clusters in terms of cluster purity and cluster/class entropies. This new temporal bursty text representation can be extended to most text mining tasks involving a temporal dimension, such as modeling of online blog pages.

Discipline

Databases and Information Systems | Numerical Analysis and Scientific Computing

Publication

Proceedings of the 2007 SIAM International Conference on Data Mining: April 26-28, Minneapolis

First Page

491

Last Page

496

ISBN

9780898716306

Identifier

10.1137/1.9781611972771.50

Publisher

SIAM

City or Country

Philadelphia, PA

Additional URL

http://doi.org/10.1137/1.9781611972771.50
