Modeling Syntactic Structures of Topics with a Nested HMM-LDA
Publication Type
Conference Proceeding Article
Version
publishedVersion
Publication Date
12-2009
Abstract
Latent Dirichlet allocation (LDA) is a commonly used topic modeling method for text analysis and mining. Standard LDA treats documents as bags of words, ignoring the syntactic structures of sentences. In this paper, we propose a hybrid model that embeds hidden Markov models (HMMs) within LDA topics to jointly model both the topics and the syntactic structures within each topic. Our model is general and subsumes standard LDA and the HMM as special cases. Compared with standard LDA and the HMM, our model can simultaneously discover topic-specific content words and background functional words shared among topics. It can also automatically separate content words that play different roles within a topic. Using perplexity as the evaluation metric, our model achieves lower perplexity on unseen test documents than standard LDA, demonstrating its stronger generalization power.
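The abstract describes the model only at a high level; the sketch below illustrates one plausible reading of its generative process, assuming a sentence-level topic draw and a single background state shared across all topics. All sizes, variable names, hyperparameter values, and the state-0 convention are illustrative assumptions, not the paper's specification.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative sizes only; the paper does not prescribe these values.
    V, K, S = 1000, 10, 4    # vocabulary size, topics, HMM states per topic
    alpha, beta = 0.1, 0.01  # Dirichlet hyperparameters (assumed)

    # Standard LDA part: per-document topic proportions.
    theta = rng.dirichlet([alpha] * K)

    # Nested HMM part: each topic k has its own transition matrix over S states.
    # State 0 is treated here as a "background" state whose emission distribution
    # is shared by all topics (functional words); states 1..S-1 emit topic-specific
    # content words, capturing the different roles words play within a topic.
    trans = rng.dirichlet([1.0] * S, size=(K, S))   # trans[k, s] -> next-state dist
    background = rng.dirichlet([beta] * V)          # shared functional-word dist
    emit = rng.dirichlet([beta] * V, size=(K, S))   # emit[k, s] -> word dist

    def generate_sentence(length):
        """Draw a topic LDA-style, then emit words by walking that topic's HMM."""
        k = rng.choice(K, p=theta)   # assumed: one topic per sentence
        s = 0                        # assumed: chains start in the background state
        words = []
        for _ in range(length):
            s = rng.choice(S, p=trans[k, s])
            dist = background if s == 0 else emit[k, s]
            words.append(rng.choice(V, p=dist))
        return k, words

    topic, words = generate_sentence(8)
    print(topic, words)

In this reading, collapsing the per-topic HMM to a single content state recovers something like standard LDA, while using a single topic recovers a plain HMM, roughly matching the abstract's claim that both are special cases of the model.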
Keywords
background functional words, hidden Markov models, latent Dirichlet allocation, syntactic structure modeling, text analysis, text mining, topic modeling method, topic-specific content words
Discipline
Computer Sciences | Numerical Analysis and Scientific Computing
Research Areas
Information Systems and Management
Publication
9th IEEE International Conference on Data Mining (ICDM 2009)
First Page
824
Last Page
829
ISBN
9780769538952
Identifier
10.1109/ICDM.2009.144
Publisher
IEEE
City or Country
Miami, FL
Citation
JIANG, Jing. Modeling Syntactic Structures of Topics with a Nested HMM-LDA. (2009). 9th IEEE International Conference on Data Mining, 824-829.
Available at: https://ink.library.smu.edu.sg/sis_research/351
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Additional URL
http://dx.doi.org/10.1109/ICDM.2009.144