A Topic Text Network Construction Method Based on PL-LDA Model
ZHANG Zhiyuan1,2, HUO Weigang1
1. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China; 2. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Abstract: Labeled LDA can mine the probabilities of words under a given topic; however, it cannot analyze the association relationships among those topic words. Although the correlation between word pairs can be calculated with PMI (pointwise mutual information), their relationship to the given topic is lost. Motivated by PMI's technique of counting word-pair co-occurrences within a fixed window, this paper proposes a topic model called PL-LDA (Pointwise Labeled LDA), which computes the joint probabilities of word pairs under a given topic. Experimental results on aviation safety reports show that the model produces results with good interpretability. Based on the output of PL-LDA, this paper constructs a topic text network that provides analysts with rich and effective information, reflecting the distribution of topic words and displaying the complex relationships among them.
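The window-based co-occurrence counting that motivates PL-LDA is the standard PMI estimation procedure. A minimal sketch of that baseline (not the paper's PL-LDA model itself; the function name and window size are illustrative assumptions) might look like:

```python
from collections import Counter
from itertools import combinations
import math

def pmi_scores(tokens, window=5):
    """Estimate PMI for word pairs co-occurring in a sliding window.

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), where p(x, y) is
    estimated from the fraction of windows containing both words and
    p(x), p(y) from unigram frequencies. This is the classic measure
    of Church & Hanks; it carries no topic information, which is the
    limitation PL-LDA addresses.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    n_windows = max(len(tokens) - window + 1, 1)
    for i in range(n_windows):
        # Count each unordered pair at most once per window.
        for x, y in combinations(sorted(set(tokens[i:i + window])), 2):
            pair_counts[(x, y)] += 1
    n_words = len(tokens)
    scores = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / n_windows
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores
```

Strongly associated pairs receive positive scores; PL-LDA keeps this pairwise view but conditions the joint probabilities on a given topic label.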
ZHANG Zhiyuan, HUO Weigang. A Topic Text Network Construction Method Based on PL-LDA Model[J]. Complex Systems and Complexity Science, 2017, 14(1): 52-57.