A Topic Text Network Construction Method Based on PL-LDA Model
ZHANG Zhiyuan1,2, HUO Weigang1
1. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China; 2. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
Abstract: Labeled LDA can mine the probabilities of words under a given topic; however, it cannot analyze the association relationships among those topic words. Although the correlation between word pairs can be calculated with PMI (pointwise mutual information), their relationship to the given topic is lost. Motivated by PMI's technique of counting word-pair co-occurrences within a fixed window, this paper proposes a topic model called PL-LDA (Pointwise Labeled LDA), which computes the joint probabilities of word pairs under a given topic. Experimental results on aviation safety reports show that the model produces results with good interpretability. Based on the output of PL-LDA, this paper constructs a topic text network that provides analysts with rich and effective information, reflecting the distribution of topic words and displaying the complex relationships among them.
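The window-based co-occurrence counting that motivates PL-LDA is the standard PMI estimation procedure. A minimal sketch of that baseline (not the paper's PL-LDA model itself; the function name and window size are illustrative assumptions) might look like:

```python
from collections import Counter
from itertools import combinations
import math

def pmi_scores(tokens, window=5):
    """Estimate PMI for word pairs co-occurring in a sliding window.

    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), where p(x, y) is
    estimated from the fraction of windows containing both words and
    p(x), p(y) from unigram frequencies. This is the classic measure
    of Church & Hanks; it carries no topic information, which is the
    limitation PL-LDA addresses.
    """
    word_counts = Counter(tokens)
    pair_counts = Counter()
    n_windows = max(len(tokens) - window + 1, 1)
    for i in range(n_windows):
        # Count each unordered pair at most once per window.
        for x, y in combinations(sorted(set(tokens[i:i + window])), 2):
            pair_counts[(x, y)] += 1
    n_words = len(tokens)
    scores = {}
    for (x, y), c in pair_counts.items():
        p_xy = c / n_windows
        p_x = word_counts[x] / n_words
        p_y = word_counts[y] / n_words
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores
```

Strongly associated pairs receive positive scores; PL-LDA keeps this pairwise view but conditions the joint probabilities on a given topic label.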
ZHANG Zhiyuan, HUO Weigang. A Topic Text Network Construction Method Based on PL-LDA Model[J]. Complex Systems and Complexity Science, 2017, 14(1): 52-57.