Standardizing Document Generation Based on Large Language Models
LIU Zheze¹,², ZHENG Nan¹,³, ZHANG Ning⁴
1. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; 2. School of Cryptography and Cyberspace Security, Nankai University, Tianjin 300350, China; 3. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China; 4. Institute of Forensic Science, Ministry of Public Security, Beijing 100038, China
Abstract: To promote standardized development across industries, each field needs to formulate corresponding standardizing documents, such as national standards and industry standards. These documents not only provide unified operating specifications for an industry, but also give the relevant parties a clear basis for guidance. In the "Outlines for the Development of National Standardization", the Central Committee of the CPC and the State Council clearly pointed out that advancing the digitalization of standards is an important measure for realizing industrial modernization, which makes research on the automatic generation of standardizing documents particularly important. With the rapid development of artificial intelligence, and especially the outstanding performance of large language models on text generation tasks, using these technologies to generate standardizing documents automatically has become feasible. Against this background, this paper proposes a two-stage scheme for generating standardizing documents: a large language model first generates the outline of the document, and then expands that outline into the complete document content. By combining in-context learning and retrieval-augmented generation, the method not only produces high-quality text but also significantly improves the accuracy and professionalism of the generated content. To verify the feasibility of the scheme, we conducted a series of experiments on a self-built dataset; the results show that the method can effectively generate documents that meet industry standards, with good practicability and potential for wider adoption.
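To make the two-stage scheme concrete, the following is a minimal Python sketch of the pipeline described in the abstract. It assumes an OpenAI-compatible chat API and a hypothetical retrieve() helper over a corpus of published standards; the model name, prompt wording, and helper names are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of the two-stage generation pipeline (assumptions noted above).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_outline(topic: str, examples: list[str]) -> str:
    """Stage 1: draft an outline via in-context learning, using outlines
    of existing standards as few-shot demonstrations."""
    demos = "\n\n".join(examples)
    prompt = (
        f"Here are outlines of published standards:\n\n{demos}\n\n"
        f"Following the same structure, write an outline for a standard on: {topic}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def expand_section(topic: str, heading: str, retrieve) -> str:
    """Stage 2: expand one outline heading into section text, grounded on
    clauses retrieved from existing standards (retrieval-augmented generation).
    `retrieve` is a hypothetical search function over the standards corpus."""
    evidence = "\n".join(retrieve(f"{topic} {heading}", top_k=5))
    prompt = (
        f"Reference clauses from existing standards:\n{evidence}\n\n"
        f"Using the references above, write the section '{heading}' of a "
        f"standard on '{topic}' in formal normative language."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

Under these assumptions, a caller would first invoke generate_outline(), split the returned outline into headings, and call expand_section() for each heading, concatenating the results into the complete document.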
刘哲泽, 郑楠, 张宁. 基于大语言模型的标准化文件生成方法研究[J]. 复杂系统与复杂性科学, 2025, 22(2): 45-54.
LIU Zheze, ZHENG Nan, ZHANG Ning. Standardizing Document Generation Based on Large Language Models[J]. Complex Systems and Complexity Science, 2025, 22(2): 45-54.
[1] 中华人民共和国国务院. 中共中央 国务院印发《国家标准化发展纲要》[EB/OL]. [2025-03-28]. https://www.gov.cn/gongbao/content/2021/content_5647347.htm.
State Council of the People's Republic of China. The Central Committee of the CPC and the State Council print and issue the Outlines for the Development of National Standardization[EB/OL]. [2025-03-28]. https://www.gov.cn/gongbao/content/2021/content_5647347.htm.
[2] MI L, LI C R, DU P, et al. Construction and application of an automatic document generation model[C]. 2018 26th International Conference on Geoinformatics. Kunming, China, 2018: 1-6.
[3] 李若晨, 肖人彬. 基于改进狼群算法优化LSTM网络的舆情演化预测[J]. 复杂系统与复杂性科学, 2024, 21(1): 1-10.
LI R C, XIAO R B. Public opinion evolution prediction based on LSTM network optimized by an improved wolf pack algorithm[J]. Complex Systems and Complexity Science, 2024, 21(1): 1-10.
[4] 李炎, 李宪, 杨明业, 等. 基于概率优化的神经网络模型组合算法[J]. 复杂系统与复杂性科学, 2022, 19(3): 104-109.
LI Y, LI X, YANG M Y, et al. Neural network model combination algorithm based on probability optimization[J]. Complex Systems and Complexity Science, 2022, 19(3): 104-109.
[5] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]. Advances in Neural Information Processing Systems. Long Beach, California, USA: Curran Associates, Inc., 2017: 5998-6008.
[6] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2025-03-28]. https://arxiv.org/abs/1810.04805.
[7] SHAO Y J, JIANG Y C, KANELL T, et al. Assisting in writing wikipedia-like articles from scratch with large language models[C]. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Mexico City, Mexico: Association for Computational Linguistics, 2024, 1: 6252-6278.
[8] LIN C H, CHENG P J. Legal documents drafting with fine-tuned pre-trained large language model[EB/OL]. [2025-03-28]. https://arxiv.org/abs/2406.04202.
[9] FAN A, GARDENT C. Generating biographies on wikipedia: the impact of gender bias on the retrieval-based generation of women biographies[C]. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Dublin, Ireland: Association for Computational Linguistics, 2022, 1: 8561-8576.
[10] HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[C]. Proceedings of the 36th International Conference on Machine Learning, ICML 2019. Long Beach, California, USA, 2019: 2790-2799.
[11] HU E J, SHEN Y L, WALLIS P, et al. LoRA: low-rank adaptation of large language models[C]. The Tenth International Conference on Learning Representations, ICLR 2022. Virtual Event, 2022: 1-13.
[12] 沈佳妮, 曹剑峰, 殷亦超, 等. 基于大模型构建卫生标准文档规范性质控系统的研究[J]. 中国卫生信息管理杂志, 2023, 20(6): 875-880, 896.
SHEN J N, CAO J F, YIN Y C, et al. Study on construction of a normative quality control system for health standard document based on large models[J]. Chinese Journal of Health Informatics and Management, 2023, 20(6): 875-880, 896.
[13] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]. Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., 2020: 1877-1901.
[14] LEWIS P, PEREZ E, PIKTUS A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks[C]. Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Curran Associates, Inc., 2020: 9459-9474.
[15] BALEPUR N, HUANG J, CHANG K. Expository text generation: imitate, retrieve, paraphrase[C]. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2023: 11896-11919.
[16] SEMNANI S, YAO V, ZHANG H, et al. WikiChat: stopping the hallucination of large language model chatbots by few-shot grounding on wikipedia[C]. Findings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics, 2023: 2387-2413.
[17] MIN S, LYU X, HOLTZMAN A, et al. Rethinking the role of demonstrations: what makes in-context learning work?[C]. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022: 11048-11064.
[18] FRÄNTI P, MARIESCU-ISTODOR R. Soft precision and recall[J]. Pattern Recognition Letters, 2023, 167: 115-121.
[19] LIN C Y. ROUGE: a package for automatic evaluation of summaries[C]. Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, 2004: 74-81.