ZHAO Guangzhe, JIN Ming, QIU Shuang, WANG Xueping, YAN Feihu
1. School of Intelligence Science and Technology, Beijing University of Civil Engineering and Architecture, Beijing 102616, China; 2. Beijing Key Laboratory of Super Intelligent Technology for Urban Architecture, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
Abstract: Human motion generation aims to produce realistic, high-quality human motion. To summarize recent advances in text-driven human motion generation, this paper extensively surveys the relevant research and literature and systematically reviews the development and current state of the text-driven human motion generation task. It classifies the generative models involved, comprehensively summarizes the methods associated with the task, and further analyzes progress on key technical issues. It also summarizes the commonly used datasets and evaluation methods, and discusses the open problems and possible future research directions in this field.
ZHAO Guangzhe, JIN Ming, QIU Shuang, WANG Xueping, YAN Feihu. A Survey of Text-driven Human Motion Generation[J]. Complex Systems and Complexity Science, 2025, 22(2): 64-72.