ZHAO Guangzhe, JIN Ming, QIU Shuang, WANG Xueping, YAN Feihu
1. School of Intelligence Science and Technology, Beijing University of Civil Engineering and Architecture, Beijing 102616, China; 2. Beijing Key Laboratory of Super Intelligent Technology for Urban Architecture, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
Abstract: Human motion generation aims to produce realistic, high-quality human motion. To summarize recent advances in text-driven human motion generation, this paper extensively surveys the relevant research and literature and systematically reviews the development and current state of the text-driven human motion generation task. It classifies the generative models involved, comprehensively summarizes the methods associated with the task, and further analyzes progress on key technical issues. It also summarizes the commonly used datasets and evaluation methods, and discusses the open problems and possible future research directions in this field.
ZHAO Guangzhe, JIN Ming, QIU Shuang, WANG Xueping, YAN Feihu. A Survey of Text-driven Human Motion Generation[J]. Complex Systems and Complexity Science, 2025, 22(2): 64-72.