Emotional Speech-driven 3D Body Animation Via Disentangled Latent Diffusion

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Abstract
Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content, and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de.
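To make the described pipeline concrete, below is a minimal PyTorch sketch of the idea in the abstract: a speech encoder that maps audio features to separate content, emotion, and style latents, and a denoiser for a latent diffusion model conditioned on those latents. All module names, dimensions, and the GRU/MLP architecture here are hypothetical illustrations, not the authors' implementation; the actual AMUSE models are given in the paper and code release.

```python
# Hypothetical sketch of speech disentanglement + conditioned latent diffusion.
# Not the AMUSE code: module names, dimensions, and architectures are assumptions.
import torch
import torch.nn as nn


class DisentangledSpeechEncoder(nn.Module):
    """Maps audio features to content (per-frame), emotion, and style latents."""

    def __init__(self, audio_dim=128, d_content=256, d_emotion=64, d_style=64):
        super().__init__()
        self.backbone = nn.GRU(audio_dim, 256, batch_first=True)
        self.to_content = nn.Linear(256, d_content)
        self.to_emotion = nn.Linear(256, d_emotion)
        self.to_style = nn.Linear(256, d_style)

    def forward(self, audio_feats):
        # audio_feats: (batch, time, audio_dim)
        h, _ = self.backbone(audio_feats)
        pooled = h.mean(dim=1)               # utterance-level summary
        content = self.to_content(h)         # per-frame content latent
        emotion = self.to_emotion(pooled)    # utterance-level emotion latent
        style = self.to_style(pooled)        # utterance-level style latent
        return content, emotion, style


class GestureDenoiser(nn.Module):
    """Predicts the noise added to a gesture latent, conditioned on the three speech latents."""

    def __init__(self, d_motion=256, d_content=256, d_emotion=64, d_style=64):
        super().__init__()
        d_cond = d_content + d_emotion + d_style + 1  # +1 for the diffusion step
        self.net = nn.Sequential(
            nn.Linear(d_motion + d_cond, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, d_motion),
        )

    def forward(self, noisy_motion, t, content, emotion, style):
        # noisy_motion: (batch, time, d_motion); t: (batch,) diffusion step
        T = noisy_motion.shape[1]
        cond = torch.cat(
            [content,
             emotion.unsqueeze(1).expand(-1, T, -1),
             style.unsqueeze(1).expand(-1, T, -1),
             t.float().view(-1, 1, 1).expand(-1, T, 1)],
            dim=-1)
        return self.net(torch.cat([noisy_motion, cond], dim=-1))


# Emotion/style transfer at inference: content from driving speech A,
# emotion and style from a reference speech B (single denoising step shown;
# a full sampler would iterate the reverse-diffusion loop).
enc, denoiser = DisentangledSpeechEncoder(), GestureDenoiser()
speech_a = torch.randn(1, 120, 128)          # driving speech features
speech_b = torch.randn(1, 120, 128)          # reference speech features
content_a, _, _ = enc(speech_a)
_, emotion_b, style_b = enc(speech_b)
x_t = torch.randn(1, 120, 256)               # noisy gesture latent
eps = denoiser(x_t, torch.tensor([999]), content_a, emotion_b, style_b)
```

Under this reading, sampling different initial noise while keeping the same latents yields the gesture variations with unchanged emotional expressivity that the abstract mentions.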
Keywords
latent diffusion models, emotional speech-driven gestures, audio emotion disentanglement, gestures, smplx