From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Abstract
We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands. The key to our method lies in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset are available online.
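To illustrate the two-stage idea the abstract describes (a VQ codebook supplying diverse coarse motion, followed by diffusion-based refinement for high-frequency detail), the following minimal PyTorch sketch shows one way such a pipeline could be wired together. It is not the authors' implementation: all module names, tensor shapes (e.g. a 104-dimensional pose vector), and the toy fixed-step denoising loop are hypothetical placeholders.

```python
# Hypothetical sketch of "VQ for diversity, diffusion for detail".
import torch
import torch.nn as nn

class CoarseVQSampler(nn.Module):
    """Maps audio features to logits over a learned codebook of coarse poses;
    sampling from the categorical distribution yields diverse guide motion."""
    def __init__(self, audio_dim=128, codebook_size=256, pose_dim=104):
        super().__init__()
        self.to_logits = nn.Linear(audio_dim, codebook_size)
        self.codebook = nn.Embedding(codebook_size, pose_dim)

    def forward(self, audio_feats):                   # (B, T, audio_dim)
        logits = self.to_logits(audio_feats)          # (B, T, K)
        idx = torch.distributions.Categorical(logits=logits).sample()
        return self.codebook(idx)                     # (B, T, pose_dim)

class DiffusionRefiner(nn.Module):
    """Toy denoiser: predicts a correction given a noisy pose, the coarse
    VQ guide, and the audio features, adding high-frequency detail."""
    def __init__(self, audio_dim=128, pose_dim=104, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim * 2 + audio_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, noisy_pose, guide, audio_feats):
        return self.net(torch.cat([noisy_pose, guide, audio_feats], dim=-1))

@torch.no_grad()
def generate(audio_feats, sampler, refiner, steps=8):
    guide = sampler(audio_feats)                      # coarse, diverse motion
    x = torch.randn_like(guide)                       # start from noise
    for _ in range(steps):                            # crude reverse process
        x = x - 0.5 * refiner(x, guide, audio_feats)  # denoise toward a pose
    return x                                          # refined motion

audio = torch.randn(1, 32, 128)                       # 32 frames of audio features
motion = generate(audio, CoarseVQSampler(), DiffusionRefiner())
print(motion.shape)                                   # torch.Size([1, 32, 104])
```

Sampling codebook indices (rather than taking an argmax) is what gives multiple plausible motions per audio clip; the diffusion stage then restores detail that quantization discards.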
Keywords
gestures, generative motion, multimodal, face, body, hands