Separating the "chirp" from the "chat": Self-supervised Visual Grounding of Sound and Language

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Abstract
We present DenseAV, a novel dual-encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sounds. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets, we show that DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV also outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project page: https://aka.ms/denseav
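The abstract's key mechanism is an aggregation operator that scores an image-audio pair by comparing dense (per-location, per-time-step) features rather than pooled global vectors. The exact formulation is not given here, so the PyTorch sketch below is a hedged illustration under assumed shapes: channels are split into heads, an inner-product similarity volume is computed per head, and a max over space, time, and heads yields a clip-level score fed to a symmetric InfoNCE loss. The function names, tensor shapes, and the max aggregation are this sketch's assumptions, not DenseAV's published implementation.

```python
import torch
import torch.nn.functional as F

def dense_av_similarity(visual, audio, num_heads=2):
    """Clip-level similarity from dense features (illustrative sketch).

    visual: (B, C, H, W) dense image features
    audio:  (B, C, T)    dense audio features
    Splits channels into heads, builds a per-head similarity volume
    between every image location and audio time step, then
    max-aggregates over space, time, and heads to score each pair.
    """
    B, C, H, W = visual.shape
    c = C // num_heads
    v = visual.reshape(B, num_heads, c, H * W)   # (B, K, c, HW)
    a = audio.reshape(B, num_heads, c, -1)       # (B, K, c, T)
    # Pairwise similarity volume across the batch:
    # sim[i, j, k, p, t] = <v_i[k, :, p], a_j[k, :, t]>
    sim = torch.einsum('ikcp,jkct->ijkpt', v, a)
    # Collapse the dense volume into one score per (image, audio) pair.
    return sim.amax(dim=(2, 3, 4))               # (B, B)

def symmetric_infonce(scores, temperature=0.07):
    """Standard symmetric InfoNCE over pairwise clip scores."""
    logits = scores / temperature
    targets = torch.arange(scores.size(0), device=scores.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Because the loss is computed from the full similarity volume, the gradient reaches individual image locations and audio time steps, which is one way a model can acquire localization ability without explicit supervision; a globally pooled baseline would collapse the volume before comparison and lose that signal.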
Keywords
audio-visual, multimodal, vision, language, sound, audio, contrastive learning, unsupervised learning, visual grounding, semantic segmentation, object discovery