A Generative Approach to Audio-Visual Generalized Zero-Shot Learning: Combining Contrastive and Discriminative Techniques

Qichen Zheng, Jie Hong, Moshiur Farazi

IJCNN (2023)

Abstract
Audio-visual generalized zero-shot learning (AV-GZSL) for video classification is a task where the model learns to identify unseen video classes from multimodal audio-visual inputs. This combines two equally challenging tasks: performing video classification from two different input modalities, and doing so in a zero-shot setting. The natural alignment between the audio and visual modalities is the key to addressing this relatively unexplored task. The predominant approach in AV-GZSL has been to learn better cross-modal attention between the two input domains and to leverage large language pretraining. However, better attention and pretraining are hampered by a semantic gap between the embeddings of the different modalities, which calls for a more diverse and less sparse joint embedding space. To overcome this, we propose an approach complementary to the existing research direction: we simulate unseen audio-visual features using a generative model and regularize it with a combination of contrastive and discriminative losses. To demonstrate the effectiveness of our approach, we benchmark our model on VGGSound-GZSL, ActivityNet-GZSL, and UCF-GZSL, report state-of-the-art performance, and qualitatively show that unseen classes cluster better with our generative approach.
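To make the combined objective concrete, the following is a minimal toy sketch of regularizing a synthesized class feature with both a discriminative loss (softmax cross-entropy against class logits) and an InfoNCE-style contrastive loss (similarity to the correct class embedding against all class embeddings). All feature values, the temperature, and the loss weight are illustrative assumptions, not taken from the paper.

```python
import math

def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def cross_entropy(logits, target):
    # Softmax cross-entropy, computed with the log-sum-exp trick.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive(feature, class_embs, target, temp=0.5):
    # InfoNCE-style loss: temperature-scaled similarity to the target
    # class embedding against all class embeddings.
    sims = [dot(feature, e) / temp for e in class_embs]
    return cross_entropy(sims, target)

# Hypothetical synthesized audio-visual feature and class embeddings.
feat = [0.9, 0.1]
class_embs = [[1.0, 0.0], [0.0, 1.0]]  # embeddings for classes 0 and 1
target = 0

logits = [dot(feat, e) for e in class_embs]  # linear-classifier sketch
disc = cross_entropy(logits, target)         # discriminative term
contr = contrastive(feat, class_embs, target)  # contrastive term
total = disc + 0.5 * contr  # weighted combination (weight is illustrative)
print(disc, contr, total)
```

In a real model the synthesized features, class embeddings, and loss weight would come from the trained generator and the dataset's class semantics; this sketch only shows how the two loss terms combine into one training signal.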
Keywords
Audio-visual multimodal learning, generalized zero-shot learning, video classification, feature synthesis, contrastive learning, discriminative learning