A Generative Approach to Audio-Visual Generalized Zero-Shot Learning: Combining Contrastive and Discriminative Techniques

Qichen Zheng, Jie Hong, Moshiur Farazi

IJCNN (2023)

Abstract
Audio-visual generalized zero-shot learning (AV-GZSL) for video classification is a task where the model learns to identify unseen video classes from multimodal audio-visual inputs. This combines two equally challenging tasks: performing video classification from two different input modalities, and doing so in a zero-shot setting. The natural alignment between the audio and visual modalities is the key to addressing this relatively unexplored task. The predominant approach in AV-GZSL has been to learn better cross-modal attention between the two input domains and to leverage large language pretraining. However, better attention and pretraining are hampered by a semantic gap between the embeddings of the different modalities, which calls for a more diverse and less sparse joint embedding space. To overcome this, we propose an approach complementary to the existing research direction: we simulate unseen audio-visual features using a generative model and regularize it with a combination of contrastive and discriminative losses. To demonstrate the effectiveness of our approach, we benchmark our model on VGGSound-GZSL, ActivityNet-GZSL, and UCF-GZSL, report state-of-the-art performance, and qualitatively show that unseen classes cluster better with our generative approach.
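To make the combined objective concrete, the following is a minimal toy sketch of regularizing a synthesized class feature with both a discriminative loss (softmax cross-entropy against class logits) and an InfoNCE-style contrastive loss (similarity to the correct class embedding against all class embeddings). All feature values, the temperature, and the loss weight are illustrative assumptions, not taken from the paper.

```python
import math

def dot(u, v):
    # Inner product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def cross_entropy(logits, target):
    # Softmax cross-entropy, computed with the log-sum-exp trick.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

def contrastive(feature, class_embs, target, temp=0.5):
    # InfoNCE-style loss: temperature-scaled similarity to the target
    # class embedding against all class embeddings.
    sims = [dot(feature, e) / temp for e in class_embs]
    return cross_entropy(sims, target)

# Hypothetical synthesized audio-visual feature and class embeddings.
feat = [0.9, 0.1]
class_embs = [[1.0, 0.0], [0.0, 1.0]]  # embeddings for classes 0 and 1
target = 0

logits = [dot(feat, e) for e in class_embs]  # linear-classifier sketch
disc = cross_entropy(logits, target)         # discriminative term
contr = contrastive(feat, class_embs, target)  # contrastive term
total = disc + 0.5 * contr  # weighted combination (weight is illustrative)
print(disc, contr, total)
```

In a real model the synthesized features, class embeddings, and loss weight would come from the trained generator and the dataset's class semantics; this sketch only shows how the two loss terms combine into one training signal.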
Keywords
Audio-visual multimodal learning, generalized zero-shot learning, video classification, feature synthesis, contrastive learning, discriminative learning