ViTO: Vision Transformer Optimization Via Knowledge Distillation on Decoders
2024 IEEE International Conference on Image Processing (ICIP)
Abstract
In this paper, we propose ViTO, a novel knowledge distillation strategy that converts a CNN model into a transformer-based counterpart, incorporating the advantages of transformers while retaining or improving the CNN's inductive bias. Our approach is based on a two-level transformer architecture: an inner model that learns visual representations, and an outer model that matches the teacher's predictions through autoregression. Specifically, given an image in a batch, the outer model classifies the image using not only the image's visual features but also the predictions it has made on images previously seen within the same batch. This strategy allows the transformer to estimate self- and cross-attention across all images in the input batch, learning intra-class and inter-class correlations autoregressively. We experimentally validate ViTO on several standard benchmarks, obtaining better performance than existing knowledge distillation strategies on transformers. Furthermore, our distilled transformer-based model shows better robustness properties than standard vision transformers, demonstrating the effectiveness of our proposed distillation strategy.
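The autoregressive mechanism described above can be illustrated with a toy sketch: each image in the batch attends only to itself and to images that came earlier in the batch order, analogous to a causal mask in sequence models. This is a minimal, hypothetical illustration in NumPy of the masking idea, not the authors' implementation; the feature dimensions, function name, and single-head dot-product attention are assumptions for clarity.

```python
import numpy as np

def causal_batch_attention(feats):
    """Toy sketch of autoregressive attention over a batch.

    feats: (B, D) array of visual features, e.g. from a hypothetical
    inner model. Image i may attend only to images 0..i (causal mask
    over the batch order), mimicking the autoregressive outer model.
    """
    B, D = feats.shape
    scores = feats @ feats.T / np.sqrt(D)        # (B, B) dot-product scores
    mask = np.tril(np.ones((B, B), dtype=bool))  # lower-triangular causal mask
    scores = np.where(mask, scores, -np.inf)     # block attention to later images
    # numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feats                       # (B, D) contextualized features

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8))
out = causal_batch_attention(feats)
# The first image can attend only to itself, so its output equals its input.
assert np.allclose(out[0], feats[0])
```

Because of the causal mask, the first image's contextualized feature is unchanged, while later images mix in information from earlier ones, which is the intra-batch correlation the outer model exploits.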
Keywords
Inductive bias, Autoregression, Sequence models