MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, Vol. 38, No. 14 (2024)

Abstract
The application of mixture-of-experts (MoE) is gaining popularity due to its ability to improve model performance. In an MoE structure, the gate layer plays a significant role in distinguishing and routing input features to different experts, which enables each expert to specialize in its corresponding sub-task. However, the gate's routing mechanism also gives rise to a narrow vision: each individual expert fails to use more samples when learning its allocated sub-task, which in turn limits the MoE from further improving its generalization ability. To effectively address this, we propose a method called Mixture-of-Distilled-Expert (MoDE), which applies moderate mutual distillation among the experts, enabling each expert to pick up features learned by the other experts and gain a more accurate perception of its originally allocated sub-task. We conduct extensive experiments on tabular, NLP, and CV datasets, which demonstrate MoDE's effectiveness, universality, and robustness. Furthermore, we develop a parallel study by innovatively constructing "expert probing" to experimentally show why MoDE works: moderate knowledge distillation improves each individual expert's test performance on its assigned sub-task, leading to an improvement in the MoE's overall performance.
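The abstract describes an MoE whose gate routes inputs to specialized experts, plus an auxiliary mutual-distillation term among the experts. Below is a minimal sketch of that idea; the soft gate, the MSE-based distillation loss, the peer-averaging, and the weighting coefficient `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a mixture-of-experts with mutual distillation among experts
# (MoDE-style). The distillation form (pairwise MSE, detached targets) and
# the weight `alpha` are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfDistilledExperts(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, num_experts=4, alpha=0.1):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, out_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(in_dim, num_experts)  # routing weights over experts
        self.alpha = alpha                          # strength of mutual distillation

    def forward(self, x):
        gate_w = F.softmax(self.gate(x), dim=-1)                        # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, D)
        mixed = (gate_w.unsqueeze(-1) * expert_out).sum(dim=1)          # (B, D)

        # Mutual distillation: gently pull each expert's output toward its
        # peers' (detached) outputs, so experts share features while still
        # specializing on their routed sub-tasks.
        num_experts = expert_out.size(1)
        distill = 0.0
        for i in range(num_experts):
            for j in range(num_experts):
                if i != j:
                    distill = distill + F.mse_loss(expert_out[:, i],
                                                   expert_out[:, j].detach())
        distill = distill / (num_experts * (num_experts - 1))
        return mixed, self.alpha * distill
```

During training, the total objective would combine the usual task loss on the mixed output with the returned distillation term, e.g. `loss = criterion(mixed, target) + distill_term`.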
Keywords
Expert Finding,Expertise Identification,Knowledge Sharing,Channel-Aware Fusion