Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

arXiv (Cornell University) (2024)

Abstract
The quadratic complexity and weak length extrapolation of Transformers limit their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), timestep normalization layer, normalized attention mechanism and pre-norm with two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than Transformer at the scale of 7 billion parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing mid-way between Llama2-7B (1.75) and 13B (1.67). Code: https://github.com/XuezheMax/megalodon
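
The abstract names a complex exponential moving average (CEMA) without defining it. As a rough illustration only, the sketch below runs a generic scalar EMA whose decay factor is complex-valued. The function name `complex_ema`, the parameters `alpha` and `theta`, and the update rule itself are assumptions made for this example and are not the paper's parameterization.

```python
import cmath


def complex_ema(xs, alpha=0.5, theta=0.3):
    """Scalar EMA with a complex-valued decay (illustrative assumption).

    alpha: hypothetical input weight in (0, 1].
    theta: hypothetical rotation angle in radians.
    """
    # Decay magnitude is (1 - alpha) < 1, so the state stays bounded;
    # the phase theta lets the state oscillate instead of only decaying.
    decay = (1.0 - alpha) * cmath.exp(1j * theta)
    h = 0j  # one fixed-size state, regardless of sequence length
    ys = []
    for x in xs:
        h = alpha * x + decay * h  # recurrent update, O(1) work per step
        ys.append(h.real)          # read out the real part
    return ys


if __name__ == "__main__":
    # With theta = 0 this reduces to an ordinary real-valued EMA.
    print(complex_ema([1.0, 1.0, 1.0], alpha=0.5, theta=0.0))
```

The point of the sketch is the shape of the computation: the state `h` has constant size and each step costs constant work, which is the general property that lets recurrence-style components process long sequences in linear time.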
Keywords
Language Modeling, Multilingual Neural Machine Translation, Text Localization, Scene Text Recognition, Document Image Analysis