Published: 2025/10/23 6:33:17

Ring-linear makes LLMs blazing fast 🚀 Long texts? No problem!

Super-short summary: They've found a clever new way to build AI that can breeze through long texts 💖

✨ Gyaru-Style Sparkle Points ✨
● Inference cost is just 1/10 that of a 32-billion-parameter dense model (the usual softmax-attention setup)! Easy on the wallet 💸
● It's a hybrid that mixes linear attention with softmax attention and takes the best of both (see the sketch below)! Way too smart 🤩
● It works for agent systems, code generation, and lots more, so the future looks bright 👌
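The hybrid in the second point is easy to picture: stack decoder layers so that most of them use cheap linear attention and only an occasional layer falls back to full softmax attention. Here is a minimal PyTorch sketch of that idea; the 1-in-4 interleaving ratio, module names, and attention-only residual layers are illustrative assumptions, not Ring-linear's actual configuration.

```python
# Minimal sketch of a hybrid attention stack: most layers use linear
# attention; every `softmax_every`-th layer uses standard softmax attention.
# The ratio and module details are illustrative, not Ring-linear's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftmaxAttention(nn.Module):
    """Standard causal softmax attention: cost grows quadratically with length."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out(y.transpose(1, 2).reshape(b, n, d))


class LinearAttention(nn.Module):
    """Kernelized linear attention (non-causal parallel form for brevity):
    a fixed-size (d_k x d_v) state replaces the full token-to-token matrix."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1              # positive feature map
        kv = torch.einsum("bhnk,bhnv->bhkv", k, v)     # summary state, independent of n
        z = 1.0 / (torch.einsum("bhnk,bhk->bhn", q, k.sum(dim=2)) + 1e-6)
        y = torch.einsum("bhnk,bhkv,bhn->bhnv", q, kv, z)
        return self.out(y.transpose(1, 2).reshape(b, n, d))


class HybridStack(nn.Module):
    """Interleave linear-attention layers with occasional softmax-attention layers."""

    def __init__(self, dim: int, depth: int, softmax_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [SoftmaxAttention(dim) if (i + 1) % softmax_every == 0 else LinearAttention(dim)
             for i in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = x + layer(x)   # residual connection; MLP blocks omitted for brevity
        return x


if __name__ == "__main__":
    x = torch.randn(1, 128, 256)
    print(HybridStack(dim=256, depth=8)(x).shape)  # torch.Size([1, 128, 256])
```

Tuning a knob like `softmax_every` corresponds to the "ratio between different attention mechanisms" that the report says was explored systematically.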

Here comes the detailed breakdown~!

Background: Recent LLMs are amazing, but handling long texts made the compute blow up, which was rough 💦 Softmax attention was the bottleneck, but now Ring-linear has arrived and looks set to solve that!
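Why was softmax attention the bottleneck? It compares every token with every other token, so compute grows with the square of the context length, while linear attention carries a fixed-size summary state and grows only linearly. A back-of-the-envelope count per attention head, where the sequence length and head size are arbitrary example values rather than the paper's numbers:

```python
# Back-of-the-envelope operation counts for one attention head.
# seq_len and head_dim are arbitrary example values, not the paper's numbers.
seq_len = 128_000   # tokens in a long context
head_dim = 128      # per-head hidden size

softmax_ops = seq_len ** 2 * head_dim   # every token attends to every other token
linear_ops = seq_len * head_dim ** 2    # fixed-size (d x d) state updated per token

print(f"softmax attention: ~{softmax_ops:.1e} ops")
print(f"linear attention:  ~{linear_ops:.1e} ops")
print(f"ratio: ~{softmax_ops / linear_ops:.0f}x")  # = seq_len / head_dim = 1000x here
```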

The rest is in the 「らくらく論文」 app

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

Ling Team / Bin Han / Caizhi Tang / Chen Liang / Donghao Zhang / Fan Yuan / Feng Zhu / Jie Gao / Jingyu Hu / Longfei Li / Meng Li / Mingyang Zhang / Peijie Jiang / Peng Jiao / Qian Zhao / Qingyuan Yang / Wenbo Shen / Xinxing Yang / Yalin Zhang / Yankun Ren / Yao Zhao / Yibo Cao / Yixuan Sun / Yue Zhang / Yuchen Fang / Zibin Lin / Zixuan Cheng / Jun Zhou

In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
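Much of the I/O saving the abstract mentions comes from the KV cache: every softmax-attention layer must read keys and values for all past tokens at each decoding step, whereas a linear-attention layer keeps only a fixed-size state. A rough per-sequence memory estimate under made-up dimensions; these layer counts and sizes are illustrative, not the Ring-linear models' real configuration:

```python
# Rough per-sequence KV-cache memory for an all-softmax stack vs. a hybrid
# stack. All dimensions are illustrative, not the Ring-linear models' config.
seq_len  = 128_000   # context length in tokens
n_layers = 32        # decoder layers
n_heads  = 16
head_dim = 128
elem_b   = 2         # bytes per element (bf16/fp16)

def softmax_kv_bytes(layers: int) -> int:
    # K and V are cached for every past token in every softmax-attention layer.
    return 2 * layers * n_heads * head_dim * seq_len * elem_b

def linear_state_bytes(layers: int) -> int:
    # One (head_dim x head_dim) state per head, independent of context length.
    return layers * n_heads * head_dim * head_dim * elem_b

all_softmax = softmax_kv_bytes(n_layers)
hybrid = softmax_kv_bytes(n_layers // 4) + linear_state_bytes(n_layers - n_layers // 4)

print(f"all-softmax KV cache : {all_softmax / 2**30:.2f} GiB")  # ~31 GiB
print(f"hybrid (1/4 softmax) : {hybrid / 2**30:.2f} GiB")       # ~8 GiB
```

Swapping three quarters of the layers to linear attention shrinks the per-sequence cache from tens of GiB to a few GiB in this toy setting, which illustrates the kind of long-context saving the abstract is pointing at.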

cs / cs.LG / cs.AI / cs.CL