Supercharging LLM training with SonicMoE 🚀✨

Published: 2025/12/16 4:39:10

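Title & ultra-short summary: Supercharge LLM training with SonicMoE! 🚀 Lower costs while keeping accuracy!
Gal-style sparkle points ✨
● LLM (large language model) training gets way faster!
● Cuts wasted work on the GPU (graphics processing unit) for better efficiency!
● Not just cheaper: it keeps accuracy too, which is unbeatable!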
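Detailed explanation
- Background: LLMs keep getting smarter, but training them is crazy expensive 💸 SonicMoE is like a magic spell 🧙‍♀️ for speeding up the training of MoE models, the kind of architecture that lets you add more parameters (the learned weights of the model) while keeping compute cost down!
- Method: To train "Mixture of Experts (MoE)" models more efficiently, the authors rethought how memory is used (caching as few activations as possible for the backward pass) and wrote GPU kernels that overlap memory IO with computation! They also got smarter about tokens (the small chunks that text is split into) with a "token rounding" method that cuts compute wasted on padding 💡
- Results: Activation memory drops by 45%! On Hopper GPUs, compute throughput for a fine-grained 7B-parameter MoE is 1.86x higher than ScatterMoE's BF16 MoE kernel! And even at high sparsity (only a few of the many experts are active per token), token rounding adds another 1.16x kernel speedup while keeping similar downstream performance ✨
- Significance (here's the amazing ♡ part): It lowers the barrier to LLM development, so more companies should be able to put AI to work! Excited for a future where AI chatbots, content generation, and all sorts of services level up 💖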
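
Want a feel for why padding hurts? Per the paper's abstract, Grouped GEMM kernels waste compute on padding, and "token rounding" is the fix. Here is a minimal Python sketch of that idea, not the authors' implementation: the tile size of 128, the function names, and the round-to-nearest rule are all assumptions for illustration, and the sketch only tracks token counts per expert, not which tokens get dropped or added.

```python
import math

TILE_M = 128  # assumed tile height; the real kernel's tile size may differ


def padded_rows(tokens_per_expert, tile_m=TILE_M):
    """Rows each expert's GEMM processes when its count is padded up to a tile multiple."""
    return [math.ceil(n / tile_m) * tile_m for n in tokens_per_expert]


def wasted_fraction(tokens_per_expert, tile_m=TILE_M):
    """Share of processed rows that are pure padding."""
    padded = sum(padded_rows(tokens_per_expert, tile_m))
    useful = sum(tokens_per_expert)
    return (padded - useful) / padded if padded else 0.0


def round_to_tile(tokens_per_expert, tile_m=TILE_M):
    """Hypothetical rounding rule: snap each count to the nearest tile multiple."""
    return [(n + tile_m // 2) // tile_m * tile_m for n in tokens_per_expert]


if __name__ == "__main__":
    counts = [130, 250, 40, 128]  # made-up tokens routed to 4 experts
    print("padding waste before rounding:", wasted_fraction(counts))
    rounded = round_to_tile(counts)
    print("rounded counts:", rounded)
    print("padding waste after rounding:", wasted_fraction(rounded))
```

Once every expert's count is a tile multiple, no tile is partly empty, so the padding waste goes to zero. The catch is that rounding changes which tokens each expert sees, which is why the paper checks that downstream performance stays similar.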
Real-world use ideas 💡
- AI chatbots might get smarter and snappier!
- Tools that make writing blazing fast might show up! Papers and reports could become a breeze ♪

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Wentao Guo / Mayank Mishra / Xinle Cheng / Ion Stoica / Tri Dao

Mixture of Experts (MoE) models have emerged as the de facto architecture for scaling up language models without significantly increasing the computational cost. Recent MoE models demonstrate a clear trend towards high expert granularity (smaller expert intermediate dimension) and higher sparsity (constant number of activated experts with higher number of total experts), which improve model quality per FLOP. However, fine-grained MoEs suffer from increased activation memory footprint and reduced hardware efficiency due to higher IO costs, while sparser MoEs suffer from wasted computations due to padding in Grouped GEMM kernels. In response, we propose a memory-efficient algorithm to compute the forward and backward passes of MoEs with minimal activation caching for the backward pass. We also design GPU kernels that overlap memory IO with computation, benefiting all MoE architectures. Finally, we propose a novel "token rounding" method that minimizes the wasted compute due to padding in Grouped GEMM kernels. As a result, our method SonicMoE reduces activation memory by 45% and achieves a 1.86x compute throughput improvement on Hopper GPUs compared to ScatterMoE's BF16 MoE kernel for a fine-grained 7B MoE. Concretely, SonicMoE on 64 H100s achieves a training throughput of 213 billion tokens per day, comparable to ScatterMoE's 225 billion tokens per day on 96 H100s, for a 7B MoE model trained with FSDP-2 using the lm-engine codebase. Under high MoE sparsity settings, our tile-aware token rounding algorithm yields an additional 1.16x speedup on kernel execution time compared to vanilla top-$K$ routing while maintaining similar downstream performance. We open-source all our kernels to enable faster MoE model training.
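
The end-to-end figures quoted in the abstract can be put on a per-GPU basis with a little arithmetic. The sketch below uses only the four numbers stated above; the helper name `per_gpu_throughput` is made up for illustration. The resulting ratio describes whole-training throughput per GPU, so it is not expected to match the 1.86x kernel-level figure.

```python
def per_gpu_throughput(tokens_per_day_billions, num_gpus):
    """Billions of tokens per day handled by each GPU."""
    return tokens_per_day_billions / num_gpus


# Figures taken from the abstract: 7B MoE training with FSDP-2.
sonic = per_gpu_throughput(213, 64)    # SonicMoE on 64 H100s
scatter = per_gpu_throughput(225, 96)  # ScatterMoE on 96 H100s
ratio = sonic / scatter

if __name__ == "__main__":
    print(f"SonicMoE:   {sonic:.3f} B tokens/day/GPU")
    print(f"ScatterMoE: {scatter:.3f} B tokens/day/GPU")
    print(f"Per-GPU ratio: {ratio:.2f}x")
```

Under these numbers, SonicMoE processes roughly 1.42 times as many tokens per GPU per day while using two thirds as many GPUs.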

cs / cs.LG / cs.AI
