Ultra-short summary: it's all about working your GPUs to the max to make MoE model training crazy fast!
✨ Gal-style sparkle points ✨
● Cuts out GPU waste for blazing-fast training ✨
● Slashes latency (delay) big time!
● Puts task locality to work so it's even smarter 💡
Here comes the detailed breakdown~!
Background: LLMs (large language models) just keep getting bigger and bigger, right? 💰 But training a huge model takes crazy amounts of time and money 💦 That's where MoE (Mixture-of-Experts) models come in! ✨ The catch: existing MoE implementations couldn't fully tap the GPU's performance… 😢
The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from low GPU utilization, significant latency overhead, and a fundamental inability to leverage task locality, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a single persistent GPU kernel. FlashMoE enables fine-grained pipelining of the dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashMoE replaces bulk-synchronous collectives with one-sided, device-initiated, inter-GPU (R)DMA transfers, thereby improving payload efficiency by avoiding the bloated or redundant network payloads of sparsely activated layers. When evaluated on an 8-H100 GPU node with MoE models comprising up to 128 experts and 16K-token sequences, FlashMoE achieves up to 9x higher GPU utilization, 6x lower latency, 5.7x higher throughput, and 4x better overlap efficiency than state-of-the-art baselines, despite using FP32 while the baselines use FP16. FlashMoE shows that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML. We provide code at https://github.com/osayamenja/FlashMoE.
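To make the "single persistent kernel" idea concrete, here is a minimal single-GPU CUDA sketch. It is not FlashMoE's actual code: every name (`Task`, `persistent_moe`, the scale-by-expert stand-in for the expert FFN) is hypothetical, the single-block launch stands in for FlashMoE's cross-block scheduling, and plain memory copies stand in for the device-initiated (R)DMA transfers. What it does show is the core pattern from the abstract: the host launches exactly one kernel, and that kernel pulls dispatch, expert-compute, and combine tasks from a device-resident queue, so no phase ever returns control to the CPU.

```cuda
// persistent_moe_sketch.cu -- illustrative only; not FlashMoE's API.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

constexpr int kHidden = 64;                       // toy token width

enum TaskKind { kDispatch = 0, kCompute = 1, kCombine = 2, kStop = 3 };
struct Task { int kind, expert, token; };

// One persistent block polls a device-resident queue and runs the
// dispatch / expert-FFN / combine stages without host intervention.
__global__ void persistent_moe(const Task* queue, int num_tasks, int* head,
                               const float* tokens, float* staged,
                               float* expert_out, float* combined) {
  __shared__ Task t;
  for (;;) {
    if (threadIdx.x == 0) {
      int idx = atomicAdd(head, 1);               // claim the next task
      t = (idx < num_tasks) ? queue[idx] : Task{kStop, 0, 0};
    }
    __syncthreads();                              // broadcast the task to the block
    if (t.kind == kStop) return;

    const float* src   = tokens     + t.token * kHidden;
    float*       stage = staged     + t.token * kHidden;
    float*       out   = expert_out + t.token * kHidden;
    for (int i = threadIdx.x; i < kHidden; i += blockDim.x) {
      if (t.kind == kDispatch)      stage[i] = src[i];                          // stand-in for an RDMA put to the expert's GPU
      else if (t.kind == kCompute)  out[i]   = stage[i] * (1.f + t.expert);     // stand-in for the expert FFN GEMMs
      else                          combined[t.token * kHidden + i] += out[i];  // gate-weighted sum omitted
    }
    __syncthreads();                              // finish this task before reusing `t`
  }
}

int main() {
  const int num_tokens = 4, num_tasks = 3 * num_tokens;
  std::vector<Task> h_queue;
  for (int tok = 0; tok < num_tokens; ++tok) {    // dispatch -> compute -> combine per token;
    h_queue.push_back({kDispatch, tok % 2, tok}); // a real scheduler interleaves tiles so the
    h_queue.push_back({kCompute,  tok % 2, tok}); // three phases of different tiles overlap
    h_queue.push_back({kCombine,  tok % 2, tok});
  }
  std::vector<float> h_tokens(num_tokens * kHidden, 1.0f);

  Task* d_queue; int* d_head;
  float *d_tokens, *d_staged, *d_out, *d_combined;
  const size_t bytes = h_tokens.size() * sizeof(float);
  cudaMalloc(&d_queue, num_tasks * sizeof(Task));
  cudaMalloc(&d_head, sizeof(int));
  cudaMalloc(&d_tokens, bytes); cudaMalloc(&d_staged, bytes);
  cudaMalloc(&d_out, bytes);    cudaMalloc(&d_combined, bytes);
  cudaMemcpy(d_queue, h_queue.data(), num_tasks * sizeof(Task), cudaMemcpyHostToDevice);
  cudaMemcpy(d_tokens, h_tokens.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemset(d_head, 0, sizeof(int));
  cudaMemset(d_combined, 0, bytes);

  // Single block => tasks execute in queue order, so per-token dependencies hold.
  persistent_moe<<<1, 128>>>(d_queue, num_tasks, d_head,
                             d_tokens, d_staged, d_out, d_combined);
  cudaDeviceSynchronize();

  float sample;
  cudaMemcpy(&sample, d_combined, sizeof(float), cudaMemcpyDeviceToHost);
  printf("combined[token 0][0] = %f (expected 1.0 for token 0, expert 0)\n", sample);
  return 0;
}
```

In the real operator, the dispatch and combine copies would instead be one-sided, device-initiated (R)DMA puts to peer GPUs, and per-tile dependencies would be tracked with device-side signals rather than queue order, which is what lets the dispatch, compute, and combine phases of different token tiles overlap inside the one resident kernel.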