Published: 2025/12/26 14:16:11

Turbocharging Data! The Magic That Awakens MoE Models 🪄

  1. Ultra-short summary: FUSCO is a technique that makes data transfer lightning-fast, turbocharging MoE models! 🚀

  2. Gyaru-style sparkle points ✨

    • A secret weapon for blazing-fast data transfer! Way better than existing approaches 💖
    • Training and inference for MoE models (models packed with experts) get dramatically faster 💨
    • Even huge models are no sweat! AI services could get way more accessible ✨
  3. Detailed explanation

    • Background: Recent AI models keep getting huge, and the compute is brutal 💦 In particular, slow data movement (shuffling) inside the model is a real headache… 😰
    • Method: FUSCO fuses data transformation with communication! By sending the data-layout information along with the data, it cuts out wasted work and speeds things up 🚀
    • Results: Shuffle time slashed! Both training and inference of MoE models got way faster 🤩
    • Significance: A boost for AI's evolution! High-performance AI could show up in way more places 💖
  4. Real-world use-case ideas 💡

    • Chatbots go turbo! Conversations get smoother, totally divine 😇
    • Image generation in seconds! Getting your dream image instantly? Unbeatable, right? 🫶

Read the rest in the 「らくらく論文」 app

FUSCO: High-Performance Distributed Data Shuffling via Transformation-Communication Fusion

Zhuoran Zhu / Chunyang Zhu / Hao Lin / Xu Fu / Yiming Zhou / Quanlu Zhang / Zhenhua Li / Feng Qian / Chao Yu / Boxun Li / Guohao Dai / Yu Wang

Large-scale Mixture-of-Experts (MoE) models rely on expert parallelism for efficient training and inference, which splits experts across devices and necessitates distributed data shuffling to route each token to its assigned experts. However, existing communication libraries handle this shuffling poorly; its overhead can account for over half of end-to-end runtime. We present FUSCO, an MoE-friendly communication library that achieves efficient and lightweight data shuffling through fused data transformation and communication, based on the key observation that MoE's expert-major data layout conflicts with the device-major layout expected by communication operations. FUSCO captures the fine-grained data layout, which is then interpreted by a pipelined communication engine that performs the required shuffling efficiently along the communication path. Lightweight planning and load-balancing mechanisms complement the engine by eliminating redundant communication and dispersing traffic. Evaluations on representative benchmarks illustrate that FUSCO achieves up to 3.84× and 2.01× speedups over NCCL and DeepEP (the state-of-the-art MoE communication library), respectively. In end-to-end MoE tasks, compared to NCCL and DeepEP, FUSCO reduces training latency by 1.17-1.39× and 1.10-1.19×, and lowers first-token generation latency in inference by 1.09-1.25× and 1.06-1.16×.
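The layout conflict the abstract mentions can be made concrete with a tiny sketch. This is an illustrative toy (sizes, values, and variable names are assumptions, not from the paper): after an all-to-all, a device's receive buffer arrives in device-major order (one chunk per source device, each chunk grouped by local expert), while the expert kernels want expert-major order (all tokens of one expert contiguous across sources). A naive pipeline pays an extra re-layout pass between the two; FUSCO's idea is to fuse that re-layout into the communication itself by carrying fine-grained layout metadata.

```python
# Toy illustration of the expert-major vs. device-major layout conflict.
# Token values encode their expert in the tens digit (expert 1 -> 1x, expert 2 -> 2x);
# the nested structure plays the role of the layout metadata.
chunks = [
    [[10, 11], [20]],   # received from source device 0: expert-1 tokens, expert-2 tokens
    [[12], [21, 22]],   # received from source device 1
]
num_src, num_experts = len(chunks), len(chunks[0])

# Device-major order: how the communication op delivers the buffer.
device_major = [t for src in chunks for group in src for t in group]

# Expert-major order: how the expert kernels want it. In a naive pipeline this
# is a separate transformation pass over the whole buffer.
expert_major = [t for e in range(num_experts)
                  for s in range(num_src)
                  for t in chunks[s][e]]

print(device_major)   # [10, 11, 20, 12, 21, 22]
print(expert_major)   # [10, 11, 12, 20, 21, 22]
```

The same tokens, two incompatible orderings: eliminating the standalone reordering pass is exactly the "transformation-communication fusion" in the paper's title.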

cs / cs.DC