The ultimate gal explainer AI has arrived~! 😎✨
🌟 Gal-style sparkle points ✨
● It uses MXFP8, a further evolution of FP8 (a floating-point way of representing numbers)! 🤔✨
● The scaling (adjustment) happens automatically, so the extra computation goes away and things get way faster! 💖
● If AI training gets faster, all kinds of services might get cheaper to use! 🎉
Time for the detailed rundown~! Let's go!
Background: LLMs (large language models) are trained (i.e., they learn) on an enormous amount of data to become smart AI, and that eats up a ton of time and money 😭💸. So people turned to FP8, a way to speed up the math, but it had a problem: accuracy drops.
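To make that precision problem concrete, here is a minimal sketch (my own illustration, not code from the paper) of plain per-tensor FP8 quantization, simulated with PyTorch's float8_e4m3fn dtype. A single scale has to cover the largest value in the tensor, so one big outlier pushes the small values out of FP8's dynamic range:

```python
import torch

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def fp8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    """Per-tensor FP8 quantize -> dequantize (simulated with torch.float8_e4m3fn)."""
    scale = x.abs().max() / E4M3_MAX                           # one scale for the whole tensor
    x_fp8 = (x / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8.to(torch.float32) * scale

small = torch.full((8,), 1e-3)                                 # small but meaningful activations
print(fp8_roundtrip(small))                                    # recovered almost exactly

with_outlier = torch.cat([small, torch.tensor([1000.0])])
print(fp8_roundtrip(with_outlier)[:8])                         # small values flushed to zero
```

This limited dynamic range is why frameworks fall back on finer-grained per-group scales for activations, which in turn creates the dequantization overhead the paper targets.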
Training large language models with FP8 formats offers significant efficiency gains. However, the reduced numerical precision of FP8 poses challenges for stable and accurate training. Current frameworks preserve training performance using mixed-granularity quantization, i.e., applying per-group quantization for activations and per-tensor/block quantization for weights. While effective, per-group quantization requires scaling along the inner dimension of matrix multiplication, introducing additional dequantization overhead. Moreover, these frameworks often rely on just-in-time scaling to dynamically adjust scaling factors based on the current data distribution. However, this online quantization is inefficient for FP8 training, as it involves multiple memory reads and writes that negate the performance benefits of FP8. To overcome these limitations, we propose MOSS, a novel FP8 training framework that ensures both efficiency and numerical stability. MOSS introduces two key innovations: (1) a two-level microscaling strategy for quantizing sensitive activations, which balances precision and dequantization cost by combining a high-precision global scale with compact, power-of-two local scales; and (2) automatic scaling for weights in linear layers, which eliminates the need for costly max-reduction operations by predicting and adjusting scaling factors during training. Leveraging these techniques, MOSS enables efficient FP8 training of a 7B parameter model, achieving performance comparable to the BF16 baseline while delivering up to 34% higher training throughput.
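As a rough illustration of the first idea (two-level microscaling), here is a sketch based only on the abstract's description, not on the paper's actual code: one high-precision FP32 global scale for the tensor plus a compact power-of-two local scale per group, so dequantization needs just an exponent shift per group and a single FP32 multiply. The group size of 32 and the exact way the scales are derived are assumptions.

```python
import torch

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value

def two_level_quant(x: torch.Tensor, group_size: int = 32):
    """Two-level scaling sketch: FP32 global scale + power-of-two local scales."""
    g = x.reshape(-1, group_size)                      # assumes numel % group_size == 0

    # Level 1: one high-precision global scale shared by the whole tensor.
    global_scale = g.abs().max() / E4M3_MAX

    # Level 2: per-group power-of-two factor, stored compactly as a small exponent.
    group_max = g.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    local_exp = torch.ceil(torch.log2(group_max / (global_scale * E4M3_MAX)))

    q = (g / (global_scale * torch.exp2(local_exp))).clamp(-E4M3_MAX, E4M3_MAX)
    return q.to(torch.float8_e4m3fn), global_scale, local_exp.to(torch.int8)

def two_level_dequant(q, global_scale, local_exp):
    # Exponent shift per group + a single FP32 multiply per tensor.
    return q.to(torch.float32) * torch.exp2(local_exp.to(torch.float32)) * global_scale
```

For the second idea (automatic scaling of weights in linear layers), the abstract only says that scaling factors are predicted and adjusted during training instead of being recomputed with a just-in-time max-reduction. The exponential-moving-average predictor and the 2x headroom below are purely illustrative assumptions:

```python
import torch

class PredictedWeightScale:
    """Keep an FP8 scale for a weight tensor without a max-reduction on the
    critical path; the EMA update rule here is an assumption, not MOSS's."""

    E4M3_MAX = 448.0

    def __init__(self, init_amax: float = 1.0, momentum: float = 0.9, margin: float = 2.0):
        self.amax_ema = init_amax      # running prediction of max|W|
        self.momentum = momentum
        self.margin = margin           # headroom so slow weight drift does not saturate

    def quantize(self, w: torch.Tensor):
        # Use the *predicted* scale: no fresh max-reduction before quantizing.
        scale = (self.margin * self.amax_ema) / self.E4M3_MAX
        q = (w / scale).clamp(-self.E4M3_MAX, self.E4M3_MAX).to(torch.float8_e4m3fn)
        return q, scale

    def update(self, observed_amax: float) -> None:
        # Fold in a max observed as a by-product of another kernel, off the
        # hot path (e.g. during the optimizer's weight update).
        self.amax_ema = self.momentum * self.amax_ema + (1.0 - self.momentum) * observed_amax
```

Since weights change only slowly between optimizer steps, a predicted scale like this rarely saturates, which is the intuition behind dropping the max-reduction and its extra memory traffic from the critical path.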