Ultra-short summary: a way to build an LLM (large language model) on the cheap ☆
✨ Gyaru-style sparkle points ✨
● You can build a smart AI even without pricey GPUs (graphics cards) 💖
● Data clustering (grouping the data) makes training efficient!
● It doesn't just cut costs, it might speed up AI progress too ♪
Here comes the detailed rundown~!
Background: Training (raising) an LLM is expensive 💸 You need high-performance GPUs, and the cloud bills are no joke 😭 So the authors studied a way to train LLMs at low cost! The IT industry wants cost-effective AI-building tech anyway, right?
Training large-scale Mixture-of-Experts (MoE) models typically requires high-memory, high-bandwidth GPUs (e.g., A100), and their high cost has become a major barrier to large-model training. In contrast, affordable hardware is low-cost but constrained by memory capacity and bandwidth, making it unsuitable for direct LLM training. To address this, we propose MoE-DisCo (Mixture-of-Experts with Disentangled Clustering and Coordination), a staged training framework. MoE-DisCo decomposes the MoE model into multiple dense submodels, each consisting of a shared backbone and a single expert, and partitions the training data into subsets using unsupervised clustering. Each submodel is trained independently and in parallel on its assigned data subset using low-cost devices, without any inter-device communication. Subsequently, all experts are integrated into a complete MoE model and fine-tuned globally for a short period on high-memory, high-bandwidth GPUs. Experiments show that our method matches or even surpasses full-parameter training across multiple downstream tasks, in training loss, and in perplexity (PPL), while reducing training cost by 47.6 to 69.5 percent on Qwen1.5-MoE-2.7B and Llama-MoE-3.5B across different datasets.
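To make the data-partitioning step concrete, here is a minimal sketch of the unsupervised clustering idea: document embeddings are grouped with k-means, and each cluster becomes the private training subset of one expert's submodel. This is an illustrative stand-in only; the abstract does not specify the clustering algorithm, the embedding method, or any function names, so everything below (the `kmeans` helper, the toy 2-D "embeddings") is an assumption.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over toy document embeddings: produces k data
    subsets, one per expert (assumed stand-in for the paper's
    unsupervised clustering step)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # random initial centers
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

# Toy 2-D "embeddings" with two well-separated groups of documents.
docs = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assign = kmeans(docs, k=2)

# Each subset would then train one dense submodel (shared backbone +
# one expert) independently on a cheap device, with no cross-device
# communication, before the experts are merged and briefly fine-tuned.
subsets = {c: [i for i, a in enumerate(assign) if a == c] for c in set(assign)}
print(subsets)
```

In a real pipeline the embeddings would come from an encoder and the number of clusters would match the MoE's expert count; the merge-and-fine-tune stage on high-memory GPUs is not shown here.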