Published: 2025/11/8 3:04:11

MoE Offloading Turbocharged with Caching 🚀✨ (for IT Companies)

Ultra-short summary: Make the cache smarter when offloading MoE models, and run LLMs super fast at low cost!

🌟 Gal-style Sparkle Points ✨
● They found a clever way to run MoE models (those super-huge AIs)!
● The plan: cover offloading's weak point (latency) with caching ♪
● Hey IT companies, this could make your AI services way cooler!

Detailed Explanation

Background: Recent LLMs (large language models) are insanely big and powerful! But sometimes they don't fit in GPU memory (the brain's workspace)… 💦 That's where offloading comes in! It moves part of the model to CPU memory (the waiting room). The catch: those transfers take time 😭
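To make the trade-off concrete, here is a minimal sketch (not from the paper) of naive expert offloading: experts sit in CPU memory and are copied to the GPU only when the gating network selects them. The class and parameter names are hypothetical, and a real system would overlap these transfers with compute.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of naive MoE expert offloading (no caching yet):
# experts live in CPU memory and are copied to the GPU only when the
# gating network routes tokens to them, then moved back after use.
class OffloadedMoELayer(nn.Module):
    def __init__(self, num_experts: int, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # kept on the GPU in practice
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, top_k: int = 2) -> torch.Tensor:
        # x: (num_tokens, d_model); the gate decides which experts each token uses.
        weights, idx = self.gate(x).softmax(dim=-1).topk(top_k, dim=-1)
        out = torch.zeros_like(x)
        for e in idx.unique().tolist():
            expert = self.experts[e].to(x.device)   # CPU -> GPU copy: the latency bottleneck
            routed = (idx == e)                     # (num_tokens, top_k) routing mask
            token_mask = routed.any(dim=-1)
            w = (weights * routed).sum(dim=-1, keepdim=True)[token_mask]
            out[token_mask] += w * expert(x[token_mask])
            self.experts[e].to("cpu")               # move back so GPU memory stays small
        return out
```

Paying that CPU-to-GPU copy on every selected expert, for every token batch, is exactly the latency problem that caching and pre-fetching try to avoid.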


In-depth Analysis on Caching and Pre-fetching in Mixture of Experts Offloading

Shuning Lin / Yifan He / Yitong Chen

In today's landscape, Mixture of Experts (MoE) is a crucial architecture used by many of the most advanced models. One of the major challenges of MoE models is that, due to their unique architecture, they usually require much more memory than their dense counterparts, and hence are harder to deploy in environments with limited GPU memory, such as edge devices. MoE offloading is a promising technique proposed to overcome this challenge, especially when enhanced with caching and pre-fetching, but prior work stopped at a suboptimal caching algorithm and offered limited insight. In this work, we study MoE offloading in depth and make the following contributions: 1. We analyze expert activation and LRU caching behavior in detail and provide traces. 2. We propose an LFU caching optimization based on our analysis and obtain strong improvements over LRU. 3. We implement and experiment with speculative expert pre-fetching, providing detailed traces that show its huge potential. 4. In addition, our study extensively covers the behavior of the MoE architecture itself, offering information on the characteristics of the gating network and experts. This can inspire future work on the interpretation of MoE models and the development of pruning techniques for MoE architectures with minimal performance loss.
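As a rough illustration of the caching idea behind contributions 2 and 3, the sketch below shows how an LFU policy with optional speculative pre-fetching could decide which experts stay on the GPU. This is a hedged reconstruction, not the authors' code; `load_to_gpu` and `evict_to_cpu` are hypothetical callbacks standing in for the real CPU/GPU transfer routines.

```python
from collections import Counter

# Hedged reconstruction of an LFU expert cache with speculative pre-fetching;
# the paper's actual implementation may differ.
class LFUExpertCache:
    def __init__(self, capacity, load_to_gpu, evict_to_cpu):
        self.capacity = capacity        # max number of experts kept on the GPU
        self.freq = Counter()           # activation count per expert id
        self.resident = set()           # expert ids currently on the GPU
        self.load_to_gpu = load_to_gpu
        self.evict_to_cpu = evict_to_cpu

    def access(self, expert_id):
        """Called whenever the gating network routes tokens to `expert_id`."""
        self.freq[expert_id] += 1
        if expert_id in self.resident:
            return                      # hit: no transfer needed
        if len(self.resident) >= self.capacity:
            # LFU eviction: drop the resident expert activated least often.
            victim = min(self.resident, key=lambda e: self.freq[e])
            self.resident.remove(victim)
            self.evict_to_cpu(victim)
        self.load_to_gpu(expert_id)     # miss: pay the CPU -> GPU transfer cost
        self.resident.add(expert_id)

    def prefetch(self, predicted_ids):
        """Speculatively load experts predicted to be needed next, if space allows."""
        for expert_id in predicted_ids:
            if expert_id not in self.resident and len(self.resident) < self.capacity:
                self.load_to_gpu(expert_id)
                self.resident.add(expert_id)
```

Compared with LRU, LFU keeps the experts the gate activates most often resident even if they were not used by the most recent tokens, which is presumably why the paper reports strong improvements over LRU.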

cs / cs.LG / cs.AI