Mobile LLMs at Blazing Speed! Supercharging RAG with NPUs and SD 🚀

Published: 2025/12/17 5:59:04

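Ultra-short summary: They made mobile LLMs (the brains) blazing fast with an NPU (a dedicated AI chip) and SD (speculative decoding, a speed-up trick)! RAG (retrieval-augmented generation, where the model looks things up before answering) runs super smoothly too! ✨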

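✨ Gal-Style Sparkly Points ✨
● Mobile RAG gets blazing fast, so your phone gets even smarter. Totally lit, right? 💖
● Combining the NPU with SD means better battery life for your phone too. Absolutely divine! 🔋
● Companies can grab new business opportunities with mobile AI. Peak hype! 💰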

Here comes the detailed breakdown! 🎤

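Background: You wanna use AI on your phone, but it's such a bummer when it runs slow or drains your battery in no time, right? 😭 But apparently, with an NPU (a chip dedicated to AI), you can get both fast processing AND energy savings!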

Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution

Zhiyang Chen / Daliang Xu / Haiyang Shen / Chiheng Lou / Mengwei Xu / Shangguang Wang / Xin Jin / Yun Ma

Performing Retrieval-Augmented Generation (RAG) directly on mobile devices is promising for data privacy and responsiveness but is hindered by the architectural constraints of mobile NPUs. Specifically, current hardware struggles with the variable workloads intrinsic to RAG: the transition between processing extensive contexts and generating tokens incurs significant overhead due to static graph constraints, while the memory-bound generation phase leaves computational resources underutilized. In this work, we propose a holistic acceleration framework, sd.npu, designed to maximize NPU efficiency for the on-device RAG ecosystem. To address the latency caused by NPU graph switching during phase transitions, we introduce a pipelined execution strategy. This approach masks the overhead of model reconfiguration by parallelizing the loading of decoding graphs with the computation of partitioned context chunks (chunked prefill), thereby ensuring continuous execution flow. Furthermore, to mitigate low hardware utilization during the decoding phase, we develop an NPU-centric speculative decoding mechanism. By calibrating generation distributions and extending draft sequences, our method effectively converts idle NPU cycles into valid token throughput. Experiments on commercial smartphones show that our framework significantly outperforms existing baselines, delivering 1.06× to 3.81× speedups and 1.07× to 4.71× energy savings across various RAG tasks.
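
To make the "speculative decoding" idea in the abstract a bit more concrete, here is a minimal Python sketch of the generic greedy variant: a cheap draft step guesses several tokens ahead, and the expensive target model checks them all in one go. This is NOT the sd.npu implementation (the calibration and draft-extension steps from the abstract are not modeled), and every name here (`speculative_decode`, `toy_target`, `toy_draft`, `draft_len`) is a hypothetical stand-in invented for illustration. Using a separate draft function is also just an assumption of this sketch.

```python
from typing import Callable, List, Tuple

Token = int
# Assumption: a "model" is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[Token],
    max_new: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Generic greedy speculative decoding (illustrative sketch only).

    Returns the generated sequence and the number of target verification passes.
    """
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1) The draft cheaply proposes draft_len tokens, one after another.
        proposal: List[Token] = []
        for _ in range(draft_len):
            proposal.append(draft(out + proposal))

        # 2) The target verifies every proposed position. On real hardware this
        #    would be ONE batched forward pass, so we count it as one pass.
        target_passes += 1
        accepted: List[Token] = []
        for i in range(draft_len):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # keep the target's correction, drop the rest
                break
        else:
            # Every draft token matched, so the same pass also yields a bonus token.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[: len(prompt) + max_new], target_passes


# Hypothetical toy models: the target counts upward modulo 10;
# the draft agrees with it except right after a 6, where it guesses wrong.
def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[Token]) -> Token:
    return 0 if prefix[-1] == 6 else (prefix[-1] + 1) % 10


tokens, passes = speculative_decode(toy_target, toy_draft, prompt=[0], max_new=12)
print(tokens)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2]
print(passes)  # 3
```

In this toy run, the output matches what the target alone would generate token by token, but it takes 3 verification passes instead of 12 single-token steps. That gap is the intuition behind turning otherwise idle compute during the memory-bound decoding phase into extra tokens.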

cs / cs.CL
