Published: 2025/10/23 9:30:23

タイトル & 超要約:📱爆速LLM!モバイルAI革命✨

Gal-Style Sparkle Points ✨

  • It's tech that makes mobile LLMs (large language models) blazing fast! That means the AI on your phone gets super smart 💖
  • The collab between speculative decoding (predicting the future!) and the NPU (your phone's brain) is just too amazing 👏
  • Smartphone AI gets buttery smooth, so the whole experience seriously hits god tier 😎

Detailed Explanation

Background: You want to use smart AI on your phone, but it's so annoying when it's slow, right? 😤 This research is all about tech that makes smartphone AI run crazy fast! The LLM brain runs on the phone's NPU chip, but the problem was that spitting out text one token at a time was still slow.


Accelerating Mobile Language Model via Speculative Decoding and NPU-Coordinated Execution

Zhiyang Chen / Daliang Xu / Haiyang Shen / Mengwei Xu / Shangguang Wang / Yun Ma

Enhancing on-device large language models (LLMs) with contextual information from local data enables personalized and task-aware generation, powering use cases such as intelligent assistants and UI agents. While recent developments in neural processors have substantially improved the efficiency of prefill on mobile devices, the token-by-token generation process still suffers from high latency and limited hardware utilization due to its inherently memory-bound characteristics. This work presents sd.npu, a mobile inference framework that integrates speculative decoding with dynamic hardware scheduling to accelerate context-aware text generation on mobile devices. The framework introduces three synergistic components: (1) adaptive execution scheduling, which dynamically balances compute graphs between prefill and decoding phases; (2) context-aligned drafting, which improves speculative efficiency through lightweight online calibration to current tasks; and (3) hardware-efficient draft extension, which reuses and expands intermediate sequences to improve processing parallelism and reduce verification cost. Experiments on multiple smartphones and representative workloads show consistent improvements of up to 3.8x in generation speed and 4.7x in energy efficiency compared with existing mobile inference solutions. Component-level analysis further validates the contribution of each optimization.
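The core idea the framework builds on is speculative decoding: a lightweight draft model proposes several tokens ahead, and the larger target model verifies them in one parallel pass, accepting the matching prefix. The sketch below is a minimal, greedy-acceptance illustration of that draft-then-verify loop; the callables, function names, and parameters are our own placeholders, and it omits the paper's actual contributions (adaptive NPU scheduling, context-aligned drafting, and hardware-efficient draft extension).

```python
# Minimal sketch of a greedy speculative-decoding loop (illustrative only;
# not the sd.npu implementation). Draft and target "models" are toy callables.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # returns the next token greedily


def speculative_decode(target: Model, draft: Model,
                       prompt: List[Token], max_new: int = 32,
                       k: int = 4) -> List[Token]:
    """Generate at least max_new tokens: draft proposes k, target verifies."""
    seq = list(prompt)
    produced = 0
    while produced < max_new:
        # 1) Draft model proposes k candidate tokens autoregressively.
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2) Target model checks the proposals position by position.
        #    (On real hardware this verification is a single batched pass,
        #    which is what recovers parallelism in the memory-bound decode.)
        accepted = 0
        for i, t in enumerate(proposal):
            if target(seq + proposal[:i]) == t:
                accepted += 1
            else:
                break
        seq.extend(proposal[:accepted])
        produced += accepted

        # 3) On a mismatch (or after accepting all k), take one token
        #    directly from the target so progress is always guaranteed.
        if produced < max_new:
            seq.append(target(seq))
            produced += 1
    return seq


# Toy demo: the "target" repeats a fixed pattern; the "draft" mostly agrees.
if __name__ == "__main__":
    pattern = [1, 2, 3, 4]
    target = lambda ctx: pattern[len(ctx) % len(pattern)]
    draft = lambda ctx: pattern[len(ctx) % len(pattern)] if len(ctx) % 7 else 0
    print(speculative_decode(target, draft, prompt=[1, 2], max_new=12, k=4))
```

When the draft agrees with the target (as it usually does on easy tokens), several tokens are committed per target pass, which is where the reported generation speedups come from.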

cs / cs.CL