Published: 2025/12/16 11:09:35

EARS Makes LLMs Blazing Fast 🚀 A Super-Speedy Explainer!

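Ultra-short summary: a technique that makes LLM inference blazing fast!
Gyaru-style sparkle points ✨
- LLMs (large language models) get to generate text way more smoothly 💖
- The trick is to smartly adjust the rejection criterion so that good candidate tokens don't get thrown away for nothing 😉
- Chatbots and similar services get even nicer to use, which is the best 🥳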
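Detailed explanation
- Background: LLMs are amazing, but slow inference has always been the sticking point 😭 Now a new technique called EARS apparently makes LLM inference a whole lot faster by improving speculative decoding!
- Method: Speculative decoding uses "rejection sampling" to decide whether to keep the tokens a fast draft model proposes. EARS changes that rejection criterion according to the "uncertainty" of the target model's prediction, measured as 1 - max(P_target)! So when the model seems unsure, the criterion gets loosened, and when it's super confident, it stays strict. So smart!
- Results: Thanks to this method, throughput goes up by as much as 18.12% ⬆️! And the accuracy drop on the GSM8K benchmark is only 0.84%, which is pretty wild 😳
- Significance (the seriously amazing ♡ point): Services built on LLMs could run much more snappily! For example, chatbot replies might come back faster, and all kinds of content might get generated right away 🤩 The whole IT industry could get even more exciting, right?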
Real-world use ideas 💡
- Chatbots might be able to hold conversations as smoothly as chatting with a friend 😍
- Writing AIs might whip up articles and more in no time, just like a magic trick 🪄


Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

Chendong Sun / Mingmin Chen / Lei Xu

Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as 1 - max(P_target). By introducing a tolerance term proportional to this uncertainty, EARS intelligently relaxes the acceptance criterion when the model is uncertain, effectively reducing random rejections while maintaining strict standards when the model is confident. Experiments on creative writing and open-domain QA tasks demonstrate that EARS significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark. The method requires no modifications to model architectures and can be seamlessly integrated into existing speculative decoding frameworks.
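
The acceptance rule described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract only says that a tolerance term proportional to 1 - max(P_target) relaxes the criterion, so the additive form, the cap at 1, the parameter name `beta`, and the helper names `acceptance_threshold`, `accept`, and `FixedRng` are all assumptions made for this example. Setting `beta=0` recovers the standard speculative decoding rule of accepting a drafted token with probability min(1, p/q).

```python
import random


def acceptance_threshold(p_target, q_draft, token, beta=0.0):
    """Return the probability of accepting a drafted token.

    p_target, q_draft: probability lists over the same vocabulary.
    token: index proposed by the draft model (so q_draft[token] > 0).
    beta: hypothetical tolerance scale; beta=0 gives the standard rule.
    """
    # Standard speculative decoding accepts with probability min(1, p/q).
    ratio = min(1.0, p_target[token] / q_draft[token])
    # Uncertainty as defined in the abstract: 1 - max(P_target).
    uncertainty = 1.0 - max(p_target)
    # ASSUMPTION: the tolerance is added to the threshold and capped at 1.
    return min(1.0, ratio + beta * uncertainty)


def accept(p_target, q_draft, token, beta=0.0, rng=random):
    """Draw r ~ U(0, 1) and accept the token if r falls under the threshold."""
    return rng.random() <= acceptance_threshold(p_target, q_draft, token, beta)


class FixedRng:
    """Stand-in random source that always returns the same draw."""

    def __init__(self, value):
        self.value = value

    def random(self):
        return self.value


# Toy distributions over a three-token vocabulary (made-up numbers).
uncertain = [0.35, 0.33, 0.32]   # target model has no clear favourite
confident = [0.97, 0.02, 0.01]   # target model strongly prefers token 0
draft = [0.20, 0.50, 0.30]       # draft model proposed token 1
unlucky = FixedRng(0.9)          # simulate an unlucky random draw

print(accept(uncertain, draft, 1, beta=0.0, rng=unlucky))  # False
print(accept(uncertain, draft, 1, beta=0.5, rng=unlucky))  # True
print(accept(confident, draft, 1, beta=0.5, rng=unlucky))  # False
```

In the uncertain case the drafted token is plausible (p/q = 0.66) yet the fixed rule rejects it on an unlucky draw, which is the "random rejection" the abstract describes. With the tolerance term the threshold rises to 0.985 and the token is kept. In the confident case the tolerance adds almost nothing, so the criterion stays strict. Any relaxation of this kind means the output no longer follows the target distribution exactly, which would be consistent with the small accuracy drop the abstract reports.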

cs / cs.CL / cs.AI
