Ultra-short summary: Speed up the LLM brain 🧠 with RL! Cutting costs isn't just a dream ✨
✨ Gyaru-Style Sparkle Points ✨
● RL (reinforcement learning) smartly controls how the LLM runs! A clever model gets even cleverer 💖
● Draft model calls are adjusted dynamically! Cut the waste, boost the efficiency ⤴️
● 3–5x faster than existing methods! Blazing-fast LLMs to hype everyone up 🎵
Detailed Explanation
Read the rest in the「らくらく論文」app
Inference with modern Large Language Models (LLMs) is expensive and slow, and speculative sampling has emerged as an effective solution to this problem. However, the number of calls to the draft model for generating candidate tokens in speculative sampling is a preset hyperparameter, lacking flexibility. To generate and utilize candidate tokens more effectively, we propose RADAR, a novel speculative sampling method with RL-based dynamic draft trees. RADAR formulates the draft tree generation process as a Markov Decision Process (MDP) and employs offline reinforcement learning to train a prediction model that decides in real time how many calls to make to the draft model, reducing redundant computation and further accelerating inference. Evaluations across three LLMs and four tasks show that RADAR achieves a speedup of 3.17x–4.82x over the auto-regressive decoding baseline. The code is available at https://github.com/minaduki-sora/RADAR.
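To make the core idea concrete, here is a minimal sketch of speculative decoding in which the number of draft-model calls per round is chosen dynamically by a predictor instead of being a fixed hyperparameter. Everything here is a toy stand-in: `draft_model`, `target_model_accepts`, and `predict_draft_length` are hypothetical placeholders over a tiny integer vocabulary, not RADAR's actual models, and the learned offline-RL prediction model is replaced by a simple heuristic.

```python
# Toy sketch of speculative decoding with a *dynamic* draft length.
# All model functions below are hypothetical stand-ins, not RADAR's code.

def draft_model(context):
    # Cheap proposal model: a deterministic toy rule over vocab {0..9}.
    return (context[-1] + 1) % 10

def target_model_accepts(context, token):
    # Toy verification step standing in for the target model's
    # accept/reject check on a drafted token.
    return token != 0

def predict_draft_length(context):
    # Stand-in for an RL-trained prediction model: decide from the
    # current state how many draft calls to make this round.
    return 1 + context[-1] % 4

def speculative_decode(prompt, max_new_tokens=20):
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        k = predict_draft_length(out)   # dynamic number of draft calls
        drafts, ctx = [], list(out)
        for _ in range(k):              # draft k candidate tokens cheaply
            tok = draft_model(ctx)
            drafts.append(tok)
            ctx.append(tok)
        accepted = 0
        for tok in drafts:              # target model verifies the drafts
            if target_model_accepts(out, tok):
                out.append(tok)
                accepted += 1
            else:
                break
        if accepted < len(drafts):
            # On rejection, fall back to one token from the target model
            # (here another toy rule).
            out.append((out[-1] + 5) % 10)
    return out[:len(prompt) + max_new_tokens]
```

Because `predict_draft_length` adapts `k` to the state, easy stretches can draft many tokens per round while hard spots draft few, which is the redundancy the dynamic policy is meant to remove.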