🌟 Gyaru-style sparkle points ✨
● They pinpointed why training of LLMs (large language models) gets unstable!
● They found a way to mitigate the cause, "Lazy Likelihood Displacement (LLD)"!
● This could lead to AI that works way better with search engines and all kinds of other tools!
Here comes the detailed breakdown~!
Background: LLMs can team up with search and all kinds of tools to do amazing things now, right? 😍 But those tool-integrated LLMs had a problem: training would stall partway through, or just not work at all 😭
Method: Digging into the cause, they found it's a phenomenon called "Lazy Likelihood Displacement (LLD)"! Basically, it's like the LLM slacks off a bit and its learning drifts in a weird direction? 😂 To keep that LLD in check, they developed a new method called "LLDS"!
Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet it consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question-answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose LLDS, a lightweight likelihood-preserving regularization for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TI RL and provide a practical path toward stable, scalable training of tool-integrated LLMs.
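The abstract describes the regularizer's key structure: it fires only when a whole trajectory's likelihood has dropped, and it penalizes only the individual tokens responsible for that drop. The paper's exact loss is not given here, so the following is a minimal illustrative sketch in that spirit, not the authors' formulation; the function name `lld_regularizer`, the coefficient `coeff`, and the simple hinge-style token penalty are all assumptions for illustration:

```python
def lld_regularizer(logp_new, logp_old, coeff=0.1):
    """Sketch of a likelihood-preserving penalty in the spirit of LLDS.

    logp_new, logp_old: per-token log-probabilities of the SAME sampled
    trajectory under the current and previous policy. (Hypothetical
    interface; the paper's actual loss may differ.)
    """
    # Trajectory-level gate: only act when the whole trajectory's
    # likelihood has decreased (the LLD condition from the abstract).
    if sum(logp_new) >= sum(logp_old):
        return 0.0
    # Token-level selectivity: penalize only the tokens whose own
    # log-probability dropped, i.e. the ones "responsible" for LLD.
    return coeff * sum(max(old - new, 0.0)
                       for new, old in zip(logp_new, logp_old))
```

The two-stage structure (a trajectory gate plus a per-token mask) is what the abstract means by "fine-grained": tokens whose likelihood is stable or rising are left untouched, so the regularizer interferes with optimization as little as possible.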