Published: 2025/12/3 15:06:06

RL Solves the LLM "Forgetting" Problem! 💖 (For New Business Ventures)

Super-short summary: When training an LLM (large language model), RL (reinforcement learning) makes it much less prone to forgetting! Even when you teach it something new, the knowledge it already has is far less likely to disappear ✨

✨ Gyaru-Style Sparkle Points ✨

● RL (reinforcement learning) forgets existing knowledge less than SFT (supervised fine-tuning)! Amazing!
● On-policy data (the data RL trains on) is apparently the key! It preserves the LLM's modes, so the model forgets less!
● The IT industry might get to use LLMs even more cleverly! ✨ Which means all kinds of services could get better ♪

Detailed Explanation

Continued in the「らくらく論文」app

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Howard Chen / Noam Razin / Karthik Narasimhan / Danqi Chen

Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause of this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the mode-seeking nature of RL, which stems from its use of on-policy data, enables keeping prior knowledge intact when learning the target task. We then verify this insight by demonstrating that the use of on-policy data underlies the robustness of RL to forgetting in practical settings, as opposed to other algorithmic choices such as KL regularization or advantage estimation. Lastly, as a practical implication, our results highlight the potential of mitigating forgetting using approximately on-policy data, which can be substantially more efficient to obtain than fully on-policy data.
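
As a minimal numerical sketch (an illustration added here, not the paper's setup or code) of the mode-covering vs. mode-seeking contrast the abstract invokes: the forward KL direction is the one commonly associated with fitting off-policy data as in SFT, and the reverse KL direction with on-policy RL. Fitting a single Gaussian to a two-mode distribution, whose modes loosely stand in for prior knowledge and the target task, shows how differently the two directions behave; the grid, distributions, and numbers below are all illustrative assumptions.

```python
import numpy as np

# Toy illustration (assumed for exposition, not the paper's code): contrast the
# two KL directions commonly associated with SFT on off-policy data (forward KL,
# mode-covering) and on-policy RL (reverse KL, mode-seeking). We fit a single
# Gaussian q to a bimodal distribution p whose modes stand in for "prior
# knowledge" and "the target task", and check which mode(s) each fit keeps.

xs = np.linspace(-10.0, 10.0, 4001)
dx = xs[1] - xs[0]

def gaussian(mu, sigma):
    pdf = np.exp(-0.5 * ((xs - mu) / sigma) ** 2)
    return pdf / (pdf.sum() * dx)  # normalize on the grid

# Bimodal "LM" distribution: left mode = prior knowledge, right mode = target task.
p = 0.5 * gaussian(-3.0, 0.7) + 0.5 * gaussian(3.0, 0.7)

def kl(a, b):
    a = np.maximum(a, 1e-300)  # avoid log(0); clipped points contribute ~0
    b = np.maximum(b, 1e-300)
    return np.sum(a * (np.log(a) - np.log(b))) * dx

# Brute-force search over single-Gaussian candidates q(mu, sigma).
best_fwd = best_rev = (np.inf, 0.0, 0.0)
for mu in np.linspace(-5.0, 5.0, 101):
    for sigma in np.linspace(0.3, 5.0, 48):
        q = gaussian(mu, sigma)
        fwd = kl(p, q)  # forward KL(p || q): must cover both modes of p
        rev = kl(q, p)  # reverse KL(q || p): penalized for mass where p is small
        best_fwd = min(best_fwd, (fwd, mu, sigma))
        best_rev = min(best_rev, (rev, mu, sigma))

print(f"forward-KL fit: mu={best_fwd[1]:+.2f}, sigma={best_fwd[2]:.2f}  -> broad, covers both modes")
print(f"reverse-KL fit: mu={best_rev[1]:+.2f}, sigma={best_rev[2]:.2f}  -> narrow, locks onto one mode")
```

In this toy, the forward-KL fit smears its mass to cover both modes, while the reverse-KL fit stays concentrated on a single mode it can match well. That mode-seeking behavior, which arises because the objective takes expectations under the model's own samples, is the mechanism the abstract credits for RL learning the target task while leaving the prior-knowledge component of the model largely untouched.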

cs / cs.LG / cs.CL