Super-short summary: analyzes the relationship between SFT and RL-FT, two ways of training LLMs, and finds hints for boosting OOD (out-of-distribution) performance 💖
✨ Gal-style sparkle points ✨
● They took a close look at how SFT (supervised fine-tuning) and RL-FT (reinforcement-learning fine-tuning) relate to each other! Amazing!
● It might reveal hints for making LLM (large language model) training more efficient! High hopes!
● It could even help LLMs overcome their weakness on OOD (out-of-distribution) data...? Unbeatable!
Detailed explanation
Background
Training an LLM is a huge job, right? Doing it from scratch is so costly that fine-tuning with SFT and RL-FT has become the mainstream approach! SFT makes a model strong on specific tasks, but its OOD performance tends to drop... So this study looks at whether RL-FT can cover the OOD side! ✨
Method
They analyze how the LLM's parameters (the model's tunable values) move 🔍 and examine in detail how SFT and RL-FT each reshape the model's internal structure! In particular, they focus on rotations of the singular vectors (key components of the weight matrices) 👀 They also evaluate OOD behavior with a new benchmark based on the 24-point card game! A rough sketch of this kind of singular-vector diagnostic follows below.
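To make the singular-vector rotation idea concrete, here is a minimal PyTorch sketch of what such a spectrum diagnostic could look like: it SVD-decomposes the same weight matrix from two checkpoints and measures how far the matched singular directions have rotated. The function name, the simple index-matched comparison, and the random stand-in weights are illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch only (not the authors' released code): quantify how much
# the singular-vector directions of one weight matrix rotate between a base
# checkpoint and a fine-tuned checkpoint.
import torch

def singular_direction_shift(w_base: torch.Tensor, w_ft: torch.Tensor, top_k: int = 20):
    """Return |cosine| between matched top-k singular directions of two weights."""
    U0, S0, V0h = torch.linalg.svd(w_base, full_matrices=False)
    U1, S1, V1h = torch.linalg.svd(w_ft, full_matrices=False)

    # Index-matched comparison (a simplification): |cos| near 1 means the
    # direction is preserved, near 0 means a large rotation.
    cos_u = (U0[:, :top_k] * U1[:, :top_k]).sum(dim=0).abs()   # left singular vectors
    cos_v = (V0h[:top_k] * V1h[:top_k]).sum(dim=1).abs()       # right singular vectors
    return cos_u, cos_v, (S1 - S0)                             # directions vs. magnitudes

# Toy usage with random stand-in weights; in practice, iterate over the
# corresponding layer weights of two real checkpoints.
w0 = torch.randn(512, 512)
w1 = w0 + 0.05 * torch.randn(512, 512)
cos_u, cos_v, dS = singular_direction_shift(w0, w1)
print(cos_u.mean().item(), cos_v.mean().item())
```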
Training large language models (LLMs) from scratch is increasingly impractical, making post-training methods such as supervised fine-tuning (SFT) and reinforcement-learning fine-tuning (RL-FT, e.g., PPO) central to modern practice. Using an out-of-distribution (OOD) variant of the 24-point card game and new spectrum-based diagnostics, we revisit how these two stages reshape model representations and OOD performance. Our key findings are: (1) RL-FT can restore much of the OOD performance lost during SFT (e.g., Llama-11B from 8.97% to 15.38%, Qwen-7B from 17.09% to 19.66%), but when SFT induces severe overfitting and a clear distribution shift, RL-FT cannot fully recover OOD performance. (2) Direction shifts of singular vectors matter more than changes in singular value magnitudes; these shifts concentrate on directions linked to the largest and smallest singular values, leaving the bulk of the spectrum intact. (3) Low-rank and shallow recovery is effective: restoring singular-vector directions for the top 20% of singular values or the first 25% of layers recovers 70-80% of OOD performance. (4) Stronger SFT checkpoints enable better recovery by RL, while overfitted ones resist restoration. These results reconcile prior reports of RL's superior OOD performance: RL primarily counteracts SFT-induced directional drift rather than finding new solutions. Our spectrum-aware analysis highlights inexpensive recovery knobs (low-rank UV merging and shallow-layer resets) that practitioners can use before costly RL fine-tuning.
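The "low-rank UV merging" knob mentioned at the end of the abstract can be pictured with the following hedged sketch: keep the fine-tuned singular values, but swap the top fraction of singular-vector directions back to those of the base checkpoint. The function name, the top_frac parameter, and the per-matrix recipe are assumptions for illustration; the paper's exact procedure may differ.

```python
# Hedged sketch of low-rank "UV merging": keep the fine-tuned singular values,
# but restore the base checkpoint's singular-vector directions for the top
# fraction of the spectrum. Not the paper's exact procedure.
import torch

def low_rank_uv_merge(w_base: torch.Tensor, w_ft: torch.Tensor, top_frac: float = 0.2) -> torch.Tensor:
    U0, _, V0h = torch.linalg.svd(w_base, full_matrices=False)
    U1, S1, V1h = torch.linalg.svd(w_ft, full_matrices=False)

    k = max(1, int(top_frac * S1.numel()))
    # Top of the spectrum: base-model directions with fine-tuned magnitudes.
    top = U0[:, :k] @ torch.diag(S1[:k]) @ V0h[:k]
    # Rest of the spectrum: left exactly as the fine-tuned model learned it.
    rest = U1[:, k:] @ torch.diag(S1[k:]) @ V1h[k:]
    return top + rest

# Usage idea: apply per weight matrix, optionally only in the first ~25% of
# layers (the "shallow-layer reset" variant), before resorting to costly RL-FT.
w_merged = low_rank_uv_merge(torch.randn(256, 256), torch.randn(256, 256), top_frac=0.2)
```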