Published: 2026/1/11 13:34:44

Decoding the RL-Friendliness of LLMs! 🌟

Ultra-short summary: Unlocking the secret of how well LLMs get along with reinforcement learning (RL)! 💖

1. Gyaru-Style Sparkle Points ✨
● Qwen gets along great with RL, but Llama... not so much. This paper studies why! 🤔
● The focus is "distributional clarity": the sharper the separation between correct and incorrect answers, the more RL-friendly the model! 💖
● The value S (the Silhouette Coefficient) is used to adjust how RL training works. So smart! 👩‍🎓

2. Detailed Explanation

Background: LLMs (large language models) can write text, translate, and more, which is seriously amazing, right? ✨ And RL (reinforcement learning) can make them even smarter. The catch: how well RL works differs from model to model!

Method: A thorough analysis of LLMs' RL-friendliness (how well they mesh with RL)! It proceeds in three stages: first observe the phenomenon, then probe the mechanism, and finally interpret it. The key is "distributional clarity": the clearer the separation between correct and incorrect answers, the more a model improves under RL 💖
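The "distributional clarity" idea can be illustrated with a tiny sketch: compute a Silhouette Coefficient over the probabilities a model assigns to correct vs. incorrect responses. Everything below is a toy illustration with synthetic log-probabilities, not the paper's actual code or data.

```python
import numpy as np

def silhouette_coefficient(x, labels):
    """Mean silhouette score for 1-D points x with cluster labels.
    a(i): mean distance to same-cluster points (excluding self);
    b(i): mean distance to the other cluster; s(i) = (b - a) / max(a, b)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i in range(len(x)):
        same = x[labels == labels[i]]
        other = x[labels != labels[i]]
        a = np.abs(same - x[i]).sum() / (len(same) - 1)  # exclude self
        b = np.abs(other - x[i]).mean()
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy log-probabilities assigned to correct vs. incorrect responses
# (illustrative numbers only).
rng = np.random.default_rng(0)
correct = rng.normal(-1.0, 0.3, 50)    # compact, high-probability cluster
incorrect = rng.normal(-4.0, 0.3, 50)  # clearly separated cluster
S = silhouette_coefficient(np.concatenate([correct, incorrect]),
                           np.array([1] * 50 + [0] * 50))
print(f"S = {S:.2f}")  # high S suggests "RL-friendly" separation
```

A model whose correct and incorrect answers overlap heavily in probability space would score near 0 here, which is the regime the paper associates with limited RL gains.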


Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

Shaoning Sun / Mingzhu Cai / Huang He / Bingjin Chen / Siqi Bao / Yujiu Yang / Hua Wu / Haifeng Wang

Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: distributional clarity in probability space. Through a three-stage analysis (from phenomenon to mechanism to interpretation), we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the Silhouette Coefficient (S) and demonstrate that (1) high S correlates strongly with RL performance; (2) low S is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-S samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-friendliness.
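The Silhouette-Aware Reweighting strategy described in the abstract (prioritizing low-S samples during training) might look roughly like the sketch below. The exponential weighting function and the `alpha` parameter are assumptions for illustration; the paper's actual formula is not given here.

```python
import numpy as np

def silhouette_aware_weights(s_values, alpha=1.0):
    """Hypothetical reweighting sketch: samples with low silhouette
    scores (unclear correct/incorrect separation) get larger training
    weights. Weights are normalized to sum to 1."""
    s = np.asarray(s_values, dtype=float)
    w = np.exp(-alpha * s)  # S in [-1, 1]; lower clarity -> larger weight
    return w / w.sum()

# Three samples with decreasing distributional clarity.
weights = silhouette_aware_weights([0.9, 0.1, -0.3])
print(weights)  # the lowest-S sample receives the largest weight
```

The design intuition: samples where the model's probability space is already well separated contribute little new signal, so training effort is shifted toward the ambiguous ones.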

cs / cs.CL / cs.AI / cs.LG