Published: 2026/1/11 13:34:44

Decoding the RL-Friendliness of LLMs! 🌟

Ultra-short summary: Unlocking the secret of how well LLMs get along with reinforcement learning (RL)! 💖

1. Gyaru-Style Sparkle Points ✨
● Qwen gets along great with RL, but Llama... not so much. This paper studies why! 🤔
● The focus is "distributional clarity": the sharper the separation between correct and incorrect answers, the more RL-friendly the model! 💖
● The value S (the Silhouette Coefficient) is used to adjust how RL training works. So smart! 👩‍🎓

2. Detailed Explanation

Background: LLMs (large language models) can write text, translate, and more, which is seriously amazing, right? ✨ And RL (reinforcement learning) can make them even smarter. The catch: how well RL works differs from model to model!

Method: A thorough analysis of LLMs' RL-friendliness (how well they mesh with RL)! It proceeds in three stages: first observe the phenomenon, then probe the mechanism, and finally interpret it. The key is "distributional clarity": the clearer the separation between correct and incorrect answers, the more a model improves under RL 💖
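The "distributional clarity" idea can be illustrated with a tiny sketch: compute a Silhouette Coefficient over the probabilities a model assigns to correct vs. incorrect responses. Everything below is a toy illustration with synthetic log-probabilities, not the paper's actual code or data.

```python
import numpy as np

def silhouette_coefficient(x, labels):
    """Mean silhouette score for 1-D points x with cluster labels.
    a(i): mean distance to same-cluster points (excluding self);
    b(i): mean distance to the other cluster; s(i) = (b - a) / max(a, b)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i in range(len(x)):
        same = x[labels == labels[i]]
        other = x[labels != labels[i]]
        a = np.abs(same - x[i]).sum() / (len(same) - 1)  # exclude self
        b = np.abs(other - x[i]).mean()
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy log-probabilities assigned to correct vs. incorrect responses
# (illustrative numbers only).
rng = np.random.default_rng(0)
correct = rng.normal(-1.0, 0.3, 50)    # compact, high-probability cluster
incorrect = rng.normal(-4.0, 0.3, 50)  # clearly separated cluster
S = silhouette_coefficient(np.concatenate([correct, incorrect]),
                           np.array([1] * 50 + [0] * 50))
print(f"S = {S:.2f}")  # high S suggests "RL-friendly" separation
```

A model whose correct and incorrect answers overlap heavily in probability space would score near 0 here, which is the regime the paper associates with limited RL gains.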


Distributional Clarity: The Hidden Driver of RL-Friendliness in Large Language Models

Shaoning Sun / Mingzhu Cai / Huang He / Bingjin Chen / Siqi Bao / Yujiu Yang / Hua Wu / Haifeng Wang

Language model families exhibit striking disparity in their capacity to benefit from reinforcement learning: under identical training, models like Qwen achieve substantial gains, while others like Llama yield limited improvements. Complementing data-centric approaches, we reveal that this disparity reflects a hidden structural property: distributional clarity in probability space. Through a three-stage analysis (from phenomenon to mechanism to interpretation), we uncover that RL-friendly models exhibit intra-class compactness and inter-class separation in their probability assignments to correct vs. incorrect responses. We quantify this clarity using the Silhouette Coefficient (S) and demonstrate that (1) high S correlates strongly with RL performance; (2) low S is associated with severe logic errors and reasoning instability. To confirm this property, we introduce a Silhouette-Aware Reweighting strategy that prioritizes low-S samples during training. Experiments across six mathematical benchmarks show consistent improvements across all model families, with gains up to 5.9 points on AIME24. Our work establishes distributional clarity as a fundamental, trainable property underlying RL-friendliness.
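The Silhouette-Aware Reweighting strategy described in the abstract (prioritizing low-S samples during training) might look roughly like the sketch below. The exponential weighting function and the `alpha` parameter are assumptions for illustration; the paper's actual formula is not given here.

```python
import numpy as np

def silhouette_aware_weights(s_values, alpha=1.0):
    """Hypothetical reweighting sketch: samples with low silhouette
    scores (unclear correct/incorrect separation) get larger training
    weights. Weights are normalized to sum to 1."""
    s = np.asarray(s_values, dtype=float)
    w = np.exp(-alpha * s)  # S in [-1, 1]; lower clarity -> larger weight
    return w / w.sum()

# Three samples with decreasing distributional clarity.
weights = silhouette_aware_weights([0.9, 0.1, -0.3])
print(weights)  # the lowest-S sample receives the largest weight
```

The design intuition: samples where the model's probability space is already well separated contribute little new signal, so training effort is shifted toward the ambiguous ones.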

cs / cs.CL / cs.AI / cs.LG