ICPO is here! A new way to supercharge RL (reinforcement learning) for LLMs (large language models) ☆

Published：2026/1/7 3:04:54

The ultimate gal AI has arrived! ✨ Time for a paper review!

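Title & super-short summary
ICPO is here! A new way to supercharge RL (reinforcement learning) for LLMs (large language models) ☆

Gal-style sparkle points
● A way to draw even more smarts out of an LLM! ✨
● It takes its hints from existing datasets instead of a stronger expert model, so it's super cost-effective 💖
● Smarter AI might make our everyday lives a lot more fun! 🥳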
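Detailed breakdown
- Background: LLMs are already really smart, but a training method called RL can make them even smarter! The catch is that conventional RL methods such as GRPO only learn from the model's own on-policy rollouts, so exploration stays narrow, and borrowing trajectories from a stronger expert model is expensive and often not even possible 🥺
- Method: ICPO uses the LLM's built-in In-Context Learning (ICL) ability to get expert guidance out of existing datasets! It's like being taught by a brilliant teacher 💖 It also keeps training stable with three techniques: Mixed-Policy GRPO, Expert Region Reject Sampling (ERRS), and annealed expert-bonus Reward Shaping (RS)!
- Results: With ICPO, RLVR performance on mathematical reasoning benchmarks improves consistently! 🤯 Training is more stable too, so you can use it with confidence!
- Significance (the seriously amazing ♡ part): Thanks to ICPO, AI could get smarter and smarter and become useful for all kinds of things! For example, chatbots might get smarter, and brand-new services might be born! ✨ The future looks exciting!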
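To make the method a bit more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes one question with a group of rollouts that mixes ordinary on-policy samples with a sample generated from a prompt steered by in-context examples, and it scores the whole group with the standard GRPO group-normalized advantage. The helper `build_steered_prompt`, the prompt layout, and the toy rewards are all hypothetical.

```python
from statistics import mean, pstdev


def build_steered_prompt(question, demos):
    """Hypothetical helper: prepend worked examples from an existing dataset
    so the model is steered by in-context learning (assumed prompt layout)."""
    parts = [f"Question: {q}\nSolution: {s}" for q, s in demos]
    parts.append(f"Question: {question}\nSolution:")
    return "\n\n".join(parts)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward is normalized by the mean and
    standard deviation of its own rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Steered prompt built from one toy demonstration (illustrative data only).
prompt = build_steered_prompt(
    "What is 7 * 6?",
    [("What is 2 + 3?", "2 + 3 = 5")],
)

# Toy verifiable rewards: 1.0 = correct answer, 0.0 = wrong answer.
on_policy_rewards = [0.0, 0.0, 1.0]  # sampled from the plain prompt
steered_rewards = [1.0]              # sampled from the steered prompt

# Mixed-policy group: both kinds of rollouts share one normalization.
advantages = group_relative_advantages(on_policy_rewards + steered_rewards)
print(advantages)
```

In this toy group the correct rollouts get a positive advantage and the wrong ones a negative one. A steered rollout that succeeds where the plain samples fail is the kind of signal that could pull the policy beyond its current distribution.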
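Ideas for real-life uses
💡 Your very own personal AI assistant, powered by smart AI! Schedule management, talking through your worries, it handles it all 💖
💡 AI that writes tough reports and documents for you! You save time, so you can spend it cheering on your faves 😉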


Think Outside the Policy: In-Context Steered Policy Optimization

Hsiu-Yuan Huang / Chenming Tang / Weijie Liu / Clive Bai / Saiyong Yang / Yunfang Wu

Existing Reinforcement Learning from Verifiable Rewards (RLVR) methods, such as Group Relative Policy Optimization (GRPO), have achieved remarkable progress in improving the reasoning capabilities of Large Reasoning Models (LRMs). However, they exhibit limited exploration due to reliance on on-policy rollouts which are confined to the current policy's distribution, resulting in narrow trajectory diversity. Recent approaches attempt to expand policy coverage by incorporating trajectories generated from stronger expert models, yet this reliance increases computational cost and such advanced models are often inaccessible. To address these issues, we propose In-Context Steered Policy Optimization (ICPO), a unified framework that leverages the inherent in-context learning capability of LRMs to provide expert guidance using existing datasets. ICPO introduces mixed-policy GRPO with implicit expert forcing, which expands exploration beyond the current policy distribution without requiring advanced LRM trajectories. To further stabilize optimization, ICPO integrates expert region reject sampling to filter unreliable off-policy trajectories and annealed expert-bonus reward shaping to balance early expert guidance with later autonomous improvement. Results demonstrate that ICPO consistently enhances RLVR performance and training stability on mathematical reasoning benchmarks, revealing a scalable and effective RLVR paradigm for LRMs. Our code is available at https://anonymous.4open.science/r/ICPO.
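The abstract names two stabilizers but does not give their formulas, so the following Python sketch is only one possible reading. It assumes that the reject-sampling step keeps an expert-guided rollout only when its verified reward is positive, and that the expert bonus decays linearly to zero over training. The field names `from_expert` and `reward`, the value of `max_bonus`, and the linear schedule are assumptions, not details from the paper.

```python
def annealed_expert_bonus(step, total_steps, max_bonus=0.5):
    """Assumed schedule: bonus falls linearly from max_bonus at step 0
    to 0.0 at total_steps, and stays at 0.0 afterwards."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return max_bonus * (1.0 - frac)


def filter_expert_rollouts(rollouts):
    """Assumed stand-in for expert region reject sampling: drop
    expert-guided rollouts that the verifier marked as wrong."""
    return [r for r in rollouts if not r["from_expert"] or r["reward"] > 0]


def shaped_reward(rollout, step, total_steps):
    """Add the annealed bonus only to correct expert-guided rollouts."""
    if rollout["from_expert"] and rollout["reward"] > 0:
        return rollout["reward"] + annealed_expert_bonus(step, total_steps)
    return rollout["reward"]


# Toy rollouts with verifiable rewards (illustrative data only).
rollouts = [
    {"from_expert": False, "reward": 0.0},
    {"from_expert": False, "reward": 1.0},
    {"from_expert": True, "reward": 1.0},
    {"from_expert": True, "reward": 0.0},  # unreliable: filtered out
]

kept = filter_expert_rollouts(rollouts)
early = [shaped_reward(r, step=0, total_steps=100) for r in kept]
late = [shaped_reward(r, step=100, total_steps=100) for r in kept]
print(early, late)
```

Under these assumptions, a correct expert-guided rollout is rewarded more than a correct on-policy one early in training, and the two are treated the same by the end, which matches the stated goal of balancing early expert guidance with later autonomous improvement.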

cs / cs.LG
