Published: 2026/1/7 7:22:58

Reasoning, leveled up! What is AMIR-GRPO? ✨

Ultra-quick summary: it's a new method that makes LLM reasoning much smarter, fast! 🚀

● You get DPO-style preference learning, but without the hassle of collecting extra preference data! 🌟
● It evaluates responses fairly, without being skewed by sequence (response) length! ✨
● Smarter AI means all kinds of services could get even more convenient 💖

Detailed Explanation

Background: LLMs (large language models) can do all sorts of things, but reasoning is still a bit of a weak spot 💦 To make them smarter, a method called GRPO (Group Relative Policy Optimization) is widely used, but it has several problems 😢
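To ground the GRPO idea mentioned above: GRPO scores a group of rollouts for the same prompt and normalizes each reward against its own group's statistics. This is a minimal sketch of that group-relative advantage computation; the function name and the population-std/zero-division fallback are illustrative choices, not taken from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: each rollout's
    reward is normalized by the mean and std of its own group, so
    'good' and 'bad' are defined relative to sibling rollouts."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # fallback avoids div-by-zero on constant groups
    return [(r - mu) / sigma for r in rewards]

# A group of 4 rollouts for one prompt, scored 1.0 (correct) / 0.0 (incorrect):
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # → [1.0, -1.0, 1.0, -1.0]
```

Note how a group where every rollout gets the same reward yields zero advantage everywhere, i.e. no learning signal — one reason the per-group ranking information the paper exploits is otherwise wasted.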

Read the rest in the "らくらく論文" app

AMIR-GRPO: Inducing Implicit Preference Signals into GRPO

Amir Hossein Yari / Fajri Koto

Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into a denser set of supervision constraints. Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond the subset of instances solved by standard GRPO.
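The abstract describes augmenting GRPO with an implicit DPO-style contrastive regularizer built from intra-group reward rankings. The paper's exact loss is not given here, so the following is only a rough sketch of that idea under assumptions: every pair of rollouts in a group with unequal rewards is treated as a "chosen vs. rejected" pair, and a Bradley-Terry-style log-sigmoid term pushes the policy log-probability of the higher-reward rollout above the lower-reward one. The function name, `beta`, and the averaging over pairs are all hypothetical.

```python
import math
from itertools import combinations

def implicit_pairwise_regularizer(rewards, logps, beta=1.0):
    """Hypothetical sketch of a DPO-style contrastive term derived
    from intra-group reward rankings (no extra annotations needed):
    each unequal-reward pair becomes one preference constraint."""
    loss, pairs = 0.0, 0
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] == rewards[j]:
            continue  # ties carry no preference signal
        hi, lo = (i, j) if rewards[i] > rewards[j] else (j, i)
        # -log sigmoid(beta * (logp_chosen - logp_rejected))
        margin = beta * (logps[hi] - logps[lo])
        loss += -math.log(1.0 / (1.0 + math.exp(-margin)))
        pairs += 1
    return loss / max(pairs, 1)

# One group of rollouts: rewards from a verifier, logps from the policy.
# The loss is small when the high-reward rollout is already more likely,
# and large when the policy prefers the low-reward one.
good = implicit_pairwise_regularizer([1.0, 0.0], logps=[-1.0, -2.0])
bad = implicit_pairwise_regularizer([1.0, 0.0], logps=[-2.0, -1.0])
```

This illustrates why the abstract calls each rollout group "a denser set of supervision constraints": a group of n rollouts contributes up to n·(n−1)/2 pairwise terms on top of the scalar GRPO objective, and low-reward trajectories appear as the rejected side of many pairs, amplifying their suppression.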

cs / cs.LG / cs.AI