Published: 2026/1/7 7:22:58

Reasoning, leveled up! What is AMIR-GRPO? ✨

Ultra-quick summary: it's a new method that makes LLM reasoning much smarter, fast! 🚀

● You get DPO-style preference learning, but without the hassle of collecting extra preference data! 🌟
● It evaluates responses fairly, without being skewed by sequence (response) length! ✨
● Smarter AI means all kinds of services could get even more convenient 💖

Detailed Explanation

Background: LLMs (large language models) can do all sorts of things, but reasoning is still a bit of a weak spot 💦 To make them smarter, a method called GRPO (Group Relative Policy Optimization) is widely used, but it has several problems 😢
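To ground the GRPO idea mentioned above: GRPO scores a group of rollouts for the same prompt and normalizes each reward against its own group's statistics. This is a minimal sketch of that group-relative advantage computation; the function name and the population-std/zero-division fallback are illustrative choices, not taken from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: each rollout's
    reward is normalized by the mean and std of its own group, so
    'good' and 'bad' are defined relative to sibling rollouts."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # fallback avoids div-by-zero on constant groups
    return [(r - mu) / sigma for r in rewards]

# A group of 4 rollouts for one prompt, scored 1.0 (correct) / 0.0 (incorrect):
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])  # → [1.0, -1.0, 1.0, -1.0]
```

Note how a group where every rollout gets the same reward yields zero advantage everywhere, i.e. no learning signal — one reason the per-group ranking information the paper exploits is otherwise wasted.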

Read the rest in the "らくらく論文" app

AMIR-GRPO: Inducing Implicit Preference Signals into GRPO

Amir Hossein Yari / Fajri Koto

Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into a denser set of supervision constraints. Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond the subset of instances solved by standard GRPO.
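The abstract describes augmenting GRPO with an implicit DPO-style contrastive regularizer built from intra-group reward rankings. The paper's exact loss is not given here, so the following is only a rough sketch of that idea under assumptions: every pair of rollouts in a group with unequal rewards is treated as a "chosen vs. rejected" pair, and a Bradley-Terry-style log-sigmoid term pushes the policy log-probability of the higher-reward rollout above the lower-reward one. The function name, `beta`, and the averaging over pairs are all hypothetical.

```python
import math
from itertools import combinations

def implicit_pairwise_regularizer(rewards, logps, beta=1.0):
    """Hypothetical sketch of a DPO-style contrastive term derived
    from intra-group reward rankings (no extra annotations needed):
    each unequal-reward pair becomes one preference constraint."""
    loss, pairs = 0.0, 0
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] == rewards[j]:
            continue  # ties carry no preference signal
        hi, lo = (i, j) if rewards[i] > rewards[j] else (j, i)
        # -log sigmoid(beta * (logp_chosen - logp_rejected))
        margin = beta * (logps[hi] - logps[lo])
        loss += -math.log(1.0 / (1.0 + math.exp(-margin)))
        pairs += 1
    return loss / max(pairs, 1)

# One group of rollouts: rewards from a verifier, logps from the policy.
# The loss is small when the high-reward rollout is already more likely,
# and large when the policy prefers the low-reward one.
good = implicit_pairwise_regularizer([1.0, 0.0], logps=[-1.0, -2.0])
bad = implicit_pairwise_regularizer([1.0, 0.0], logps=[-2.0, -1.0])
```

This illustrates why the abstract calls each rollout group "a denser set of supervision constraints": a group of n rollouts contributes up to n·(n−1)/2 pairwise terms on top of the scalar GRPO objective, and low-reward trajectories appear as the rejected side of many pairs, amplifying their suppression.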

cs / cs.LG / cs.AI