Published: 2026/1/7 3:24:59

Boost Your LLM's Math Power with DRA-GRPO! 🚀

Ultra-short summary: A new technique for boosting LLMs' math skills! It scores how diverse the answers are and makes the model smarter! ✨

Gal-Style Sparkle Points ✨

● Adds the concept of diversity to GRPO! 🥳
● Works even with little data! Unbeatable cost-performance ✨
● Has the potential to shine in all kinds of fields! The future looks bright ♪

Detailed Explanation

Background
LLMs (large language models) have amazing brains, but they can be a little weak at math problems 🤔 It turns out that training them with GRPO (a kind of reinforcement learning) makes them smarter! But with GRPO alone, there's a catch: the answer patterns collapse into a narrow rut... 🌀

Method
Enter the hero: DRA (Diversity-aware Reward Adjustment) ✨ It adds a mechanism to GRPO that scores the "diversity" of the sampled answers! Apparently it uses SMI (Submodular Mutual Information) to check how varied the answer patterns really are! 🔍
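In plain terms: GRPO samples a group of answers and rewards the correct ones, and DRA then scales each reward by how novel that answer is within its group. Here is a minimal sketch of that idea, assuming cosine similarity over answer embeddings as the semantic kernel. The function name `diversity_adjusted_rewards`, the knob `lam`, and the exact density formula are illustrative assumptions, not the paper's precise SMI construction.

```python
import numpy as np

def diversity_adjusted_rewards(rewards, embeddings, lam=1.0):
    """Hypothetical sketch of Diversity-aware Reward Adjustment (DRA).

    rewards:    (G,) scalar correctness rewards for G sampled answers
    embeddings: (G, d) semantic embeddings of those answers (G >= 2)
    lam:        strength of the diversity adjustment (assumed knob)
    """
    rewards = np.asarray(rewards, dtype=float)
    # Cosine-similarity kernel between answers -- one common choice of
    # semantic kernel; the paper's SMI instantiation may differ.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T  # (G, G)

    # Semantic density of each answer: average similarity to the rest of
    # the group (self-similarity of 1.0 removed). High density = redundant.
    density = (sim.sum(axis=1) - 1.0) / (len(rewards) - 1)
    density = np.clip(density, 0.0, 1.0)  # guard against negative cosines

    # Inverse-propensity-style reweighting: redundant answers get their
    # reward scaled down, structurally novel ones keep more of it.
    return rewards / (1.0 + lam * density)
```

The intuition: an answer that looks like everything else in its group gets downweighted, so the policy gradient stops reinforcing only the dominant mode.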


DRA-GRPO: Your GRPO Needs to Know Diverse Reasoning Paths for Mathematical Reasoning

Xiwen Chen / Wenhui Zhu / Peijie Qiu / Xuanzhao Dong / Hao Wang / Haiyu Wu / Huayu Li / Aristeidis Sotiras / Yalin Wang / Abolfazl Razi

Post-training LLMs with Reinforcement Learning, specifically Group Relative Policy Optimization (GRPO), has emerged as a paradigm for enhancing mathematical reasoning. However, standard GRPO relies on scalar correctness rewards that are often non-injective with respect to semantic content: distinct reasoning paths receive identical rewards. This leads to a Diversity-Quality Inconsistency, where the policy collapses into a narrow set of dominant modes while ignoring equally valid but structurally novel strategies. To bridge this gap, we propose Diversity-aware Reward Adjustment (DRA), a theoretically grounded framework that calibrates the reward signal using the semantic density of sampled groups. By leveraging Submodular Mutual Information (SMI), DRA implements an Inverse Propensity Scoring (IPS) mechanism that effectively de-biases the gradient estimation. This creates a repulsive force against redundancy, driving the policy to achieve better coverage of the high-reward landscape. Our method is plug-and-play and integrates seamlessly with GRPO variants. Empirical evaluations on five math benchmarks demonstrate that DRA-GRPO consistently outperforms strong baselines, achieving an average accuracy of 58.2% on DeepSeek-R1-Distill-Qwen-1.5B with only 7,000 training samples and $55 cost, highlighting the critical role of diversity calibration in data-efficient alignment.
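Since the abstract describes DRA as plug-and-play with GRPO variants, the natural hookup point is the group-relative advantage: feed the calibrated rewards into the usual within-group normalization. A minimal sketch, assuming the standard GRPO mean/std normalization; the paper may use a different variant, and the toy numbers below are hypothetical.

```python
import numpy as np

def grpo_advantages(calibrated_rewards, eps=1e-6):
    """Standard GRPO group-relative advantage: normalize rewards within
    the sampled group. With DRA, the input is the diversity-calibrated
    reward vector instead of the raw correctness scores."""
    r = np.asarray(calibrated_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy usage, continuing the diversity_adjusted_rewards sketch above:
# three correct answers, the first two nearly identical in meaning.
rewards = [1.0, 1.0, 1.0]
embeddings = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]])
adv = grpo_advantages(diversity_adjusted_rewards(rewards, embeddings))
# The two redundant answers end up with negative advantages, while the
# structurally novel one is pushed up -- the "repulsive force" against
# redundancy that the abstract describes.
```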

cs / cs.CL / cs.LG