Published: 2026/1/11 7:34:41

Even the Ultimate Gyaru Approves! GDEPO Super-Evolves AI Theorem Proving 💖

  1. Ultra-short summary: A set of genius tricks for getting AI to solve proof problems way more cleverly! ✨

  2. Gyaru-style sparkle points:

    • A godly technique that stops data from going to waste! ✨
    • An equal-rights mindset that removes reward bias! 🤝
    • Never gives up, even on hard problems! Pure persistence! 🔥
  3. Detailed explanation:

    • Background: Researchers want AI to solve formal math proofs (ATP, Automated Theorem Proving), but data kept going to waste and things often didn't work out 😭 LLMs (large language models) are smart, but using them for ATP needed one more clever twist!
    • Method: They developed a new method called GDEPO! Three killer techniques boost the AI's proving power 🚀 First, "dynamic additional sampling" to stop data from going to waste. Next, "equal-right advantage" to remove reward bias. Finally, "dynamic additional iterations" to keep working on the hard problems! (See the code sketch right after this list!)
    • Results: With GDEPO, ATP problems get solved way more efficiently! The data is put to better use, and the AI gets smarter too 😎🎉
    • Significance (the "this is wild ♡" point): If AI gets this much smarter, it could shine in software development and tons of other fields! For example, it could catch bugs (mistakes in programs) and make AI more trustworthy, so our lives might get even more convenient 💕
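
To make the "equal-right advantage" idea concrete, here is a minimal Python sketch. Everything beyond the sign/magnitude split described in the paper is an assumption: the function name, the clipping constants, and the group-normalization of the auxiliary reward are illustrative choices, not the authors' exact formula.

```python
import numpy as np

def equal_right_advantage(verified, aux_rewards, eps=1e-8):
    """Toy version of GDEPO's equal-right advantage.

    verified:    list of bools -- did the formal verifier (e.g. Lean)
                 accept each sampled proof?
    aux_rewards: list of floats -- auxiliary reward per proof (which
                 auxiliary signals to use is an assumption here).

    The SIGN of each advantage comes only from the binary verifier
    verdict, so auxiliary rewards can never push a correct proof toward
    a negative update (or a wrong proof toward a positive one). The
    MAGNITUDE is modulated by the group-normalized auxiliary reward.
    """
    verified = np.asarray(verified, dtype=bool)
    aux = np.asarray(aux_rewards, dtype=float)

    sign = np.where(verified, 1.0, -1.0)            # correctness fixes the sign
    z = (aux - aux.mean()) / (aux.std() + eps)      # group-relative aux score
    magnitude = 1.0 + 0.5 * np.clip(z, -1.0, 1.0)   # stays in [0.5, 1.5] > 0
    return sign * magnitude

# Example: a group of 4 sampled proofs for one theorem
adv = equal_right_advantage(
    verified=[True, False, True, False],
    aux_rewards=[0.9, 0.4, 0.5, 0.1],
)
print(adv)  # verified proofs get positive advantages, failures negative
```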
  4. Real-world use-case ideas:

    • 💡 AI tutors that help you breeze through math problems! 👩‍🏫
    • 💡 Self-driving cars that might become even safer! 🚗


GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning

Zhengqing Yan / Xinyang Liu / Yi Zhang / Fan Guo / Yao Liu / Junchen Wan / Kang Song

Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI), requiring the construction of machine-verifiable proofs in formal languages such as Lean to evaluate AI reasoning capabilities. Reinforcement learning (RL), particularly the high-performance Group Relative Policy Optimization (GRPO) algorithm, has emerged as a mainstream approach for this task. However, in ATP scenarios, GRPO faces two critical issues: when composite rewards are used, its relative advantage estimation may conflict with the binary feedback from the formal verifier; meanwhile, its static sampling strategy may discard entire batches of data if no valid proof is found, resulting in zero contribution to model updates and significant data waste. To address these limitations, we propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO), a method incorporating three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, decoupling the sign of the advantage function (based on correctness) from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, applying extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. Experiments conducted on three datasets of varying difficulty (MiniF2F-test, MathOlympiadBench, PutnamBench) confirm the effectiveness of GDEPO, while ablation studies validate the necessity of its synergistic components. The proposed method enhances data utilization and optimization efficiency, offering a novel training paradigm for ATP.
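
For readers who think in code, here is a hedged sketch of how the three mechanisms described in the abstract might compose into a single training step. The control flow follows the abstract; every interface name (sample_group, verify, aux_score, update) and every constant (max_resamples, extra_iters, the exact advantage formula) is a placeholder assumption, not the authors' implementation.

```python
import numpy as np

def gdepo_step(policy, theorem, sample_group, verify, aux_score, update,
               max_resamples=4, extra_iters=2):
    """Sketch of one GDEPO update for a single theorem statement.

    Assumed interfaces (not from the paper):
      sample_group(policy, theorem) -> list of candidate proof strings
      verify(proof)                 -> bool, formal verifier verdict (e.g. Lean)
      aux_score(proof)              -> float, auxiliary reward
      update(policy, proofs, advantages) -> one policy-gradient step
    """
    # 1) Dynamic additional sampling: resample a group that produced no
    #    valid proof instead of discarding it, as plain GRPO would.
    for attempt in range(max_resamples):
        proofs = sample_group(policy, theorem)
        results = [verify(p) for p in proofs]
        if any(results):
            break
    else:
        return  # sampling budget exhausted with no valid proof; skip update

    # 2) Equal-right advantage: sign comes from the verifier verdict,
    #    magnitude is modulated by group-normalized auxiliary rewards
    #    (illustrative form, see the earlier sketch).
    aux = np.array([aux_score(p) for p in proofs])
    sign = np.where(results, 1.0, -1.0)
    z = (aux - aux.mean()) / (aux.std() + 1e-8)
    advantages = sign * (1.0 + 0.5 * np.clip(z, -1.0, 1.0))
    update(policy, proofs, advantages)

    # 3) Dynamic additional iterations: theorems that only succeeded after
    #    resampling count as hard cases and receive extra gradient steps.
    if attempt > 0:
        for _ in range(extra_iters):
            update(policy, proofs, advantages)
```

The design choice visible in this sketch is that resampling and extra iterations both route additional compute toward hard theorems, while the advantage construction keeps the verifier's binary verdict authoritative over any auxiliary reward.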

cs / cs.AI