Published: 2025/12/17 6:15:52

Math LLMs, Leveled Up! Verifier-Guided DPO🚀

  1. Super Summary: A new technique for powering up math LLMs (large language models)! It sniffs out deceptively-wrong answers and makes the model smarter with DPO 😉

  2. Gal-Style Sparkle Points ✨

    • Spots the easy-to-miss mistakes! 😎 MathVerifier finds hard negatives (answers that look correct at first glance but are actually wrong)!
    • Gets smarter with DPO! ✨ No reward model needed, so it's an easy way to boost math ability ⤴️
    • Useful in all kinds of fields! 💡 Education, finance… LLMs are about to shine in way more places 💕
  3. Detailed Explanation

    • Background: LLMs are bad at math, right? Calculation slips, broken logic, the usual 🤣 But the answers can still look correct on the surface 💦 Catching those was the hard part!
    • Method: They use this cool thing called MathVerifier! It checks the LLM's answer from six angles 👀 It finds the flawed parts and trains the model with DPO (Direct Preference Optimization)!
    • Results: No reward model needed, so costs stay low 👍 And the LLM's math ability goes way up! It really does get smarter 💖
    • Significance: Better math means AI can shine in more places! Education, finance, all sorts of fields, so it's a big deal! The future of AI looks exciting 😍
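The six-angle check and hard-negative mining described above can be sketched roughly like this. The dimension names, aggregation rules, and thresholds below are illustrative assumptions, not the paper's exact definitions:

```python
# Sketch of MathVerifier-style scoring (assumed details, not the paper's exact spec).
# A candidate solution gets a 6-dimensional error profile; the scores are
# aggregated into "wrongness" and "absurdity", and hard negatives are answers
# that look almost right (low absurdity) yet are still flawed (nonzero wrongness).

def aggregate_profile(profile):
    """profile: dict mapping 6 error dimensions to scores in [0, 1]."""
    wrongness = sum(profile.values()) / len(profile)  # overall flaw level
    absurdity = max(profile.values())                 # worst single flaw
    return wrongness, absurdity

def is_hard_negative(profile, wrong_lo=0.1, absurd_hi=0.5):
    """Near-correct but structurally flawed: flawed enough to be wrong,
    but not so blatantly broken that it is an easy negative."""
    wrongness, absurdity = aggregate_profile(profile)
    return wrongness > wrong_lo and absurdity < absurd_hi

# A subtly flawed solution: small errors spread across a few dimensions.
subtle = {"logic": 0.4, "algebra": 0.3, "numeric": 0.2,
          "units": 0.1, "setup": 0.0, "final": 0.2}
print(is_hard_negative(subtle))   # → True (looks plausible, but flawed)

# A blatantly broken solution: an easy negative, less useful for DPO pairs.
blatant = {k: 0.9 for k in subtle}
print(is_hard_negative(blatant))  # → False
```

The point of the filter is that obviously-wrong answers teach the model little, while near-misses carry the most training signal.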
  4. Real-World Use-Case Ideas 💡

    • A math app feature that points out the spots where you tend to slip up! 💯 It analyzes your mistakes and gives advice, so studying gets way easier 💖
    • A finance AI that does money calculations super accurately! 💰 No mistakes, so you can use it with peace of mind ✨

Continued in the 「らくらく論文」 app

Hard Negative Sample-Augmented DPO Post-Training for Small Language Models

Haocheng Lu / Minjun Zhu / Henry Yu

Large language models (LLMs) continue to struggle with mathematical reasoning, and common post-training pipelines often reduce each generated solution to a binary outcome: correct or incorrect. This perspective is limiting in practice, as failures in chain-of-thought (CoT) reasoning are frequently structured; solutions may appear convincing while containing subtle logical, algebraic, or numerical flaws. Meanwhile, reinforcement learning from human feedback (RLHF) variants that rely on large reward models or LLM-as-a-judge signals are often expensive, difficult to scale, and unstable to iterate. We propose a lightweight and pragmatic post-training pipeline that targets such structured errors under realistic compute budgets. Starting from supervised fine-tuning (SFT) on MetaMathQA-style CoT data, we introduce a compact MathVerifier that decomposes a candidate solution into a six-dimensional error profile and aggregates it into interpretable wrongness and absurdity scores. These verifier signals serve two roles: (i) mining hard negatives that are near-correct yet structurally flawed, and (ii) defining per-sample importance weights that emphasize the most informative preference pairs. We integrate both into an offline Direct Preference Optimization (DPO) objective via a verifier-guided weighted formulation. Experiments on a 1.5B-parameter Qwen2.5 model show that verifier-guided, weighted DPO yields more targeted improvements than vanilla SFT and unweighted DPO, particularly on problems where solutions are numerically close to correct but logically inconsistent, while avoiding the overhead of training large reward models or relying on external judges.
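The verifier-guided weighted DPO objective described in the abstract can be sketched as a per-pair importance-weighted version of the standard DPO loss. This is a minimal pure-Python sketch; the weighting scheme and the β value are assumptions, not the paper's exact formulation:

```python
import math

def logsigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return x - math.log1p(math.exp(x)) if x < 0 else -math.log1p(math.exp(-x))

def weighted_dpo_loss(pairs, beta=0.1):
    """Verifier-guided weighted DPO loss (sketch).

    pairs: list of (pi_chosen, pi_rejected, ref_chosen, ref_rejected, w),
    where the first four are sequence log-probs under the policy and the
    frozen reference model, and w is the verifier-derived importance
    weight (larger for the most informative hard-negative pairs)."""
    num, den = 0.0, 0.0
    for pi_c, pi_r, ref_c, ref_r, w in pairs:
        margin = beta * ((pi_c - pi_r) - (ref_c - ref_r))
        num += w * -logsigmoid(margin)  # standard DPO per-pair loss
        den += w
    return num / den                    # weighted average over pairs

# With zero margin, each pair contributes -log(sigmoid(0)) = log 2 ≈ 0.693.
print(weighted_dpo_loss([(-5.0, -5.0, -5.0, -5.0, 1.0)]))
```

Because the weights only rescale an offline objective, the pipeline avoids training a separate reward model, which is the cost advantage the abstract emphasizes.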

cs / cs.LG