Published: 2026/1/8 14:49:10

LLMs That Learn from Mistakes! Unstoppable Out-of-Domain Too 🔥

Ultra-short summary: LLMs get smarter by putting even their wrong answers to work 💖

✨ Gal-Style Sparkle Points ✨
● Even the losing rounds (negative reasoning) aren't wasted! Bye-bye, LLM weak spots~!
● A new training method, "GLOW," is born! Training efficiency goes up too 🎵
● AI across the IT industry is about to shine even more! Services leveling way up 🚀

Here comes the detailed breakdown~!

Background: LLMs (large language models) are amazing, but they're a little weak on domains they don't know... 😱 That's the weakness we want to overcome!

Continue reading in the らくらく論文 app

Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization

Xueyun Tian (CAS Key Laboratory of AI Safety / Institute of Computing Technology / CAS / Beijing / China / University of Chinese Academy of Sciences / Beijing / China) / Minghua Ma (Harbin Institute of Technology / Harbin / China) / Bingbing Xu (CAS Key Laboratory of AI Safety / Institute of Computing Technology / CAS / Beijing / China / Tsinghua University / Beijing / China) / Nuoyan Lyu (CAS Key Laboratory of AI Safety / Institute of Computing Technology / CAS / Beijing / China / University of Chinese Academy of Sciences / Beijing / China) / Wei Li (Tsinghua University / Beijing / China) / Heng Dong (Tsinghua University / Beijing / China) / Zheng Chu (Harbin Institute of Technology / Harbin / China) / Yuanzhuo Wang (CAS Key Laboratory of AI Safety / Institute of Computing Technology / CAS / Beijing / China) / Huawei Shen (CAS Key Laboratory of AI Safety / Institute of Computing Technology / CAS / Beijing / China / University of Chinese Academy of Sciences / Beijing / China)

Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically retain only trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.
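The abstract only describes GLOW at a high level, so here is a minimal PyTorch sketch of what a gain-based, sample-aware loss reweighting could look like. The helper names (per_sample_nll, glow_weights), the softmax weighting form, and the hyperparameter alpha are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of gain-based per-sample loss reweighting in the spirit of GLOW.
# The exact weighting formula is an assumption for illustration only.
import torch
import torch.nn.functional as F

def per_sample_nll(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Token-level cross-entropy averaged per sequence (ignore_index=-100 masks prompt tokens)."""
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len)
    loss_tok = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )  # (batch, seq_len)
    mask = (labels != -100).float()
    return (loss_tok * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

def glow_weights(curr_loss: torch.Tensor,
                 prev_epoch_loss: torch.Tensor,
                 alpha: float = 1.0) -> torch.Tensor:
    """
    Hypothetical gain-based weights: samples whose loss dropped sharply since the
    previous epoch (large inter-epoch "gain") are likely being memorized, so they
    are down-weighted; samples still making slow progress keep more weight.
    """
    gain = prev_epoch_loss - curr_loss           # inter-epoch progress per sample
    w = torch.softmax(-alpha * gain, dim=0)      # larger gain -> smaller weight
    return w * gain.numel()                      # keep the mean weight near 1

# Inside the SFT loop, positives and negatives both stay in the dataset:
#   curr = per_sample_nll(model(input_ids).logits, labels)
#   loss = (glow_weights(curr.detach(), prev_epoch_loss[batch_idx]) * curr).mean()
#   loss.backward()
```

The point this sketch mirrors from the abstract is that every trajectory, correct or not, contributes to the loss, with its weight adapted to how much that sample's loss changed between epochs.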

cs / cs.CL