Published: 2025/8/22 16:49:10

"Try Again" Wakes Up Your LLM ✨ Multi-Turn Reasoning Skyrockets! 🚀

Super summary: Just saying "Try Again" makes LLMs (Large Language Models) smarter and better at solving all kinds of problems! Amazing! 🤩

✨ Gyaru-Style Sparkle Points ✨

● Just "Try Again"! A simple-is-best strategy for making LLMs smarter 💖
● Chatbots and other tools could get smarter, making our lives even more convenient 🎵
● It's released as open source, so everyone gets a chance to try it 🎉

Detailed Explanation

Read the rest in the「らくらく論文」app

A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning

Licheng Liu / Zihan Wang / Linjie Li / Chenwei Xu / Yiping Lu / Han Liu / Avirup Sil / Manling Li

Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs) to reflect on their reasoning and revise from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO maintains single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to better react to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
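
To make the setup in the abstract concrete, here is a minimal sketch (not the authors' implementation) of a UFO-style multi-turn rollout: after each wrong answer, the only feedback appended to the context is a unary "Let's try again", and the reward decays with the number of turns so that earlier correct answers score higher. The function names `generate` and `is_correct`, the turn budget, and the penalty value are all illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a multi-turn rollout with unary feedback only (UFO-style).
# `generate` and `is_correct` are hypothetical stand-ins for a policy model
# and a verifiable-reward checker (e.g., exact-match grading on math answers).

from typing import Callable, List, Tuple


def ufo_rollout(
    question: str,
    generate: Callable[[List[dict]], str],   # policy: chat history -> answer
    is_correct: Callable[[str], bool],       # verifiable reward signal
    max_turns: int = 5,                      # assumed turn budget
    turn_penalty: float = 0.2,               # assumed per-turn reward decay
) -> Tuple[List[dict], float]:
    """Run one episode; return the chat history and a scalar reward."""
    history = [{"role": "user", "content": question}]
    for turn in range(max_turns):
        answer = generate(history)
        history.append({"role": "assistant", "content": answer})
        if is_correct(answer):
            # Fewer turns to a correct answer -> higher reward.
            reward = max(1.0 - turn_penalty * turn, 0.0)
            return history, reward
        # Unary feedback: no hint about *why* the answer was wrong.
        history.append({"role": "user", "content": "Let's try again."})
    return history, 0.0  # unsolved within the turn budget
```

Trajectories and rewards collected this way could then be fed to any standard single-turn RL trainer, which is the sense in which the abstract says UFO "can be easily applied to existing single-turn RL training setups."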

cs / cs.LG / cs.AI