The Strongest LLM Leaps Across Domains! ✨🚀

Published: 2026/1/7 4:36:05

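We're spilling the secret of how this super-smart AI cutie solves all kinds of problems at lightning speed! Smarts breaking through their limits, and hopes are sky-high 💕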

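✨ Sparkly Gyaru Highlights ✨
● It only ever studied math, yet it breezes through problems in other fields too! Total genius, right? 🤩
● With its "Plan-Action-Reflection" cycle, it thinks things through just like a human! So smart 👏
● Even reeeally long problems get solved properly, because training stays stable! Pretty amazing 💖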

Here comes the detailed breakdown~!

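Background: LLMs (large language models) are crazy smart, but they had one problem: they were only strong in specific fields. So this research kicked off with the goal of getting them to shine across all kinds of fields 🤔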

Reinforcement Learning for Tool-Integrated Interleaved Thinking towards Cross-Domain Generalization

Zhengyu Chen / Jinluan Yang / Teng Xiao / Ruochen Zhou / Luan Zhang / Xiangyu Xi / Xiaowei Shi / Wei Wang / Jinggang Wang

Recent advances in large language models (LLMs) have demonstrated remarkable capabilities in reasoning and tool utilization. However, the generalization of tool-augmented reinforcement learning (RL) across diverse domains remains a significant challenge. Standard paradigms often treat tool usage as a linear or isolated event, which becomes brittle when transferring skills from restricted domains (e.g., mathematics) to open-ended tasks. In this work, we investigate the cross-domain generalization of an LLM agent trained exclusively on mathematical problem-solving. To facilitate robust skill transfer, we propose Reinforcement Learning for Interleaved Tool Execution (RITE). Unlike traditional methods, RITE enforces a continuous "Plan-Action-Reflection" cycle, allowing the model to ground its reasoning in intermediate tool outputs and self-correct during long-horizon tasks. To effectively train this complex interleaved policy, we introduce Dr. GRPO, a robust optimization objective that utilizes token-level loss aggregation with importance sampling to mitigate reward sparsity and high-variance credit assignment. Furthermore, we employ a dual-component reward system and dynamic curriculum via online rollout filtering to ensure structural integrity and sample efficiency. Extensive experiments reveal that our approach, despite being trained solely on math tasks, achieves state-of-the-art performance across diverse reasoning domains, demonstrating high token efficiency and strong generalization capabilities.
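The abstract gives no formulas, so the following is only a rough sketch of what "token-level loss aggregation with importance sampling" and "online rollout filtering" could look like, not the authors' implementation. It assumes a PPO-style clipped surrogate, mean-centered group advantages, and normalization by the total token count. The function names (`group_advantages`, `keep_group`, `token_level_loss`) and the `clip_eps` parameter are hypothetical.

```python
import math


def group_advantages(rewards):
    """Center rewards within a group of rollouts sampled for one prompt (assumed form)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def keep_group(rewards):
    """Assumed filtering rule: drop groups whose rollouts all share one reward,
    because their centered advantages are all zero and carry no learning signal."""
    return max(rewards) > min(rewards)


def token_level_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """Clipped importance-sampling surrogate, averaged over all tokens in the group.

    new_logps / old_logps: one list of per-token log-probabilities per rollout.
    advantages: one scalar advantage per rollout, shared by all of its tokens.
    """
    total, n_tokens = 0.0, 0
    for new, old, adv in zip(new_logps, old_logps, advantages):
        for lp_new, lp_old in zip(new, old):
            ratio = math.exp(lp_new - lp_old)  # importance weight pi_new / pi_old
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            total += -min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Every token is weighted equally, regardless of which rollout it came from.
    return total / n_tokens


rewards = [1.0, 0.0]
advs = group_advantages(rewards)      # [0.5, -0.5]
logps = [[-1.0, -2.0, -0.5], [-1.5]]  # a 3-token rollout and a 1-token rollout
loss = token_level_loss(logps, logps, advs)
```

In this toy group the policy is unchanged, so every importance ratio is 1 and the loss is -0.25: the three tokens of the rewarded rollout outweigh the single token of the unrewarded one. Averaging per rollout first and then across rollouts would give 0 instead, which shows how the choice of aggregation changes the weight given to long rollouts.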

cs / cs.LG / cs.CL
