Agent Learning Gets a Dramatic Upgrade! AI That Co-Evolves with Its Critic✨

Published: 2026/1/11 7:29:08


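Ultra-short summary: The secret to raising a smarter AI! It's a training method where the AI grows up side by side with its critic💖
Gal-Style Sparkle Points✨
- A new way for AI to keep getting smarter! It's like friends pushing each other to improve😳
- Instead of stale, outdated critiques, the AI gets accurate advice that's always up to date. How great is that?
- So excited for a future where AI can solve complex problems too😍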
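Detailed Breakdown
- Background: To get an AI (an LLM agent) to handle hard tasks, it needs good advice (critiques)! But once that advice goes stale, the AI stops improving…😭
- Method: The researchers came up with "ECHO", a method where the AI and the critic grow together! The critic changes along with the AI's behavior, so good advice is always on hand😉
- Results: Training got more stable, and the AI succeeded more often on tough long-horizon tasks!
- Why it matters (the seriously cool♡ part): All kinds of services could get way more convenient! Smarter AI might make our everyday lives more fun💖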
Real-World Use Ideas💡
- An AI fashion advisor that keeps you posted on the latest trends👗✨
- An AI tutor that builds a study plan just for you👨‍🏫✨


No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

Zhicong Li / Lingjie Jiang / Yulan Hu / Xingchen Zeng / Yixia Li / Xiangwen Zhang / Guanhua Chen / Zheng Pan / Xin Li / Yong Liu

Critique-guided reinforcement learning (RL) has emerged as a powerful paradigm for training LLM agents by augmenting sparse outcome rewards with natural-language feedback. However, current methods often rely on static or offline critic models, which fail to adapt as the policy evolves. In on-policy RL, the agent's error patterns shift over time, causing stationary critics to become stale and provide feedback of diminishing utility. To address this, we introduce ECHO (Evolving Critic for Hindsight-Guided Optimization), a framework that jointly optimizes the policy and critic through a synchronized co-evolutionary loop. ECHO utilizes a cascaded rollout mechanism where the critic generates multiple diagnoses for an initial trajectory, followed by policy refinement to enable group-structured advantage estimation. We address the challenge of learning plateaus via a saturation-aware gain shaping objective, which rewards the critic for inducing incremental improvements in high-performing trajectories. By employing dual-track GRPO updates, ECHO ensures the critic's feedback stays synchronized with the evolving policy. Experimental results show that ECHO yields more stable training and higher long-horizon task success across open-world environments.
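
The abstract names two ingredients that a small numeric sketch can make concrete: group-structured advantage estimation and a saturation-aware gain for the critic. The Python below is only an illustration under stated assumptions, not the authors' implementation. `group_relative_advantages` follows the usual GRPO idea of standardizing rewards within a group (population standard deviation is used here; implementations differ). `saturation_aware_gain` is a hypothetical form chosen for this example: it divides the score improvement by the remaining headroom, so a small gain on an already strong trajectory still earns the critic a meaningful reward. The abstract does not give the exact objective, and every name, score, and constant below is assumed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def saturation_aware_gain(before, after, max_score=1.0, eps=1e-6):
    """Hypothetical critic reward: improvement relative to remaining headroom.

    Not the paper's formula. The result is clipped to [-1, 1] so that a
    regression from a near-perfect score cannot produce an unbounded penalty.
    """
    headroom = max(max_score - before, eps)
    gain = (after - before) / headroom
    return max(-1.0, min(1.0, gain))


# Cascaded rollout, reduced to scores only (all values are made up):
# one initial trajectory, then one refined trajectory per critic diagnosis.
initial_score = 0.8
refined_scores = [0.8, 0.9, 0.7, 1.0]

# Policy track: refined trajectories compete within their group.
policy_advantages = group_relative_advantages(refined_scores)

# Critic track: each diagnosis is rewarded by the gain it induced.
critic_rewards = [saturation_aware_gain(initial_score, s) for s in refined_scores]
critic_advantages = group_relative_advantages(critic_rewards)
```

Under this assumed shaping, raising a score from 0.9 to 0.95 yields a larger critic reward than raising it from 0.2 to 0.25, even though the absolute improvement is the same. That is one way to read "rewards the critic for inducing incremental improvements in high-performing trajectories".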

cs / cs.AI
