Published: 2025/12/3 13:05:32

A breath of fresh air for dLLMs! ESPO sparks an RL revolution 💥

Ultra-quick summary: ESPO is here, a new framework that brings RL (reinforcement learning) to dLLMs (diffusion LLMs)! ✨

● dLLMs are tricky to train, but RL apparently makes them a lot smarter!
● ESPO is impressive because it learns from the entire sequence in one go, treating it as a single action 😳
● Across math, coding, and other tasks, it reportedly beats the existing approaches by a clear margin!

Detailed explanation

Background: LLMs (large language models) are useful for all sorts of things, but dLLMs are said to be especially impressive! The same diffusion idea powers image generation, and dLLMs are apparently good at long text too. However, applying conventional RL to dLLMs ran into a problem 💦 The way dLLMs generate text just doesn't mesh well with how token-level RL methods work.

Continue reading in the 「らくらく論文」 app

Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective

Jingyang Ou / Jiaqi Han / Minkai Xu / Shaoxuan Xu / Jianwen Xie / Stefano Ermon / Yi Wu / Chongxuan Li

Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO.
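To make the sequence-level idea concrete, here is a minimal sketch of an ESPO-style objective. This is not the authors' implementation (see their repository for that); the function name `espo_loss`, the PPO-style clipping, the k3-style KL estimator, and all tensor shapes are illustrative assumptions. It only assumes you can already compute a per-sequence ELBO under the current, old (sampling), and frozen reference policies.

```python
# Minimal sketch (not the authors' code) of a sequence-level RL objective in the
# spirit of ESPO: the whole generated sequence is one action, the ELBO stands in
# for the sequence log-likelihood, importance ratios are normalized per token,
# and a robust (non-negative) KL estimate regularizes toward a reference policy.
import torch

def espo_loss(elbo_new, elbo_old, elbo_ref, rewards, seq_lens,
              clip_eps=0.2, kl_coef=0.01):
    """Loss for a group of G sampled sequences (all inputs are shape (G,) tensors).

    elbo_new : ELBO of each full sequence under the current policy
    elbo_old : ELBO under the policy that generated the samples
    elbo_ref : ELBO under a frozen reference policy (for the KL penalty)
    rewards  : scalar reward per sequence
    seq_lens : number of tokens in each sequence
    """
    # Group-relative advantages, GRPO-style: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Sequence-level log-ratio with the ELBO as a tractable likelihood proxy,
    # divided by sequence length so long sequences do not blow up the ratio.
    log_ratio = (elbo_new - elbo_old) / seq_lens
    ratio = torch.exp(log_ratio)

    # Clipped surrogate on the whole sequence treated as a single action.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Non-negative KL estimate toward the reference policy (k3-style estimator),
    # again built from per-token-normalized ELBO differences.
    ref_log_ratio = (elbo_ref - elbo_new) / seq_lens
    kl = (torch.exp(ref_log_ratio) - ref_log_ratio - 1.0).mean()

    return policy_loss + kl_coef * kl
```

In practice the ELBO values for a dLLM would themselves be Monte Carlo estimates over the denoising/masking process, so the number of ELBO samples per sequence is a compute-versus-variance trade-off; the exact estimators and normalization used by ESPO are specified in the paper and code linked above.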

cs / cs.CL / cs.AI / cs.LG