GDRO: A Walkthrough of Group-Level Reward Post-Training for Diffusion Models

Published: 2026/1/5 11:47:18

The ultimate gal AI has arrived! ✨ Today I'm breaking down "GDRO" for you! Ready? Let's go!

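Title & Super-Short Summary
GDRO rocks! It's a technique that post-trains diffusion models (the things that make images) with group-level rewards, so the training gets way faster and the reward scores go up! 🌟
Gal-Style Sparkle Points ✨
● Training is fully offline (apparently the offline part is the key): no sampling fresh images during training, so it's fast and cheap! 💖
● It also keeps an eye on reward hacking (when the model games the reward score without the images actually getting better) 👌
● Text-image alignment (the image actually matching the prompt) gets a big boost! 🤩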
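Detailed Breakdown
- Background: Image-generating AI (the kind that makes images from text) is amazing, right? But we want it even better! These systems use a technique called diffusion models, and to boost performance, online RL (reinforcement learning) borrowed from LLMs is sometimes used. But that comes with problems 😭: sampling images during training eats up most of the training time, it depends on stochastic samplers, and it's prone to reward hacking.
- Method: GDRO uses a slightly advanced trick called "group-level rewards." The reward signal is taken at the level of a group of images instead of judging each image all on its own. Training is fully offline, so it skips the costly image rollout sampling, and it doesn't depend on the sampler (the mechanism that generates the images) either, so no ODE-to-SDE approximation is needed to get stochasticity!
- Results: GDRO raised the diffusion model's reward scores effectively and efficiently on the OCR and GenEval tasks! 🎉 Mismatches with the text (the prompt) went down too, and it stayed stable and robust against reward hacking. Total win! ✨
- Why It Matters (the seriously exciting part ♡): This could be a big hit in creative fields and e-commerce (online shopping)! Ads, product photos, and all sorts of things might get a quality bump, which could open up new business opportunities! 😍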
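To make the "group-level" idea from the Method bullet a bit more concrete, here's a minimal sketch of one common way group rewards are used in LLM-style RL: each reward is normalized against the mean and standard deviation of its own group. This is an assumption for illustration only, not GDRO's actual objective (this summary doesn't spell one out); the function name `group_relative_advantages` and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (e.g. images from the same prompt).

    Assumption: a standard group-normalized advantage, shown for
    illustration; it is not taken from the GDRO paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four images generated from one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} -> advantage={a:+.3f}")
```

Images that score above their group's average get a positive value and the rest get a negative one, so what matters is how an image compares to its siblings, not its raw score.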
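Real-World Use Ideas 💡
- How about an app that lets you whip up an SNS icon just from a text prompt? 😎
- A service where AI automatically creates images to match your blog posts would be nice too! 💻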


GDRO: Group-level Reward Post-training Suitable for Diffusion Models

Yiyang Wang / Xi Chen / Xiaogang Xu / Yu Liu / Hengshuang Zhao

Recent advancements adopt online reinforcement learning (RL) from LLMs to text-to-image rectified flow diffusion models for reward alignment. The use of group-level rewards successfully aligns the model with the targeted reward. However, it faces challenges including low efficiency, dependency on stochastic samplers, and reward hacking. The problem is that rectified flow models are fundamentally different from LLMs: 1) For efficiency, online image sampling takes much more time and dominates the time of training. 2) For stochasticity, rectified flow is deterministic once the initial noise is fixed. Aiming at these problems and inspired by the effects of group-level rewards from LLMs, we design Group-level Direct Reward Optimization (GDRO). GDRO is a new post-training paradigm for group-level reward alignment that combines the characteristics of rectified flow models. Through rigorous theoretical analysis, we point out that GDRO supports full offline training that saves the large time cost for image rollout sampling. Also, it is diffusion-sampler-independent, which eliminates the need for the ODE-to-SDE approximation to obtain stochasticity. We also empirically study the reward hacking trap that may mislead the evaluation, and involve this factor in the evaluation using a corrected score that not only considers the original evaluation reward but also the trend of reward hacking. Extensive experiments demonstrate that GDRO effectively and efficiently improves the reward score of the diffusion model through group-wise offline optimization across the OCR and GenEval tasks, while demonstrating strong stability and robustness in mitigating reward hacking.
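The remark that rectified flow is deterministic once the initial noise is fixed can be illustrated with a toy sketch. The velocity field `toy_velocity`, the step count, and the function name `euler_ode_sample` below are all hypothetical stand-ins, not the paper's model or sampler; the only point is that a plain ODE sampler has no randomness beyond the initial noise.

```python
import random


def toy_velocity(x, t):
    # Hypothetical stand-in for a learned velocity field v(x, t).
    return [-xi * (1.0 - t) + 0.5 for xi in x]


def euler_ode_sample(seed, dim=4, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    rng = random.Random(seed)
    # The initial noise is the only source of randomness in the whole run.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = toy_velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


same_a = euler_ode_sample(seed=0)
same_b = euler_ode_sample(seed=0)  # same initial noise as same_a
other = euler_ode_sample(seed=1)   # different initial noise

print(same_a == same_b)  # identical trajectories from identical noise
print(same_a == other)
```

Because `same_a` and `same_b` share their initial noise, they come out identical. Getting varied rollouts from one starting point would require injecting noise during sampling, which is the role the abstract assigns to the ODE-to-SDE approximation in online methods.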

cs / cs.LG / cs.CV
