Blazing-Fast LLMs! How to Make AI the Strongest with Asynchronous TB (TBA) ✨

Published: 2025/12/3 18:56:50

Super summary: A new trick for making LLMs (large language models) smarter at blazing speed! It's called asynchronous TB (TBA), and it's amazing! 🚀

Gyaru-Style Sparkle Points ✨

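● A magic-like technique that makes LLM training blazing fast 💨!
● It's off-policy, so data processing is super efficient 💖
● You can expect a boost ⤴️ in AI performance too!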

Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Brian Bartoldson / Siddarth Venkatraman / James Diffenderfer / Moksh Jain / Tal Ben-Nun / Seanie Lee / Minsu Kim / Johan Obando-Ceron / Yoshua Bengio / Bhavya Kailkhura

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, on-policy algorithms used for post-training are not naturally robust to a diversified content of experience replay buffers, which asynchronous off-policy actors can efficiently populate in parallel to training. We propose efficiently learning on such off-policy data via Trajectory Balance with Asynchrony (TBA), an approach to asynchronous RL for LLMs that leverages the principled off-policy TB objective. On math, preference-tuning, and automated red-teaming tasks, we post-train models ranging from Pythia 410M to Qwen 2.5 7B, finding TBA offers speed and performance boosts over strong baselines like Online DPO and Dr. GRPO. Beyond TBA's performance benefits (high accuracy even as asynchrony grows) and speedups ($4\times$ or more), we show its reward- and recency-prioritizing sampling enable further gains as data generation is scaled. Our code is available at https://github.com/bbartoldson/TBA.
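The abstract names the trajectory balance (TB) objective without writing it out. As a rough illustration, here is a minimal sketch of the standard TB loss from the GFlowNet literature, specialized to left-to-right text generation, where the backward policy is deterministic and drops out. The function names, the scalar `log_z`, and the toy numbers are assumptions made for this example; this is not the code from the TBA repository, and the paper may use a different variant (for instance, a different way of estimating `log_z`).

```python
def trajectory_balance_loss(log_z, token_logprobs, log_reward):
    """Squared TB residual for one generated sequence.

    log_z          -- estimate of the log partition function (hypothetical scalar)
    token_logprobs -- log-probabilities the current policy assigns to each token
    log_reward     -- log of the reward for the full sequence
    """
    log_pf = sum(token_logprobs)  # log-probability of the whole sequence
    residual = log_z + log_pf - log_reward
    return residual ** 2


def batch_tb_loss(log_z, batch):
    """Mean TB loss over (token_logprobs, log_reward) pairs."""
    losses = [trajectory_balance_loss(log_z, lp, lr) for lp, lr in batch]
    return sum(losses) / len(losses)


# Toy replay-buffer batch. The sequences could have been sampled by any
# (possibly stale) actor: the loss only needs the current policy's
# log-probabilities, which is why the objective suits off-policy data.
replay_batch = [
    ([-0.5, -0.25, -0.25], -1.0),  # residual 0.0
    ([-1.0, -1.0], 0.0),           # residual -2.0
]
loss = batch_tb_loss(0.0, replay_batch)
print(loss)
```

In this sketch the loss reaches zero when the policy's sequence probability is proportional to the reward, regardless of which sampler produced the sequences in the batch.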

cs / cs.LG
