Super-quick summary: a new trick for making LLMs (Large Language Models) smarter! It makes training way more stable and boosts performance big time ✨
✨ Gal-style sparkle points ✨ ● Stops gradient fights! It keeps different responses from sending mixed signals about the same token! ● Runaway-proof! It keeps diversity while steering training away from weird directions 💖 ● Bye-bye KL divergence 👋! No reference model is needed during training, so it's a huge time-saver!
Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main issues with GRPO: (i) token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing the training process. To address these issues, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones, and which filters out completions whose entropy exceeds a provable threshold to prevent policy collapse. Unlike GRPO, GTPO does not rely on KL-divergence regularization, eliminating the need for a reference model during training, while still ensuring greater training stability and improved performance, as validated through multiple experiments on GSM8K, MATH, AIME 2024, AIME 2025, and AMC 2023.
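To make the mechanism in the abstract concrete, here is a minimal sketch in Python, not the authors' reference implementation. It assumes a GRPO-style setup (a group of completions sampled for one prompt, each with per-token log-probs, per-token entropies, and a scalar reward) and illustrates three ideas: group-relative advantages, skipping negative updates while amplifying positive ones on tokens shared by positively and negatively rewarded completions, and dropping completions whose entropy exceeds a threshold. The function name `gtpo_like_loss`, the parameters `pos_boost` and `entropy_threshold`, and the exact definition of "shared tokens" are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a GTPO-style surrogate loss, under the assumptions stated above.
import torch

def gtpo_like_loss(token_ids, logprobs, entropies, rewards,
                   pos_boost=2.0, entropy_threshold=2.5):
    """token_ids: list of 1-D LongTensors, one per completion in the group.
    logprobs / entropies: lists of 1-D FloatTensors aligned with token_ids.
    rewards: 1-D FloatTensor with one scalar reward per completion."""
    # Group-relative advantages, as in GRPO: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Entropy-based filtering: drop completions whose mean token entropy is too
    # high, mirroring the paper's safeguard against policy collapse.
    keep = [i for i, ent in enumerate(entropies)
            if ent.mean().item() <= entropy_threshold]
    if not keep:
        return torch.tensor(0.0, requires_grad=True)

    # Tokens that occur in both positively and negatively rewarded completions
    # would otherwise receive contradictory gradients ("conflict tokens").
    pos_tokens, neg_tokens = set(), set()
    for i in keep:
        toks = set(token_ids[i].tolist())
        if adv[i] > 0:
            pos_tokens |= toks
        elif adv[i] < 0:
            neg_tokens |= toks
    conflict = pos_tokens & neg_tokens

    total = torch.tensor(0.0)
    for i in keep:
        mask = torch.tensor([t in conflict for t in token_ids[i].tolist()])
        w = torch.ones_like(logprobs[i])
        if adv[i] < 0:
            w[mask] = 0.0        # skip negative updates on shared tokens
        else:
            w[mask] = pos_boost  # amplify positive updates on shared tokens
        # REINFORCE-style surrogate; note there is no KL term and no reference
        # model, consistent with GTPO dropping KL-divergence regularization.
        total = total - (w * adv[i] * logprobs[i]).sum()
    return total / len(keep)
```

The design point to notice is the last comment: because no KL penalty is computed, nothing in this loss requires a frozen reference policy, which is where the abstract's claim of reduced training cost comes from.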