Countering Reward Hacking in LLMs✨

Published: 2026/1/8 14:33:47

Title & Super-Short Summary: Countering Reward Hacking in LLMs✨

This research takes on the reward hacking problem in LLMs (large language models)! It proposes a new reward shaping method (PAR) to make LLMs more reliable💖

✨ Gyaru-Style Sparkle Points ✨

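● This research takes on reward hacking (scoring reward by exploiting flaws in the reward function instead of learning the intended behavior), a known weak spot of LLMs! So smart! 😎
● It develops a new reward shaping method, "Preference As Reward (PAR)"! That means RLHF training stays more stable💖
● With more reliable LLMs, a future where all kinds of services get better might be on the way! So exciting🥰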

Reward Shaping to Mitigate Reward Hacking in RLHF

Jiayi Fu / Xuandong Zhao / Chengyuan Yao / Heng Wang / Qi Han / Yanghua Xiao

Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. Moreover, PAR exhibits two critical variance-reduction properties that contribute to stabilizing the RLHF training process and effectively extending the tolerance window for early stopping. We evaluated PAR on the base model Gemma2-2B using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate of at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR.
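
The abstract does not spell out PAR's formula, so the following is only a minimal sketch under a stated assumption: that the "latent preference" is the Bradley-Terry probability that a response beats a reference response, i.e. the sigmoid of the difference between the two reward-model scores, averaged over however many reference rewards are supplied. The function name `preference_as_reward` and its arguments are hypothetical and are not taken from the linked repository.

```python
import math
from typing import Sequence


def _sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_as_reward(reward: float, reference_rewards: Sequence[float]) -> float:
    """Hypothetical helper (an assumption, not the authors' code).

    Maps a raw reward-model score to the mean Bradley-Terry preference
    of the response over one or more reference responses.
    """
    if not reference_rewards:
        raise ValueError("at least one reference reward is required")
    total = sum(_sigmoid(reward - ref) for ref in reference_rewards)
    return total / len(reference_rewards)


if __name__ == "__main__":
    # A response scored exactly like its single reference is preferred half the time.
    print(preference_as_reward(0.0, [0.0]))  # 0.5
    # Raw rewards far above the reference still yield a shaped reward of at most 1.
    print(preference_as_reward(50.0, [0.0]))
```

Under this assumption the shaped reward always lies between 0 and 1, which lines up with design principle (1). Its slope is steepest when the policy's reward is close to the reference reward and flattens as the gap widens, which is one way to read principle (2): fast growth early in training, then gradual convergence.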

cs / cs.LG / cs.AI / cs.CL
