Holistic Utility Preference Learning for Listwise Alignment

Published: 2025/12/16 14:27:38

A "cool" way to tune LLMs! Rankings skyrocket with DRPO 🚀

Ultra-short summary: A new technique for making LLM answers match human preferences! It tunes the model on rankings so it gets smarter and safer ☆

Gal-style sparkle points ✨

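● It learns human "preferences" in ranking form, so the answers apparently come out super natural!
● It learns more efficiently than existing techniques, so the cost-performance is unbeatable too 💖
● The quality of AI-generated content goes up, so it's exciting to use!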

Detailed explanation


Holistic Utility Preference Learning for Listwise Alignment

Jiacong Zhou / Xianyun Wang / Min Zhang / Jun Yu

Aligning large language models with human preferences is essential for improving interaction quality and safety by ensuring outputs better reflect human values. A promising strategy involves Reinforcement Learning from Human Feedback (RLHF), starting with collecting and ranking responses generated by a supervised fine-tuning model to refine alignment. Existing methods such as Direct Preference Optimization (DPO) focus on pairwise comparisons, categorizing responses into preferred and less preferred pairs and optimizing pairwise margins. However, this pairwise approach cannot capture the holistic ranking relationships among multiple responses or effectively leverage the rich preference information available in list-wise comparisons. To address this challenge, this paper introduces Direct Ranking Preference Optimization (DRPO), a novel method that views human preference alignment as a Learning-to-Rank (LTR) task. Unlike pairwise methods, DRPO optimizes the preference ranking of entire response lists by computing holistic utility scores through NDCG, a standard LTR metric. To enable end-to-end optimization with the non-differentiable NDCG, we propose diffNDCG loss, a differentiable approximation facilitated by a sorting network. Furthermore, we introduce a novel margin-based Adaptive Rank Policy Score to enhance the discriminative quality of generated responses. Extensive experiments have shown that DRPO outperforms existing methods, enhancing the quality of the generated responses.
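
To make the role of NDCG more concrete, here is a minimal Python sketch of the standard (non-differentiable) metric applied to a list of responses for one prompt. It is not the paper's diffNDCG loss or its sorting network. The exponential gain `2**rel - 1`, the `log2` discount, the function names, and the example scores and relevance labels are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance labels given in ranked order.

    Assumes the common exponential gain (2**rel - 1) and log2 position discount.
    """
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def ndcg(scores, relevances):
    """NDCG of the ranking obtained by sorting responses by score, descending."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg(sorted(relevances, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical data: four responses to one prompt, with human relevance labels.
human_relevance = [3, 1, 2, 0]

perfect_scores = [0.9, 0.2, 0.5, 0.1]  # model order matches the human order
nudged_scores = [0.8, 0.3, 0.6, 0.0]   # different values, same induced order
swapped_scores = [0.5, 0.2, 0.9, 0.1]  # best two responses are swapped

ndcg_perfect = ndcg(perfect_scores, human_relevance)
ndcg_nudged = ndcg(nudged_scores, human_relevance)
ndcg_swapped = ndcg(swapped_scores, human_relevance)

print(f"perfect order: {ndcg_perfect:.4f}")
print(f"nudged scores: {ndcg_nudged:.4f}")
print(f"swapped top-2: {ndcg_swapped:.4f}")
```

In this sketch, nudging the scores without changing their order leaves NDCG unchanged, while swapping two responses makes it jump. That piecewise-constant behavior comes from the hard sort and gives no useful gradient, which is the obstacle the abstract says diffNDCG is designed to work around.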

cs / cs.IR / cs.AI / cs.CL / cs.LG
