The Noise Problem in VLM Rewards, and How to Fix It

Published: 2025/11/8 4:37:05

Title & ultra-short summary: Solving a VLM problem! The magic that makes AI agents get smart at blazing speed 🧙‍♀️

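● Gyaru-style sparkle point ✨ #1: It tackles the noise problem in VLM (Vision-Language Model) rewards!
● Gyaru-style sparkle point ✨ #2: A new reward function called BiMI is apparently amazing! It cuts down on wrongly given rewards and boosts the AI's learning efficiency 😳
● Gyaru-style sparkle point ✨ #3: Robots and AI assistants get smarter and way more useful to us 💖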

Detailed explanation:

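Background: VLM rewards are basically a treat system that helps an AI understand what it's told in words and act on it! But when you ask it to do something complicated or something that takes a long time, things apparently don't always go well 🥺 That seems to be because noise (think: wrong information) gets mixed into the VLM rewards!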

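Method: So the authors developed a new secret weapon, the BiMI reward function! BiMI uses a binary signal (something like 0 or 1) and mutual information (how related two pieces of information are) to cut down the noise! That way, the AI gets praised only when it actually does the right thing 💖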
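To make the "binary signal plus mutual information" idea a bit more concrete, here is a minimal Python sketch. It is not the paper's actual BiMI formula, which this summary does not spell out. The function names, the toy embeddings, the 0.9 threshold, and the `-log(base_rate)` weighting are all assumptions made up for illustration. It only shows the general flavor: a raw cosine-similarity reward gives partial credit to an off-task trajectory, a thresholded binary reward does not, and a mutual-information-style weight shrinks a reward that fires for almost everything.

```python
import math


def cosine_similarity(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def binary_reward(similarity, threshold=0.9):
    """Hypothetical binary signal: 1 only above a strict threshold, else 0.

    The 0.9 threshold is an illustrative assumption, not a value from the paper.
    """
    return 1.0 if similarity >= threshold else 0.0


def mi_adjusted_reward(fired, base_rate, eps=1e-6):
    """Hypothetical mutual-information-style weighting (an assumption).

    base_rate is how often the binary reward fires overall. A reward that
    fires almost all the time (base_rate near 1) carries little information,
    so its weight -log(base_rate) shrinks toward 0.
    """
    if not fired:
        return 0.0
    return max(0.0, -math.log(max(base_rate, eps)))


# Toy 2-D "embeddings", made up for this example.
instruction = [1.0, 0.0]
on_task = [0.98, 0.05]   # trajectory that matches the instruction
off_task = [0.7, 0.7]    # unintended trajectory that only partly overlaps

for name, trajectory in [("on_task", on_task), ("off_task", off_task)]:
    sim = cosine_similarity(instruction, trajectory)
    fired = binary_reward(sim) == 1.0
    print(name, round(sim, 3), binary_reward(sim),
          round(mi_adjusted_reward(fired, base_rate=0.2), 3))
```

In this toy setup the raw cosine score still hands the off-task trajectory roughly 0.71 of a reward, which is the kind of false positive the abstract warns about, while the thresholded version gives it nothing.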


The Dark Side of Rich Rewards: Understanding and Mitigating Noise in VLM Rewards

Sukai Huang / Shu-Wei Liu / Nir Lipovetzky / Trevor Cohn

While Vision-Language Models (VLMs) are increasingly used to generate reward signals for training embodied agents to follow instructions, our research reveals that agents guided by VLM rewards often underperform compared to those employing only intrinsic (exploration-driven) rewards, contradicting expectations set by recent work. We hypothesize that false positive rewards -- instances where unintended trajectories are incorrectly rewarded -- are more detrimental than false negatives. Our analysis confirms this hypothesis, revealing that the widely used cosine similarity metric is prone to false positive reward estimates. To address this, we introduce BiMI (Binary Mutual Information), a novel reward function designed to mitigate noise. BiMI significantly enhances learning efficiency across diverse and challenging embodied navigation environments. Our findings offer a nuanced understanding of how different types of reward noise impact agent learning and highlight the importance of addressing multimodal reward signal noise when training embodied agents.

cs / cs.LG / cs.RO
