Title & Super Summary: Stop the video LLM's lies (hallucinations)! ✨
● Gal-style sparkle point ✨1: They're aiming for an AI that actually understands a video's content and describes it properly 💖
● Gal-style sparkle point ✨2: The key is getting it to correctly recognize the things (objects) and the movements (actions) in the video! 😎
● Gal-style sparkle point ✨3: That could come in handy in fields where accuracy really matters, like healthcare 🏥 and autonomous driving 🚗! 😳
Recent advances in multimodal LLMs (MLLMs) have demonstrated a remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior work has explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that enables object and action faithfulness by suppressing spurious correlations and enforcing emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations latent in the MLLM and transform the original captions into contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment that matches regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.
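To make the self-augmentation idea concrete, here is a minimal Python sketch of turning a caption into a contrasted negative by corrupting an object or action word. Everything here is an illustrative assumption: the confusion sets, the function name `self_augment_negative`, and the fixed word-swap strategy. The paper's actual scheme identifies hallucinations that lie in the MLLM itself rather than drawing from hand-written lists.

```python
import random

# Hypothetical confusion sets; SANTA derives its negatives from the MLLM's own
# hallucination tendencies, which we only approximate here with fixed lists.
OBJECT_SWAPS = {"dog": ["cat", "wolf"], "guitar": ["violin", "ukulele"]}
ACTION_SWAPS = {"running": ["walking", "jumping"], "strumming": ["plucking"]}

def self_augment_negative(caption: str) -> str:
    """Turn an (assumed correct) caption into a contrasted negative by
    replacing one object or action word with a plausible hallucination."""
    tokens = caption.split()
    candidates = [
        (i, t) for i, t in enumerate(tokens)
        if t in OBJECT_SWAPS or t in ACTION_SWAPS
    ]
    if not candidates:
        return caption  # nothing to corrupt; the caller should skip this sample
    i, tok = random.choice(candidates)
    swaps = OBJECT_SWAPS.get(tok) or ACTION_SWAPS[tok]
    tokens[i] = random.choice(swaps)
    return " ".join(tokens)

print(self_augment_negative("a dog is running across the field"))
# e.g. "a cat is running across the field" -> a hard negative for training
```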
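The tracklet-phrase contrastive alignment can likewise be illustrated with a standard symmetric InfoNCE-style objective. This is a sketch under the assumption that tracklet and phrase features have already been encoded to a shared dimension; the function name, tensor shapes, and temperature value are placeholders, not SANTA's exact formulation.

```python
import torch
import torch.nn.functional as F

def tracklet_phrase_contrastive_loss(
    tracklet_feats: torch.Tensor,   # (N, D) pooled features of object tracklets
    phrase_feats: torch.Tensor,     # (N, D) embeddings of the matching phrases
    temperature: float = 0.07,      # illustrative value, not from the paper
) -> torch.Tensor:
    """InfoNCE-style alignment: the i-th tracklet should match the i-th
    phrase and repel all others, including phrases from negative captions."""
    t = F.normalize(tracklet_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    logits = t @ p.T / temperature            # (N, N) similarity matrix
    targets = torch.arange(len(t), device=t.device)
    # Symmetric loss over tracklet->phrase and phrase->tracklet directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# Toy usage with random features standing in for real tracklet/phrase encoders.
loss = tracklet_phrase_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```

The symmetric cross-entropy pulls each tracklet toward its paired phrase and away from all mismatched ones, which is how contrasted negatives like those from the self-augmentation step would exert training pressure against hallucinated objects and actions.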