Opening Up the Future of Video Understanding! Video Deep Research Benchmark (VideoDR)

Published: 2026/1/11 15:07:37

Title & super-short summary: A new standard for video research! Put videos to smart use with VideoDR 💖

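Gyaru-style sparkle points ✨
● Video meets the web! Feels like information search is about to get blazing fast ✨
● An AI that understands what's in a video and even runs the web search for you, how amazing is that? 😍
● Looks like new business opportunities are lying around everywhere, right? 💎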

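Detailed explanation

● Background: Video is the star of information gathering these days! But understanding everything in a video and then hunting down related information on the web is a real pain, right? 😩 To solve that problem, this research aims to develop AI that combines video with web information!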

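● Method: They measure AI performance with a benchmark (an evaluation standard) called VideoDR! The AI has to watch a video, pick up clues from it, search the web, and pull all kinds of information together into an answer. That's some seriously advanced stuff to ask of an AI! Amazing! 🤯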

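● Results: Because the AI understands the video and even handles the web search, you can get to the information you want right away! Video search engines, information curation services, all sorts of things look possible, right? So exciting! 🥰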


Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

Chengwen Liu / Xiaomin Yu / Zhuoyue Chang / Zhe Huang / Shuo Zhang / Heng Lian / Kunyi Wang / Rui Xu / Sen Hu / Jianheng Hou / Hao Peng / Chengwei Qin / Xiaobin Hu / Hong Peng / Ronghao Chen / Huacan Wang

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.
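The three steps named in the abstract (cross-frame clue extraction, iterative retrieval, and multi-hop reasoning over joint video-web evidence) can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the benchmark's actual Workflow or Agentic implementation: `TOY_WEB`, `ResearchState`, `extract_anchors`, `search`, `next_query`, and `deep_research` are all hypothetical names, frame captions stand in for real visual understanding, and a small in-memory dictionary stands in for the open web.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical stand-in for the open web: query string -> text snippet.
TOY_WEB = {
    "red tower": "The red tower is the Tokyo Tower.",
    "Tokyo Tower": "Tokyo Tower was completed in 1958.",
}


@dataclass
class ResearchState:
    anchors: List[str]                                   # clues taken from the video
    evidence: List[str] = field(default_factory=list)    # snippets found on the "web"


def extract_anchors(frame_captions: List[str]) -> List[str]:
    """Toy cross-frame clue extraction: de-duplicate captions, keeping order."""
    anchors: List[str] = []
    for caption in frame_captions:
        if caption not in anchors:
            anchors.append(caption)
    return anchors


def search(query: str) -> Optional[str]:
    """Toy retrieval: exact-match lookup in the in-memory web."""
    return TOY_WEB.get(query)


def next_query(snippet: str, visited: Set[str]) -> Optional[str]:
    """Toy multi-hop step: follow the first unvisited entity the snippet mentions."""
    for key in TOY_WEB:
        if key in snippet and key not in visited:
            return key
    return None


def deep_research(frame_captions: List[str], max_hops: int = 3) -> ResearchState:
    """Start from video anchors, then retrieve iteratively for up to max_hops queries."""
    state = ResearchState(anchors=extract_anchors(frame_captions))
    visited: Set[str] = set()
    frontier = list(state.anchors)
    for _ in range(max_hops):
        if not frontier:
            break
        query = frontier.pop(0)
        if query in visited:
            continue
        visited.add(query)
        snippet = search(query)
        if snippet is None:
            continue  # this anchor led nowhere; try the next one
        state.evidence.append(snippet)
        follow_up = next_query(snippet, visited)
        if follow_up is not None:
            frontier.append(follow_up)
    return state


result = deep_research(["red tower", "red tower", "crowd"])
print(result.anchors)
print(result.evidence)
```

In this toy version the anchors are stored once in `ResearchState` and never revisited, which is the simplest possible way to keep the initial video clues fixed while the retrieval chain grows; a real system would need a far richer mechanism to avoid the goal drift the abstract describes.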

cs / cs.CV / cs.AI
