Published: 2026/1/7 6:16:41

Video search just leveled up big time!? ⁉️ Get a head start on the video life of the future with V-Agent! 🎉

Super Summary: Meet "V-Agent," an AI that smartly understands what's actually inside videos! Searching is about to get insanely convenient ✨

Gyaru-Style Sparkle Points

● It actually understands video content! Not just text: it analyzes the visuals and the audio too 😳
● Search is seriously evolving! Ask questions, compare videos, and find what you want like you're chatting with a friend 💖
● A revolution for the IT industry 💥 YouTube, in-store videos, all kinds of videos are about to get way more fun 🎵

Detailed Explanation


V-Agent: An Interactive Video Search System Using Vision-Language Models

SunYoung Park / Jong-Hyeon Lee / Youngjune Kim / Daegyu Sung / Younghyun Yu / Young-rok Cha / Jeongho Ju

We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications. The retrieval model and demo videos are available at https://huggingface.co/NCSOFT/multimodal-embedding.
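To make the abstract's architecture concrete, here is a minimal sketch of the flow it describes: videos embedded into a shared space, a search agent doing similarity retrieval (where a re-ranking stage would slot in), a chat agent that phrases the answer, and a routing agent dispatching between them. All class names, the toy `embed` function, and the corpus are illustrative assumptions, not the authors' actual API or model.

```python
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Toy stand-in for the VLM embedding model (hypothetical).
    In V-Agent, a fine-tuned vision-language model embeds video frames,
    ASR transcripts, and user queries into one shared multimodal space."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

class SearchAgent:
    """Retrieves videos by cosine similarity in the shared embedding space."""
    def __init__(self, corpus: dict[str, str]):
        self.ids = list(corpus)
        # Each video is represented by an embedding of its frame/ASR content.
        self.matrix = np.stack([embed(corpus[i]) for i in self.ids])

    def search(self, query: str, k: int = 2) -> list[str]:
        scores = self.matrix @ embed(query)   # cosine similarity (unit vectors)
        top = np.argsort(scores)[::-1][:k]    # first-stage retrieval
        return [self.ids[i] for i in top]     # a re-ranking module would refine this

class ChatAgent:
    """Turns retrieval results into a conversational reply."""
    def answer(self, query: str, hits: list[str]) -> str:
        return f"About '{query}', I found: {', '.join(hits)}"

class RoutingAgent:
    """Dispatches each user turn to the appropriate agent by intent."""
    def __init__(self, search: SearchAgent, chat: ChatAgent):
        self.search, self.chat = search, chat

    def handle(self, query: str) -> str:
        # A real router would classify intent; here every turn is a search.
        hits = self.search.search(query)
        return self.chat.answer(query, hits)

corpus = {"vid1": "cooking pasta tutorial", "vid2": "soccer match highlights"}
agent = RoutingAgent(SearchAgent(corpus), ChatAgent())
print(agent.handle("pasta recipe"))
```

The toy `embed` obviously carries no semantics; the point is the division of labor among the three agents and where the re-ranking stage sits in the search path.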

cs / cs.CV / cs.AI / cs.IR / cs.MA