Published: 2026/1/2 6:04:58

Video search is evolving at warp speed! Let's take over the video world with GranAlign ☆

Super-short summary: GranAlign seriously levels up zero-shot video retrieval. Why not give it a try?

✨ Gal-Style Sparkle Points ✨
● It can retrieve video moments zero-shot (no training!), isn't that amazing? 😳
● The AI smooths out the mismatch in detail between your text query and the video! 💕
● Retrieval accuracy shoots way up, feels like a business opportunity is on the way...! 😎

🌟 Detailed Explanation 🌟
Background: Video search is hard, right? 💦 Even when you search with text, landing on the exact scene you want is next to impossible, isn't it? 🤔 But apparently GranAlign makes that happen! And the fact that it needs no training data at all is seriously impressive ✨

Method: GranAlign rewrites the text query (the words you search with) into two variants, a "simplified" one and a "detailed" one. On top of that, it apparently also generates captions (descriptions) of the video content tailored to the query! Sounds like some seriously fancy tech, but the results are there, so it's all good! 👍
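
To make the "simplified" vs. "detailed" rewriting concrete, here is a minimal Python sketch of what granularity-based query rewriting could look like. This is not the authors' code: the function names (`rewrite_with_llm`, `build_multi_level_queries`) are hypothetical, and the toy rule-based rewrites only stand in for the LLM prompting a real system would presumably use.

```python
# Hypothetical sketch of granularity-based query rewriting (not the authors' code).
# `rewrite_with_llm` is a placeholder for an LLM call prompted to simplify or
# elaborate the query; here it is a toy rule-based stand-in.

def rewrite_with_llm(query: str, granularity: str) -> str:
    """Return a coarser or finer rephrasing of the query (toy placeholder)."""
    if granularity == "coarse":
        # Keep only the head of the query as a crude "simplified" version.
        return " ".join(query.split()[:4])
    elif granularity == "fine":
        # Append a generic elaboration to mimic a "detailed" version.
        return query + ", showing the full action from start to finish"
    raise ValueError(f"unknown granularity: {granularity}")


def build_multi_level_queries(query: str) -> dict[str, str]:
    """Expand one text query into coarse / original / fine variants."""
    return {
        "coarse": rewrite_with_llm(query, "coarse"),
        "original": query,
        "fine": rewrite_with_llm(query, "fine"),
    }


if __name__ == "__main__":
    q = "a man in a red jacket jumps over a fence in the backyard"
    for level, text in build_multi_level_queries(q).items():
        print(f"{level:>8}: {text}")
```

Running it just prints the coarse, original, and fine variants of the example query, which is the kind of multi-level query set the alignment stage would take as input.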


GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval

Mingyu Jeon / Sunjae Yoon / Jonghee Kim / Junyeoung Kim

Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and fine semantic representations. Our approach introduces two complementary techniques: granularity-based query rewriting to generate varied semantic granularities, and query-aware caption generation to embed query intent into video content. By pairing multi-level queries with both query-agnostic and query-aware captions, we effectively resolve semantic mismatches. As a result, our method sets a new state-of-the-art across all three major benchmarks (QVHighlights, Charades-STA, ActivityNet-Captions), with a notable 3.23% mAP@avg improvement on the challenging QVHighlights dataset.
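
As a rough illustration of the pairing step described in the abstract (multi-level queries matched against both query-agnostic and query-aware captions of each candidate moment), the sketch below scores every query-caption pair and keeps the best-matching moment. Every name here is hypothetical, and the bag-of-words cosine is only a placeholder for the pretrained joint-embedding similarity (e.g., a CLIP-style score) that an actual zero-shot pipeline would compute.

```python
# Hedged sketch of the matching step (hypothetical names, not the authors' code).
from collections import Counter
from math import sqrt


def similarity(a: str, b: str) -> float:
    """Toy bag-of-words cosine; placeholder for a pretrained embedding score."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def score_moment(queries: dict[str, str], captions: list[str]) -> float:
    """Best similarity between any query variant and any caption of a moment."""
    scores = [similarity(q, c) for q in queries.values() for c in captions]
    return max(scores) if scores else 0.0


def retrieve(queries: dict[str, str], moments: list[dict]) -> dict:
    """Return the candidate moment whose captions best match the query set."""
    return max(moments, key=lambda m: score_moment(queries, m["captions"]))


if __name__ == "__main__":
    queries = {
        "coarse": "man jumps fence",
        "original": "a man in a red jacket jumps over a fence",
        "fine": "a man in a red jacket runs up and jumps over a wooden fence",
    }
    # In the paper's framing, each moment's caption list would mix
    # query-agnostic and query-aware captions of that video segment.
    moments = [
        {"span": (0.0, 5.0), "captions": ["a dog sleeps on the porch"]},
        {"span": (5.0, 12.0), "captions": ["a man jumps over a wooden fence",
                                           "a person in a red jacket outdoors"]},
    ]
    best = retrieve(queries, moments)
    print("retrieved span:", best["span"])
```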

cs / cs.CV