Ultra-short summary: a technique that lets you pinpoint-search videos from a text query, and it's amazing! 🎉
🌟 Sparkly highlight points ✨
● Fixes the problem of search results drifting because of the modality gap (the difference in how text and video are represented)! 😳
● Robust to noise (incorrect supervision), so the retrieval works on all kinds of videos! 😎
● Just upgrade an existing retrieval model with it and accuracy shoots up ⤴️
Detailed explanation
● Background: Text-video retrieval (TVR) means finding videos with a text query 💻✨. But words and videos are expressed in very different ways, so retrieval sometimes just didn't line up 😢.
● Method: The authors developed an awesome technique called the GARE framework! 💡 It adds a component that adjusts the "mismatch" between text and video, which also makes retrieval more robust to noise!
Recent progress in text-video retrieval has been largely driven by contrastive learning. However, existing methods often overlook the effect of the modality gap, which causes anchor representations to undergo in-place optimization (i.e., optimization tension) that limits their alignment capacity. Moreover, noisy hard negatives further distort the semantics of anchors. To address these issues, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment $\Delta_{ij}$ between text $t_i$ and video $v_j$, redistributing gradients to relieve optimization tension and absorb noise. We derive $\Delta_{ij}$ via a multivariate first-order Taylor expansion of the InfoNCE loss under a trust-region constraint, showing that it guides updates along locally consistent descent directions. A lightweight neural module conditioned on the semantic gap couples increments across batches for structure-aware correction. Furthermore, we regularize $\Delta$ through a variational information bottleneck with relaxed compression, enhancing stability and semantic consistency. Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
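To make the abstract's core idea more concrete, here is a minimal, hypothetical PyTorch sketch of a gap-aware correction layered on top of a standard text-video InfoNCE loss. It is not the authors' implementation (see the linked GitHub repository for that): the module name `GapModule`, conditioning on the raw difference between text and video embeddings, and the KL weight `beta` are illustrative assumptions. In GARE the increment is derived from a first-order Taylor expansion under a trust-region constraint, which to first order points the bounded increment along the negative gradient of the InfoNCE loss; in this sketch that correction is simply learned by a small network and regularized with a VIB-style KL term rather than derived in closed form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GapModule(nn.Module):
    """Hypothetical lightweight module: predicts a pair-specific increment
    Delta_ij (as a mean/log-variance pair, for a VIB-style regularizer)
    from the semantic gap between a text and a video embedding."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, 2 * dim),  # -> (mu, logvar)
        )

    def forward(self, text: torch.Tensor, video: torch.Tensor):
        # text: (B, D), video: (B, D) -> pairwise gaps: (B, B, D)
        gap = text.unsqueeze(1) - video.unsqueeze(0)
        mu, logvar = self.net(gap).chunk(2, dim=-1)
        # Reparameterized sample of Delta_ij (stochastic during training).
        delta = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return delta, mu, logvar


def gap_aware_infonce(text, video, gap_module, tau=0.05, beta=1e-3):
    """InfoNCE over a batch of matched (text_i, video_i) pairs, with a
    learnable increment Delta_ij added on the video side of each pair.
    The increment absorbs part of the gradient that would otherwise push
    the anchor embeddings directly (the 'optimization tension' from the
    abstract); the KL term is a relaxed information-bottleneck penalty."""
    text = F.normalize(text, dim=-1)
    video = F.normalize(video, dim=-1)

    delta, mu, logvar = gap_module(text, video)           # (B, B, D)
    corrected_video = video.unsqueeze(0) + delta          # (B, B, D)
    # Similarity of text_i against (video_j + Delta_ij).
    logits = (text.unsqueeze(1) * corrected_video).sum(-1) / tau  # (B, B)

    targets = torch.arange(text.size(0), device=text.device)
    loss_t2v = F.cross_entropy(logits, targets)
    loss_v2t = F.cross_entropy(logits.t(), targets)

    # VIB-style KL( q(Delta | gap) || N(0, I) ), averaged over pairs.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(-1).mean()
    return 0.5 * (loss_t2v + loss_v2t) + beta * kl


if __name__ == "__main__":
    B, D = 8, 256
    gap_module = GapModule(D)
    text_emb, video_emb = torch.randn(B, D), torch.randn(B, D)
    loss = gap_aware_infonce(text_emb, video_emb, gap_module)
    loss.backward()
    print(float(loss))
```

The key design point the sketch tries to capture is that the gradient of the contrastive loss flows partly into `GapModule` via Δ_ij instead of entirely into the text and video encoders, which is how the abstract's "redistributing gradients to relieve optimization tension" reads; the exact form of the increment, its coupling across the batch, and the relaxed compression schedule in GARE differ from this simplified version.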