1. Images and text, the ultimate combo! Blazing-fast search with SuperCLIP!
2. Gal-style sparkle points ✨ ● It breaks captions down word by word and looks closely at how each word relates to the image! ● Search and classification get way more accurate than with existing AI! ● It can be applied to all kinds of services, so the IT industry is about to get hyped 💖
3. Detailed explanation ● Background CLIP, the tech that connects images and text, is amazing, but it wasn't really looking at the fine-grained details of the text 😢 Enter SuperCLIP, built to push performance even higher!
● Method It splits the text into individual words and checks how each one relates to the image 👀✨ This lets it analyze things at a much finer level!
Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and this issue becomes even more pronounced when dealing with long and detailed captions. This stems from CLIP's training objective, which optimizes only global image-text similarity and overlooks token-level supervision, limiting its ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment, with just a 0.077% increase in total FLOPs and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold regardless of whether the model is trained on original web data or rich re-captioned data, demonstrating SuperCLIP's ability to recover textual supervision in both cases. Furthermore, SuperCLIP alleviates CLIP's small-batch performance drop through classification-based supervision that avoids reliance on large batch sizes. Code and models will be made open source.
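To make the idea concrete, here is a minimal NumPy sketch of combining CLIP's contrastive objective with a token-level classification loss, in the spirit the abstract describes. It is a toy illustration under stated assumptions, not the paper's implementation: the encoders are replaced by random embeddings, the classification target is assumed to be a multi-hot bag-of-words vector over a toy vocabulary, and the names (`W`, the temperature `0.07`, the equal loss weighting) are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, V = 4, 8, 32  # batch size, embedding dim, toy vocabulary size

# Stand-ins for encoder outputs (in practice: CLIP image / text encoders).
img_emb = rng.normal(size=(B, D))
txt_emb = rng.normal(size=(B, D))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# --- Standard CLIP contrastive (InfoNCE) loss on global embeddings ---
logits = normalize(img_emb) @ normalize(txt_emb).T / 0.07  # temperature: assumed
labels = np.arange(B)  # matching pairs lie on the diagonal
contrastive_loss = 0.5 * (-log_softmax(logits)[labels, labels].mean()
                          - log_softmax(logits.T)[labels, labels].mean())

# --- Token-level classification head (the SuperCLIP-style addition) ---
# A single linear layer maps each image embedding to vocabulary logits;
# the assumed target is a multi-hot vector of the caption's token ids.
W = rng.normal(size=(D, V)) * 0.01      # the only new parameters
token_logits = img_emb @ W              # (B, V)
targets = np.zeros((B, V))
for i in range(B):                      # hypothetical token ids per caption
    targets[i, rng.choice(V, size=5, replace=False)] = 1.0

# Multi-label binary cross-entropy over the vocabulary.
p = 1.0 / (1.0 + np.exp(-token_logits))
cls_loss = -(targets * np.log(p + 1e-9)
             + (1 - targets) * np.log(1 - p + 1e-9)).mean()

# Equal weighting of the two terms is an assumption for this sketch.
total_loss = contrastive_loss + cls_loss
print(float(contrastive_loss), float(cls_loss), float(total_loss))
```

The sketch also shows why the overhead is tiny: the only added parameters are the single `D × V` linear layer `W`, and the classification term is computed per sample, so (unlike the contrastive term) it does not depend on having many in-batch negatives.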