VLM spatial awareness, leveled way up! Isn't VST kind of the best? 🌟 (Ultra-short summary: a revolution in spatial-awareness AI)

Published: 2025/11/7 18:59:16

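Gal's must-see points!
- A technique that seriously boosts the spatial awareness of VLMs (AI that understands images and text) ✨
- Useful in all kinds of fields, like robotics and self-driving cars!
- AI that understands space the way humans do is just around the corner 💖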
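Detailed breakdown, here we go!
- Background: VLMs are impressive, but spatial awareness was their weak spot. They struggled to grasp their surroundings the way humans do 🤔
- Method: A framework called "Visual Spatial Tuning (VST)" fixes that! It trains on large-scale data (the 4.1-million-sample VST-P set for perception and the 135K-sample VST-R set for reasoning), with supervised fine-tuning first and reinforcement learning after, so spatial reasoning goes up too ⤴️
- Results: VLM spatial awareness improved dramatically, with state-of-the-art scores of 34.8% on MMSI-Bench and 61.2% on VSIBench! Robots could get smarter and self-driving could get safer 🚗💨
- Significance (the seriously cool ♡ point): Robots, self-driving, AR/VR... a future is coming where AI with human-like spatial awareness shows up in all kinds of fields! The IT industry is super hot right now 🔥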
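Ideas you can use in real life! 💡
- Smart home appliances: recognize a room's layout and suggest the best way to use each appliance!
- AR shopping: check in AR how furniture would look once it's placed in your room!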
For those who want to know more 🔍
- VLM (Vision-Language Model)
- Spatial awareness
- Fine-tuning

Visual Spatial Tuning

Rui Yang / Ziyu Zhu / Yanwei Li / Jingjia Huang / Shen Yan / Siyuan Zhou / Zhe Liu / Xiangtai Li / Shuangye Li / Wenqian Wang / Yi Lin / Hengshuang Zhao

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including $34.8\%$ on MMSI-Bench and $61.2\%$ on VSIBench. It turns out that Vision-Language-Action models can be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.

cs / cs.CV
