Published: 2025/12/17 7:51:36

The ultimate gyaru AI explains! EagleVision sends spatial intelligence through the roof! ✨

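Title & super-short summary
EagleVision: a framework that makes spatial intelligence smarter!
Gyaru-style sparkle points ✨ ×3
● It boosts reasoning ability in 3D space! 😳
● It understands space with only a little information (few tokens)! Way too clever! 😎
● Robots and AR/VR (augmented and virtual reality) might level up even more!? 🚀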
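Detailed explanation
- Background: Spatial intelligence is the brainpower for understanding the world around you! 🤖✨ It's a total must-have for robots and self-driving cars! But existing approaches had some issues: weak spatial consistency, limited viewpoint diversity, and reasoning chains that can't be traced back to the views that support them 🤔
- Method: EagleVision is a dual-stage framework, so it gets smart in two steps! 😳 First comes macro perception: it picks a compact set of keyframes from a long video, within a fixed token budget, to grasp the whole scene. Next comes micro verification: it predicts poses on a BEV (Bird's-Eye-View) plane and pulls up the nearest real frames to check the details!
- Results: On VSI-Bench, EagleVision hit state-of-the-art performance among open-source vision-language models! 💖 So its reasoning in 3D space is accurate even with a tight token budget!
- Significance (the "this is wild ♡" point): With this tech, robots might get smarter and the AR/VR world might feel more real! 😍 Can't wait for the future!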
Real-world use ideas 💡 ×2
● Recreate your room in 3D and simulate where to put the furniture! 👗👠
● Help self-driving cars drive more safely! 🚗💨


EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence

Jiaxu Wan / Xu Wang / Mengwei Xie / Hang Zhang / Mu Xu / Yang Han / Hong Zhang / Ding Yuan / Yifan Yang

Recent spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or couple MLLMs with black-box reconstruction modules, leading to weak spatial consistency, limited viewpoint diversity, and evidence chains that cannot be traced back to supporting views. Frameworks for "thinking with images" (e.g., ChatGPT-o3 and DeepEyes) show that stepwise multimodal reasoning can emerge by interleaving hypothesis formation with active acquisition of visual evidence, but they do not address three key challenges in spatial Chain-of-Thought (CoT): building global space perception under strict token budgets, explicitly associating 3D hypotheses with video frames for verification, and designing spatially grounded rewards for reinforcement learning. To address these issues, we present EagleVision, a dual-stage framework for progressive spatial cognition through macro perception and micro verification. In the macro perception stage, EagleVision employs a semantics-perspective-fusion determinantal point process (SPF-DPP) to select a compact set of geometry- and semantics-aware keyframes from long videos under a fixed token budget. In the micro verification stage, we formalize spatial CoT as BEV-grounded pose querying: the agent iteratively predicts poses on a BEV plane, retrieves the nearest real frames, and is trained purely by reinforcement learning with a spatial grounding reward that scores the consistency between predicted poses and observed views. On VSI-Bench, EagleVision achieves state-of-the-art performance among open-source vision-language models, demonstrating strong and generalizable spatial understanding.
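
The two stages described in the abstract can be illustrated with a small sketch. This is not the authors' SPF-DPP or their BEV pose-querying agent; it is a generic greedy determinantal point process (DPP) selection followed by a nearest-frame lookup, written under stated assumptions. The kernel below (cosine similarity of semantic features multiplied by an RBF on camera positions), the `quality` weights, the `yaw_weight` parameter, and all function names are hypothetical choices made only for illustration.

```python
import numpy as np


def build_kernel(sem_feats, cam_xy, quality, pos_scale=1.0):
    """Assumed DPP kernel: semantic similarity times viewpoint similarity.

    Both factors are positive semidefinite, so their elementwise product
    is too (Schur product theorem), which a DPP kernel requires.
    """
    f = sem_feats / np.linalg.norm(sem_feats, axis=1, keepdims=True)
    sem_sim = f @ f.T  # cosine similarity between frame features
    d2 = ((cam_xy[:, None, :] - cam_xy[None, :, :]) ** 2).sum(-1)
    view_sim = np.exp(-d2 / (2.0 * pos_scale ** 2))  # RBF on camera positions
    return quality[:, None] * sem_sim * view_sim * quality[None, :]


def greedy_dpp_select(kernel, k):
    """Greedy MAP approximation: add the frame that most increases log det."""
    n = kernel.shape[0]
    selected = []
    for _ in range(min(k, n)):
        best_idx, best_logdet = -1, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_idx, best_logdet = i, logdet
        if best_idx < 0:  # every remaining frame is redundant
            break
        selected.append(best_idx)
    return selected


def nearest_frame(query_xy, query_yaw, frame_xy, frame_yaw, yaw_weight=1.0):
    """Return the index of the real frame closest to a predicted BEV pose.

    The distance (position error plus weighted heading error) is an
    assumption; the paper's retrieval metric may differ.
    """
    pos_dist = np.linalg.norm(frame_xy - query_xy, axis=1)
    # Wrap heading differences into [-pi, pi] before taking magnitudes.
    dyaw = np.abs((frame_yaw - query_yaw + np.pi) % (2.0 * np.pi) - np.pi)
    return int(np.argmin(pos_dist + yaw_weight * dyaw))
```

In this sketch, near-duplicate frames make the selected sub-kernel almost singular, so the greedy step skips them in favor of frames that add new semantics or a new viewpoint. That is the general reason a DPP suits keyframe selection under a fixed budget. The lookup then maps a predicted pose on the BEV plane back to an observed frame, which is the kind of pose-to-view association the abstract describes for verification.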

cs / cs.CV
