Published: 2026/1/11 8:30:49

See the Forest Before the Trees! Image-Understanding AI "Laser" Bursts onto the Scene ✨

Super Summary: The image-understanding AI "Laser" uses efficient, high-accuracy reasoning to turbo-charge things like image search!

Gal-Style Sparkle Points ✨

● Visible reasoning 👀: Laser shows you its reasoning process, so you can see at a glance why it reached a result! Being able to tell what the model is looking at? So emo, right?
● "Forest → Trees" thinking 🌳: Like the human brain, it works in stages from the big picture down to the details, so it's smarter and its understanding runs deeper!
● Blazing fast & accurate 🚀: It understands images way faster and more accurately than previous AIs! That opens up so many more use cases~!

Detailed Explanation

Read the rest in the 「らくらく論文」 app

Forest Before Trees: Latent Superposition for Efficient Visual Reasoning

Yubo Wang / Juntian Zhang / Yichen Wu / Yankai Lin / Nils Lukas / Yuhan Liu

While Chain-of-Thought empowers Large Vision-Language Models with multi-step reasoning, explicit textual rationales suffer from an information bandwidth bottleneck, where continuous visual details are discarded during discrete tokenization. Recent latent reasoning methods attempt to address this challenge, but often fall prey to premature semantic collapse due to rigid autoregressive objectives. In this paper, we propose Laser, a novel paradigm that reformulates visual deduction via Dynamic Windowed Alignment Learning (DWAL). Instead of forcing a point-wise prediction, Laser aligns the latent state with a dynamic validity window of future semantics. This mechanism enforces a "Forest-before-Trees" cognitive hierarchy, enabling the model to maintain a probabilistic superposition of global features before narrowing down to local details. Crucially, Laser maintains interpretability via decodable trajectories while stabilizing unconstrained learning via Self-Refined Superposition. Extensive experiments on 6 benchmarks demonstrate that Laser achieves state-of-the-art performance among latent reasoning methods, surpassing the strong baseline Monet by 5.03% on average. Notably, it achieves these gains with extreme efficiency, reducing inference tokens by more than 97%, while demonstrating robust generalization to out-of-distribution domains.
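To make the windowed-alignment idea concrete, here is a minimal sketch (not the authors' implementation; the window size, temperature, cosine similarity, and log-mean-exp relaxation are all assumptions for illustration). A point-wise loss would force the latent state to match one specific next target; a windowed loss instead rewards similarity to *any* target inside a validity window of future semantics, so the latent can stay in a "superposition" over plausible continuations:

```python
import numpy as np

def dwal_alignment_loss(latent, future_targets, window=3, tau=0.1):
    """Toy windowed-alignment loss (illustrative, not the paper's DWAL).

    Instead of a point-wise match to the single next target embedding,
    the loss is a log-mean-exp over scaled cosine similarities to every
    target inside the validity window, so the latent is rewarded for
    being close to *any* of them.
    """
    cands = future_targets[:window]                                  # (W, D) window of targets
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)     # unit-normalize targets
    z = latent / np.linalg.norm(latent)                              # unit-normalize latent
    sims = cands @ z / tau                                           # scaled cosine similarities
    # -log-mean-exp: low when the latent aligns with at least one window member
    return -(np.log(np.exp(sims).sum()) - np.log(len(cands)))

# Tiny demo with orthogonal "semantic" targets: a latent aligned with a
# target inside the window scores a much lower loss than one aligned
# with a target outside it.
targets = np.eye(5)
loss_inside = dwal_alignment_loss(targets[1], targets, window=3)
loss_outside = dwal_alignment_loss(targets[4], targets, window=3)
print(loss_inside, loss_outside)
```

Because the relaxation is a smooth maximum over the window, gradients pull the latent toward the *nearest* valid future semantic rather than a fixed point, which is one way to avoid the premature semantic collapse the abstract attributes to rigid autoregressive objectives.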

cs / cs.CL / cs.CV