Published: 2025/11/7 20:50:54

The Ultimate Gal AI Is Born! Visual Reasoning AI Is Hyping Up the Future ✨

Super Short Summary: An AI that understands the world visually is here! Business is going to take off, too 🚀

🌟 Gal-Style Sparkle Points ✨
● With over 1 million training examples, the AI gets to master visual reasoning. Amazing 😍
● You can build an AI that verifies its own reasoning and sets its own goals, just like a human. Seriously wild, right?
● It can shine in all kinds of fields, from image recognition to robot control. Business opportunities incoming 😎

Detailed Explanation
● Background: AI has been great at text, but a bit weak at understanding images and video 😢 This work sets out to make AI genuinely good at "seeing", that is, at visual reasoning! ✨

● Method: They build over 1 million image-question pairs and train the AI on them 💖 The questions start out simple and are deliberately staged so the model can take on progressively harder problems, kind of like a gal leveling up, right? 😎 A rough sketch of this staged setup follows below.
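Here is a minimal, self-contained sketch of what "start simple, then compose harder questions" could look like in code. This is not the paper's actual pipeline: the annotation format, question templates, and complexity levels are hypothetical stand-ins used purely to illustrate a staged, easy-to-hard dataset build.

```python
# Hypothetical illustration of a staged (easy -> hard) synthetic question build.
# The annotation schema and templates below are assumptions, not the authors' recipe.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class ImageAnnotation:
    image_id: str
    objects: dict  # object name -> attribute dict, e.g. {"cup": {"color": "red"}}

def simple_questions(ann):
    """Complexity level 1: one question per object attribute."""
    for obj, attrs in ann.objects.items():
        for attr, value in attrs.items():
            yield {
                "image_id": ann.image_id,
                "complexity": 1,
                "question": f"What is the {attr} of the {obj}?",
                "answer": value,
            }

def composed_questions(ann):
    """Complexity level 2: combine two simple facts into one comparison question."""
    facts = list(simple_questions(ann))
    for a, b in combinations(facts, 2):
        yield {
            "image_id": ann.image_id,
            "complexity": 2,
            "question": (f"Considering both ({a['question']}) and ({b['question']}), "
                         "do they have the same answer?"),
            "answer": "yes" if a["answer"] == b["answer"] else "no",
        }

def build_curriculum(annotations):
    """Return the dataset ordered from easy to hard, mimicking a staged curriculum."""
    data = []
    for ann in annotations:
        data.extend(simple_questions(ann))
        data.extend(composed_questions(ann))
    return sorted(data, key=lambda ex: ex["complexity"])

if __name__ == "__main__":
    ann = ImageAnnotation("img_0001", {"cup": {"color": "red"}, "plate": {"color": "red"}})
    for ex in build_curriculum([ann]):
        print(ex["complexity"], ex["question"], "->", ex["answer"])
```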

Read the rest in the 「らくらく論文」 app

Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale

David Acuna / Chao-Han Huck Yang / Yuntian Deng / Jaehun Jung / Ximing Lu / Prithviraj Ammanabrolu / Hyunwoo Kim / Yuan-Hong Liao / Yejin Choi

Recent progress in multimodal reasoning has been driven largely by undisclosed datasets and proprietary data synthesis recipes, leaving open questions about how to systematically build large-scale, vision-centric reasoning datasets, particularly for tasks that go beyond visual math. In this work, we introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions. The dataset also includes preference data and instruction prompts supporting both offline and online RL. Our synthesis framework proceeds in two stages: (1) scale and (2) complexity. Reasoning traces are then synthesized through a two-stage process that leverages VLMs and reasoning LLMs, producing CoT traces for VLMs that capture the richness and diverse cognitive behaviors found in frontier reasoning models. Remarkably, we show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks, and even surpasses strong closed-data models such as MiMo-VL-7B-RL on V* Bench, CV-Bench and MMStar-V. Perhaps most surprisingly, despite being entirely vision-centric, our data transfers positively to text-only reasoning (MMLU-Pro) and audio reasoning (MMAU), demonstrating its effectiveness. Similarly, despite not containing videos or embodied visual data, we observe notable gains when evaluating on a single-evidence embodied QA benchmark (NiEH). Finally, we use our data to analyze the entire VLM post-training pipeline. Our empirical analysis highlights that (i) SFT on high-quality data with non-linear reasoning traces is essential for effective online RL, (ii) staged offline RL matches online RL's performance while reducing compute demands, and (iii) careful SFT on high-quality data can substantially improve out-of-domain, cross-modality transfer.
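As a rough illustration of the two-stage trace synthesis the abstract describes (a VLM first grounds the image in text, then a reasoning LLM writes the long chain-of-thought over that grounding), here is a hedged sketch. The function names `vlm_describe` and `reasoner_generate` are hypothetical placeholders rather than real APIs from the paper, and the prompt wording is an assumption.

```python
# Minimal sketch (assumptions, not the paper's exact recipe) of two-stage trace synthesis:
# Stage 1: a VLM turns the image into a grounded textual description of the evidence.
# Stage 2: a text-only reasoning LLM writes a long CoT conditioned on that grounding.
from typing import Callable, Dict

def synthesize_trace(
    image_path: str,
    question: str,
    vlm_describe: Callable[[str, str], str],     # hypothetical VLM call
    reasoner_generate: Callable[[str], str],     # hypothetical reasoning-LLM call
) -> Dict[str, str]:
    # Stage 1: extract the vision-centric evidence relevant to the question
    # (objects, attributes, spatial relations) as plain text.
    grounding = vlm_describe(image_path, question)

    # Stage 2: the reasoning LLM produces the long chain-of-thought over the grounding
    # instead of the raw pixels, so its reasoning style can later be distilled into a VLM.
    prompt = (
        "Image evidence:\n" + grounding + "\n\n"
        "Question: " + question + "\n"
        "Think step by step, verify intermediate conclusions, then give a final answer."
    )
    trace = reasoner_generate(prompt)
    return {"question": question, "grounding": grounding, "reasoning_trace": trace}

if __name__ == "__main__":
    # Stubbed model calls so the sketch runs without any external API.
    fake_vlm = lambda img, q: "A red cup sits to the left of a blue plate."
    fake_llm = lambda p: "The cup is described as left of the plate... Final answer: left."
    print(synthesize_trace("example.jpg",
                           "Where is the cup relative to the plate?",
                           fake_vlm, fake_llm))
```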

cs / cs.CV / cs.AI / cs.CL