AI looks at images and thinks! CodeVision takes image recognition to the next level ✨

Published: 2025/12/3 12:44:15


Super-short summary: Give AI the power to look at an image and "think"! CodeVision sends image recognition through the roof 🚀

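🌟 Gyaru-style sparkle points ✨
● The AI writes its own code and gets to use all kinds of image-editing tools. For real!? 🤩
● It gets tougher against noise (image corruption) and orientation changes, so any image is OK! 😎
● It can combine multiple tools to pull off complex stuff too. Isn't that unbeatable? 😍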

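Detailed explanation
● Background: Image-recognition AI (multimodal large language models, MLLMs) has come a long way, but there are still plenty of problems 💦 For example, if an image is slightly rotated or noisy, even top models often fail to recognize it properly… 💔 So this research is all about letting the AI master a bunch of tools so it can look at images more smartly ✨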

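● Method: MLLMs use a framework called "CodeVision" to generate code on their own! 😳 In other words, the AI itself can issue instructions like "crop this" or "adjust the colors" 💖 Because code can call any image operation, it can chain all sorts of tools and handle complex tasks too! Multi-tool combos for the win 💪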
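To make the "AI writes its own code" idea concrete, here's a tiny sketch 💡 Heads up: this is not the actual CodeVision implementation. The `run_generated_code` helper, the convention of storing the output in a variable called `result`, and the list-of-lists "image" are all assumptions made up for illustration. The point is just that the code string itself is the tool, so a crop and a rotation can be chained without registering either one in advance.

```python
# Hypothetical sketch of a code-as-tool step (not the CodeVision implementation).
# An "image" is a plain list of pixel rows so no imaging library is needed.

def run_generated_code(code, image):
    """Execute a model-written snippet with `image` in scope.

    Assumed convention: the snippet stores its output in a variable named `result`.
    """
    namespace = {"image": image}
    exec(code, namespace)  # unsafe for untrusted code; a real system would sandbox this
    return namespace["result"]


# Stand-in for what a model might write: crop, then rotate 90 degrees clockwise.
generated_code = """
patch = [row[1:3] for row in image[0:2]]          # crop rows 0-1, columns 1-2
result = [list(r) for r in zip(*patch[::-1])]     # rotate the patch clockwise
"""

image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

output = run_generated_code(generated_code, image)
print(output)  # prints [[5, 2], [6, 3]]
```

In a real setup the snippet would more likely call an imaging library on actual pixels, but the shape of the interaction (model emits code, environment runs it, result comes back) would look about the same.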


Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Zirun Guo / Minjie Hong / Feng Zhang / Kai Jia / Tao Jin

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
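The "error recovery from runtime feedback" behaviour mentioned in the abstract can be pictured as a retry loop: run the generated code, and if it fails, hand the error text back so the next attempt can correct it. The following is only a minimal sketch under stated assumptions. `execute`, `stub_model`, and `solve` are hypothetical names, and `stub_model` is a hard-coded stand-in for an MLLM, not the authors' model or training setup.

```python
# Hypothetical multi-turn loop with error feedback (not the authors' code).

def execute(code, image):
    """Run a code string; return (result, None) on success or (None, error_text)."""
    namespace = {"image": image}
    try:
        exec(code, namespace)  # unsafe for untrusted code; sandbox in practice
        return namespace["result"], None
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"


def stub_model(feedback):
    """Stand-in for an MLLM: the first attempt is buggy, the retry fixes it."""
    if feedback is None:
        # Calls a function that was never defined, so execution fails.
        return "result = flip_horizontal(image)"
    # After seeing the error, fall back to plain slicing to mirror each row.
    return "result = [row[::-1] for row in image]"


def solve(image, max_turns=3):
    """Alternate between generating code and executing it until it succeeds."""
    feedback = None
    history = []
    for _ in range(max_turns):
        code = stub_model(feedback)
        result, error = execute(code, image)
        history.append((code, error))
        if error is None:
            return result, history
        feedback = error  # the runtime error becomes input to the next turn
    return None, history


result, history = solve([[1, 2], [3, 4]])
print(result)  # prints [[2, 1], [4, 3]]
for code, error in history:
    print(repr(code), "->", error)
```

Here the first turn fails with a `NameError`, and the second turn succeeds after the error message is fed back.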

cs / cs.CV / cs.CL
