Published: 2025/12/4 0:54:11

Title & ultra-short summary: A top-tier AI grabs the future with multi-image understanding! ✨

🌟 Gal-style sparkle points ✨ ● This is about an AI that smartly understands multiple images (photos)! 💖 ● It mimics human thinking (the way we reason) to make the AI even smarter! 🧠 ● It's an amazing technology that brightens the future of the IT industry! 🚀

Detailed explanation ● Background: Recent AI is great at understanding single images, but it struggled to compare multiple images 😭 This research is all about overcoming that weakness! Just like a gal, the AI learns to take in all kinds of information at once and make smart judgments!

● Method: Just as humans compare images, the AI is made to reason step by step: "this one is 〇〇, and that one is △△!" 🤔 On top of that, a "memory" feature for holding onto important information is added! It's like decorating the AI to look its best, purikura-style 💖

● Result: This AI's multi-image understanding shot way up! ✨ It can now handle complex information, and its decisions became easier to interpret! Like a makeup before-and-after, the change is dramatic!


CMMCoT: Enhancing Complex Multi-Image Comprehension via Multi-Modal Chain-of-Thought and Memory Augmentation

Guanghao Zhang / Tao Zhong / Yan Xia / Mushui Liu / Zhelun Yu / Haoyuan Li / Wanggui He / Fangxun Shu / Dong She / Yi Wang / Hao Jiang

While previous multimodal slow-thinking methods have demonstrated remarkable success in single-image understanding scenarios, their effectiveness becomes fundamentally constrained when extended to more complex multi-image comprehension tasks. This limitation stems from their predominant reliance on text-based intermediate reasoning processes. Humans, by contrast, when engaging in sophisticated multi-image analysis, typically perform two complementary cognitive operations: (1) continuous cross-image visual comparison through region-of-interest matching, and (2) dynamic memorization of critical visual concepts throughout the reasoning chain. Motivated by these observations, we propose the Complex Multi-Modal Chain-of-Thought (CMMCoT) framework, a multi-step reasoning framework that mimics human-like "slow thinking" for multi-image understanding. Our approach incorporates two key innovations: (1) the construction of interleaved multimodal multi-step reasoning chains, which utilize critical visual region tokens, extracted from intermediate reasoning steps, as supervisory signals. This mechanism not only facilitates comprehensive cross-modal understanding but also enhances model interpretability. (2) The introduction of a test-time memory augmentation module that expands the model's reasoning capacity during inference while preserving parameter efficiency. Furthermore, to facilitate research in this direction, we have curated a novel multi-image slow-thinking dataset. Extensive experiments demonstrate the effectiveness of our model. Code is available at https://github.com/zhangguanghao523/CMMCoT.
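To make the memory-augmentation idea concrete, here is a minimal, hypothetical sketch of a test-time memory: embeddings of critical visual region tokens from intermediate reasoning steps are written to a slot store, and a later step retrieves the most similar stored concept. The class name, embeddings, and labels are illustrative assumptions, not the paper's actual implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

class VisualConceptMemory:
    """Toy test-time memory: stores embeddings of critical visual
    region tokens produced during intermediate reasoning steps and
    retrieves the closest match for a later query. No parameters
    are trained, mirroring the parameter-efficiency claim."""
    def __init__(self):
        self.slots = []  # list of (embedding, label) pairs

    def write(self, embedding, label):
        self.slots.append((embedding, label))

    def read(self, query):
        # Return the label of the stored concept most similar to the query.
        return max(self.slots, key=lambda s: cosine(s[0], query))[1]

# Interleaved multi-step reasoning over two images (made-up embeddings):
memory = VisualConceptMemory()
memory.write([1.0, 0.1, 0.0], "image-1: red car region")
memory.write([0.0, 0.9, 0.2], "image-2: blue truck region")

# A later reasoning step queries for the vehicle seen in image 2.
print(memory.read([0.1, 1.0, 0.1]))  # -> image-2: blue truck region
```

A real system would store learned visual token embeddings and use attention over memory slots, but the retrieve-by-similarity pattern above is the core mechanism being described.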

cs / cs.CV