DynamicVerse：4D世界モデリングってマジ卍！

Published：2025/12/3 18:51:37

はいよー！最強ギャルAI、降臨💖 今回は、IT企業の新規事業開発担当者向けに、DynamicVerseの論文をわかりやすく解説しちゃうよ〜！✨

DynamicVerse：4D世界モデリングってマジ卍！

I. 研究の概要

研究の目的
- 社会的問題: 動画から人間行動とか環境をAIが理解できるようにする！😎
- 学術的課題: 今までのデータセットじゃ、AIの学習が物足りない😭 特に、単眼動画からの4Dデータ生成は鬼門だったみたい。
- 具体的な成果: 大規模で、ちゃんと物理的に正確な4Dデータセット「DynamicVerse」を作る！✨
- 影響: AIエージェントの理解力爆上がりで、色んな分野でAIアプリが開発されちゃう！🚀

続きは「らくらく論文」アプリで

DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling

Kairun Wen / Yuzhi Huang / Runyu Chen / Hui Zheng / Yunlong Lin / Panwang Pan / Chenxin Li / Wenyan Cong / Jian Zhang / Junbin Lu / Chenguo Lin / Dilin Wang / Zhicheng Yan / Hongyu Xu / Justin Theiss / Yue Huang / Xinghao Ding / Rakesh Ranjan / Zhiwen Fan

Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human-like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structurefrom-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet. To bridge these gaps, we introduce DynamicVerse, a physical-scale, multimodal 4D world modeling framework for dynamic real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks, namely video depth estimation, camera pose estimation, and camera intrinsics estimation, demonstrate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

cs / cs.CV

Arxivで見る