Published: 2025/11/8 2:56:36

The Ultimate Gyaru AI Explains! What's EVLM?

Vague Instructions Are OK! The Future of Image Editing 🎨✨

Super Summary: Even with vague instructions, this amazing tech gets the AI to edit your images just right!

Gyaru-Style Sparkle Points ✨

● Self-reflective reasoning 💖: The AI looks back on its own edits and refines them to be even better!
● Multimodal editing 👗: Images, video, 3D... it can edit all kinds of content in one go!
● KTO alignment 🌟: The AI tailors its edits to match human preferences!

Read the rest in the 「らくらく論文」 app

EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing

Umar Khalid / Kashif Munir / Hasan Iqbal / Azib Farooq / Jing Hua / Nazanin Rahnavard / Chen Chen / Victor Zhu / Zhengping Ji

Editing complex visual content from ambiguous or partially specified instructions remains a core challenge in vision-language modeling. Existing models can contextualize content but often fail to infer the underlying intent within a reference image or scene, leading to inconsistent or misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system that interprets ambiguous instructions in conjunction with reference visuals to produce precise, context-aware editing prompts. EVLM's key innovation is a reflective reasoning framework that translates subjective user intent into structured, actionable outputs by aligning with human-rated rationales through Reflection-Aware KL-Divergence Target Optimization (RKTO). By combining Chain-of-Thought (CoT) reasoning with RKTO alignment, EVLM captures fine-grained editing preferences without relying on binary supervision. Trained on a dataset of 30,000 CoT examples with human-annotated rationale quality, EVLM achieves substantial gains in alignment with human intent. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent and high-quality instructions, providing a scalable foundation for multimodal editing and reasoning.
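For readers unfamiliar with KTO-style preference alignment, the sketch below illustrates the general shape of such an objective: rationales rated desirable are pushed above a KL-based reference point, undesirable ones below it. This is a minimal sketch based on the published KTO (Kahneman-Tversky Optimization) loss, not the paper's actual RKTO formulation, which is not given in this abstract; the function name `kto_style_loss` and all parameter values are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kto_style_loss(logp_policy, logp_ref, desirable, beta=0.1,
                   lambda_d=1.0, lambda_u=1.0):
    """Toy KTO-style alignment loss on per-example sequence log-probabilities.

    logp_policy, logp_ref : log p(y|x) under the policy / reference model
    desirable             : boolean array, True if the rationale was rated good
    Assumption: this follows the standard KTO objective; EVLM's RKTO may differ.
    """
    # Implicit reward: log-ratio of policy to reference model.
    reward = logp_policy - logp_ref
    # Reference point: crude batch estimate of the policy-reference KL divergence.
    z_ref = max(np.mean(reward), 0.0)
    # Desirable examples are pulled above the reference point, undesirable below it.
    loss_d = lambda_d * (1.0 - sigmoid(beta * (reward - z_ref)))
    loss_u = lambda_u * (1.0 - sigmoid(beta * (z_ref - reward)))
    return np.where(desirable, loss_d, loss_u).mean()

# Tiny example: three rationales, two rated desirable, one undesirable.
logp_policy = np.array([-12.0, -15.5, -20.0])
logp_ref    = np.array([-13.0, -15.0, -18.0])
desirable   = np.array([True, True, False])
print(kto_style_loss(logp_policy, logp_ref, desirable))
```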

cs / cs.CV