Published: 2026/1/5 16:17:20

VIBE is born! Image editing by command 💖

  1. Ultra-short summary: Edit images with words! The high-performance AI "VIBE" is here! ✨

  2. Gal-style sparkle points

    • Edit images with text commands! No expertise needed, anyone can do it 🎉
    • Low cost & high quality! The best cost-performance editor out there 💅
    • E-commerce, social media… endless uses, the future is looking up 🔥

  3. Detailed explanation

    • Background: Thanks to AI, we've entered an era where anyone can easily do image editing that used to be hard! VIBE aims to be an amazing system that lets you edit images with natural language (plain words) 💖
    • Method: It uses the Qwen3-VL and Sana1.5 models to achieve low-cost, high-performance image editing! Give it a command like "Make this cute!" and the image transforms, it's basically magic 🪄✨
    • Results: 2nd place on the ImgEdit benchmark and high scores on GEdit too! Editing accuracy is solid, so this is exactly what everyone's been waiting for 😍
    • Significance: It's sure to be useful in all kinds of settings, like editing product images for e-commerce or touching up social media posts! Not just creators but everyday people will get to enjoy image editing, how great is that? 💖
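The method above boils down to a two-stage pipeline: a vision-language model reads the source image plus the text instruction and produces conditioning, and a diffusion model generates the edited pixels from that conditioning. Here is a minimal conceptual sketch of that flow; the class and function names are hypothetical stand-ins, not the authors' code (the real system uses Qwen3-VL for instruction understanding and Sana1.5 for generation).

```python
# Conceptual sketch of an instruction-based image editing pipeline.
# All classes here are toy stand-ins for illustration only.
from dataclasses import dataclass


@dataclass
class Image:
    pixels: list  # placeholder for real tensor data


class InstructionEncoder:
    """Stand-in for a vision-language model (e.g. Qwen3-VL)."""

    def encode(self, image: Image, instruction: str) -> list:
        # A real VLM would fuse image features with the text
        # instruction into conditioning embeddings.
        return [hash(instruction) % 1000]


class DiffusionEditor:
    """Stand-in for a diffusion backbone (e.g. Sana1.5)."""

    def edit(self, image: Image, conditioning: list) -> Image:
        # A real model would run iterative denoising conditioned on
        # both the source image (for consistency) and the embeddings.
        # This toy version just returns the source unchanged.
        return Image(pixels=list(image.pixels))


def edit_image(image: Image, instruction: str) -> Image:
    """Run the two-stage pipeline: understand, then generate."""
    conditioning = InstructionEncoder().encode(image, instruction)
    return DiffusionEditor().edit(image, conditioning)


result = edit_image(Image(pixels=[0, 0, 0]), "make the sky pink")
print(type(result).__name__)  # prints "Image"
```

The split matters for cost: the paper's point is that both stages can stay small (2B + 1.6B parameters) while still competing with much heavier single-model pipelines.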
  4. Real-life use-case ideas

    • Add sparkly effects to a photo of your fave and post it to social media 💖
    • Touch up your product photos on Mercari to look cuter and boost sales ⤴️

Read the rest in the "らくらく論文" app

VIBE: Visual Instruction Based Editor

Grigorii Alekseenko / Aleksandr Gordeev / Irina Tolstykh / Bulat Suleimanov / Vladimir Dokholyan / Georgii Fedorov / Sergey Yakubson / Aleksandra Tsybina / Mikhail Chernyshov / Maksim Kuprashevich

Instruction-based image editing is among the fastest developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as attribute adjustment, object removal, background edits, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.

cs / cs.CV / cs.AI / cs.LG