Omni2Sound Has Arrived! Is This the Strongest Video-to-Audio AI? ✨

Published: 2026/1/11 13:07:34

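TL;DR: "Omni2Sound" is an AI that generates high-quality audio from video and text, and it's seriously impressive! ✨
Gal-style sparkly highlights ✨
- It can make audio from video, from text, or from both together. Total lifesaver! 🙏
- According to the paper, it hits state-of-the-art on all three tasks with one single model! 😳
- A future where it shows up in movies, games, and all sorts of other fields? So hype 🔥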
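Detailed breakdown
- Background: Until now, audio-generation AI was mostly video-only or text-only, and combining them ran into trouble: good audio captions that line up with both the audio and the video were scarce, and the tasks ended up competing with each other 😥 Omni2Sound handles video, text, or both in one model!
- Method: First, they built "SoundAtlas", a high-quality audio caption dataset (470k pairs)! ✨ Then they used it to develop "Omni2Sound", a diffusion model that covers V2A (video→audio), T2A (text→audio), and VT2A (video & text→audio) all by itself, trained with a three-stage multi-task schedule!
- Results: The paper reports state-of-the-art results on all three tasks, whatever the input! 😳 In particular, it's apparently way better at keeping the audio aligned with the video while staying faithful on off-screen sounds (sounds from things the camera isn't showing)!
- Significance: This could make high-quality audio easy to create for video editing, game production, and lots of other situations! 🎉 Feels like the range of expression is about to go way up!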
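
To make the V2A / T2A / VT2A split concrete, here's a tiny sketch of how a single model could tell which task it's doing just from which inputs it received. Heads up: `infer_task`, `video_frames`, and `text_prompt` are made-up names for illustration only. They are not from the paper or its code, and the real model may route inputs quite differently.

```python
from typing import Optional, Sequence


def infer_task(video_frames: Optional[Sequence], text_prompt: Optional[str]) -> str:
    """Pick a task label from which conditions are present (hypothetical helper)."""
    has_video = video_frames is not None and len(video_frames) > 0
    has_text = text_prompt is not None and text_prompt.strip() != ""
    if has_video and has_text:
        return "VT2A"  # video & text -> audio
    if has_video:
        return "V2A"   # video -> audio
    if has_text:
        return "T2A"   # text -> audio
    raise ValueError("need at least one of video_frames or text_prompt")


if __name__ == "__main__":
    # Placeholder inputs; real frames would be image tensors.
    print(infer_task(["frame0", "frame1"], "a dog barking off-screen"))
    print(infer_task(["frame0"], None))
    print(infer_task(None, "rain on a window"))
```

The point is just that one entry point can serve all three tasks, so an app wouldn't need a separate model per input type.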
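Real-world use ideas 💡
- Build it into a video editing app so it auto-generates background music and sound effects that match the footage. Sounds good, right? 😍
- If games and VR/AR could get immersive soundscapes this easily, that would be so much fun! ✨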

Omni2Sound: Towards Unified Video-Text-to-Audio Generation

Yusheng Dai / Zehua Chen / Yuxuan Jiang / Baolong Gao / Qiuhong Ke / Jun Zhu / Jianfei Cai

Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight A-V-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) that significantly outperforms existing benchmarks and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate visual bias of MLLMs, a Junior-Senior Agent Handoff for a 5 times cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight V-A-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and off-screen audio generation faithfulness. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions. The project page is at https://swapforward.github.io/Omni2Sound.
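
The abstract credits the Junior-Senior Agent Handoff with a 5 times cost reduction but does not say how the handoff works. The sketch below shows one way such a figure could arise, assuming a cascade in which a cheap junior agent processes every item and escalates only a fraction to an expensive senior agent. The function names, the per-item costs, and the escalation rate are all hypothetical values chosen so the arithmetic lands on 5 times. None of them come from the paper.

```python
def handoff_cost(n_items: int, junior_cost: float, senior_cost: float,
                 escalation_rate: float) -> float:
    """Total cost when every item goes to the junior agent and a fraction
    `escalation_rate` is additionally handed off to the senior agent."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    return n_items * (junior_cost + escalation_rate * senior_cost)


def senior_only_cost(n_items: int, senior_cost: float) -> float:
    """Baseline: the senior agent processes every item."""
    return n_items * senior_cost


def cost_reduction(n_items: int, junior_cost: float, senior_cost: float,
                   escalation_rate: float) -> float:
    """Ratio of baseline cost to cascade cost (higher is better)."""
    return senior_only_cost(n_items, senior_cost) / handoff_cost(
        n_items, junior_cost, senior_cost, escalation_rate)


if __name__ == "__main__":
    # Hypothetical numbers: junior costs 1 unit per item, senior costs 10,
    # and 10% of items are escalated. 470_000 matches the dataset size.
    ratio = cost_reduction(470_000, junior_cost=1.0, senior_cost=10.0,
                           escalation_rate=0.1)
    print(f"cost reduction: {ratio:.1f}x")
```

Under this assumed cascade the ratio depends only on the cost gap and the escalation rate, not on the number of items, so the saving holds at any dataset scale as long as the junior agent resolves most items on its own.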

cs / cs.SD / cs.CV / cs.MM
