MM-Sonate Has Landed! Zero-Shot Voice Cloning Totally Transforms Video 💖

Published: 2026/1/4 15:26:15


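Super-short summary: An AI that generates seriously slick voice and video from text and images has landed! It can copy a speaker's voice in seconds ☆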

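✨ Gal-Style Sparkle Points ✨
● It can copy a voice zero-shot (no extra training on that speaker), which is godly! The era of making videos in your fave's voice is here ✨
● The tech that eliminates drift between audio and video is amazing! Talk about instant sync, right?
● Text, images, audio... you can make videos from all kinds of inputs, which is a huge boost for creators!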

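Time for the detailed breakdown!

Background: Making videos is a lot of work, right? But this AI apparently cranks out videos while keeping the voice's timbre (its individual character) intact! Voice cloning that was hard with existing tech now works without training on that voice from scratch, for real!?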

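Method: Feed the AI text, an image, and reference audio (the voice you want to copy)! The AI then analyzes the voice's timbre and generates audio that lines up perfectly with the video! It also makes clever use of noise, so the audio quality shoots way up!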


MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

Chunyu Qiang / Jun Wang / Xiaopeng Wang / Kang Yin / Yuxin Guo / Xijuan Zeng / Nan Li / Zihan Li / Yuzhe Liang / Ziyu Zhang / Teng Ma / Yushen Chen / Zhongliang Liu / Feng Deng / Chen Zhang / Pengfei Wan

Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
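To make the guidance idea in the abstract a little more concrete, here is a minimal sketch of classifier-free-guidance-style sampling for a flow-matching model where the negative branch is conditioned on a noise embedding instead of an empty condition. The abstract does not give MM-Sonate's exact formulation, so everything here is an assumption for illustration: the standard guidance formula, the fixed-step Euler sampler, and the names `guided_velocity`, `euler_sample`, and `toy_velocity` (a hand-written stand-in for a trained network) are all hypothetical.

```python
import numpy as np

def guided_velocity(v_cond, v_neg, scale):
    """Standard guidance combination (assumed, not taken from the paper):
    start at the negative branch and extrapolate toward the conditional one."""
    return v_neg + scale * (v_cond - v_neg)

def toy_velocity(x, t, cond):
    """Hypothetical stand-in for a trained velocity network.
    It points from the current sample straight at `cond`, scaled so that a
    straight-line path starting at time t reaches `cond` exactly at t = 1."""
    return (cond - x) / (1.0 - t)

def euler_sample(velocity_fn, x0, cond, neg_cond, scale=2.0, steps=8):
    """Integrate dx/dt = v(x, t) from t = 0 toward t = 1 with fixed-step Euler,
    evaluating both the conditional and the negative branch at every step."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1, so toy_velocity never divides by 0
        v_cond = velocity_fn(x, t, cond)
        v_neg = velocity_fn(x, t, neg_cond)  # neg_cond: noise embedding (assumption)
        x = x + dt * guided_velocity(v_cond, v_neg, scale)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal(4)            # initial Gaussian sample
    cond = np.array([1.0, 2.0, 3.0, 4.0])  # stand-in for the desired condition
    neg = rng.standard_normal(4) * 0.1     # stand-in for a natural-noise embedding
    print(euler_sample(toy_velocity, x0, cond, neg, scale=2.0))
```

With `scale=1.0` the negative branch cancels out and the sampler simply follows the conditional velocity. Larger scales push the result further away from whatever the negative condition represents, which is the general intuition behind swapping an empty negative condition for a noise-derived one.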

cs / cs.SD / cs.AI / cs.CV / cs.MM / eess.AS
