InteractiveOmni is here! An AI that chats with you in audio and video ✨
Title & Super Summary: InteractiveOmni! An amazing AI that converses through audio & video!
Gal-Style Sparkle Points
Detailed Explanation
Background: Today's LLMs (large language models) are already impressive, right? ✨ But they'd be even cooler if they could handle not just text but audio and video too! And that's exactly why InteractiveOmni was born 💖
Method: It's a super model that understands audio, images, video, and text all together! 🤯 From all that information it can hold human-like conversations, and it's apparently great at long-term memory and emotional expression too 😳
Original abstract: We introduce InteractiveOmni, a unified and open-source omni-modal large language model for audio-visual multi-turn interaction, ranging from 4B to 8B parameters, designed to lead the field of lightweight models by offering comprehensive omni-modal understanding and speech generation capabilities. To achieve this, we integrate the vision encoder, audio encoder, large language model, and speech decoder into a unified model for understanding and generation tasks. We design a multi-stage training strategy to ensure robust cross-modal capabilities, including pre-training for omni-modal understanding, followed by post-training with speech conversation and audio-visual interaction. To enable human-like long-term conversational ability, we meticulously curate a multi-turn training dataset that enhances the model's ability to handle complex and multi-turn interactions. To effectively evaluate the multi-turn memory and speech interaction capabilities, we construct the multi-modal multi-turn memory benchmark and the multi-turn speech interaction benchmark. Experiments demonstrate that InteractiveOmni significantly outperforms leading open-source models and provides a more intelligent multi-turn audio-visual experience, particularly in its long-term memory capabilities. Notably, InteractiveOmni-4B is comparable to much larger models such as Qwen2.5-Omni-7B on general benchmarks, and it retains 97% of the performance of InteractiveOmni-8B while using only 50% of the model size. Achieving state-of-the-art results against similarly sized models across image, audio, video understanding, and speech generation tasks, InteractiveOmni is an accessible, open-source foundation for next-generation intelligent interactive systems.
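The abstract says the model wires a vision encoder, an audio encoder, a large language model, and a speech decoder into one network that handles both understanding and generation. As a rough illustration of that wiring, here is a minimal PyTorch sketch; every class name, dimension, and projection below is a hypothetical stand-in for exposition, not InteractiveOmni's actual architecture or API.

```python
# Conceptual sketch only: InteractiveOmni's real code is not shown in this post,
# so all modules, sizes, and heads below are illustrative assumptions.
import torch
import torch.nn as nn

class OmniModalChatModel(nn.Module):
    """Toy illustration of the unified design described in the abstract:
    vision and audio encoders feed a shared LLM backbone, whose hidden
    states drive both a text head and a speech decoder."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        # Stand-ins for the real encoders (e.g. a ViT and an audio transformer).
        self.vision_encoder = nn.Linear(768, d_model)   # per-patch image features -> LLM space
        self.audio_encoder = nn.Linear(128, d_model)    # per-frame audio features -> LLM space
        self.text_embed = nn.Embedding(32000, d_model)  # text token embeddings
        # Stand-in for the large language model backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Stand-in for the speech decoder that turns hidden states into speech units.
        self.speech_decoder = nn.Linear(d_model, 1024)  # e.g. discrete speech-token logits

    def forward(self, image_feats, audio_feats, text_ids):
        # Project every modality into the same embedding space and concatenate
        # along the sequence axis, so the backbone attends over all of them jointly.
        seq = torch.cat(
            [self.vision_encoder(image_feats),
             self.audio_encoder(audio_feats),
             self.text_embed(text_ids)],
            dim=1,
        )
        hidden = self.llm(seq)
        text_logits = hidden @ self.text_embed.weight.T  # tied-weight text head
        speech_logits = self.speech_decoder(hidden)      # speech-unit head
        return text_logits, speech_logits

# Smoke test with random inputs (batch of 1).
model = OmniModalChatModel()
img = torch.randn(1, 16, 768)            # 16 image patches
aud = torch.randn(1, 50, 128)            # 50 audio frames
txt = torch.randint(0, 32000, (1, 8))    # 8 text tokens
text_logits, speech_logits = model(img, aud, txt)
print(text_logits.shape, speech_logits.shape)
```

The pattern sketched here, projecting each modality into the LLM's embedding space and letting the backbone attend over the combined sequence, is a common recipe for omni-modal LLMs; whether InteractiveOmni uses exactly this token-concatenation scheme is an assumption on our part.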