Ultra-short summary: Research on spotting fake (deepfake) audio with AI that combines sound and text!
● MLLMs are amazing! They understand both audio and text, so they might be even smarter at catching deepfakes 💖
● They explain themselves! They tell you why they decided something is fake, which makes it super easy to understand ♪
● Great for security! They could help in all kinds of settings, like voice authentication and fighting disinformation, right?
While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we explore the potential of MLLMs for audio deepfake detection, combining audio inputs with a range of text prompts as queries to assess whether MLLMs can learn robust representations across modalities for this task. To this end, we explore text-aware, context-rich, question-answer-based prompts with binary decisions. We hypothesise that such feature-guided reasoning facilitates deeper multimodal understanding and enables robust feature learning for audio deepfake detection. We evaluate two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. Without task-specific training, the models perform poorly and struggle to generalise to out-of-domain data; with minimal supervision, however, they achieve good performance on in-domain data, indicating promising potential for audio deepfake detection.
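As a concrete illustration of the prompting setup described above, here is a minimal zero-shot sketch that pairs an audio clip with binary-decision, question-answer style text prompts, assuming the standard Hugging Face transformers interface for Qwen2-Audio-7B-Instruct. The prompt wordings and the `sample.wav` path are hypothetical placeholders for illustration, not the paper's exact prompts or data.

```python
# Zero-shot audio deepfake query sketch for Qwen2-Audio-7B-Instruct.
# Assumptions: Hugging Face transformers with Qwen2-Audio support installed;
# "sample.wav" and the prompt texts below are illustrative, not from the paper.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2AudioForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

# Context-rich, question-answer style prompts that force a binary decision.
prompts = [
    "Listen to this audio clip. Is the speech real (recorded from a human) "
    "or fake (machine-generated)? Answer with exactly one word: real or fake.",
    "Examine the prosody, breathing, and synthesis artefacts in this recording. "
    "Is it an audio deepfake? Answer yes or no.",
]

# Load the clip at the sampling rate the audio encoder expects.
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

for prompt in prompts:
    # Each query interleaves the audio with one text prompt in a chat turn.
    conversation = [
        {"role": "user", "content": [
            {"type": "audio", "audio_url": "sample.wav"},
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(
        conversation, add_generation_prompt=True, tokenize=False
    )
    inputs = processor(
        text=text, audios=[audio], return_tensors="pt", padding=True
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=16)
    # Strip the prompt tokens and decode only the model's answer.
    answer_ids = generated[:, inputs.input_ids.size(1):]
    print(processor.batch_decode(answer_ids, skip_special_tokens=True)[0])
```

Aggregating the binary answers across several such prompts is one plausible reading of the multi-prompt approach; the fine-tuned mode would additionally train the model on labelled real/fake pairs formatted as these question-answer turns.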