Voice search levels up 💖 Audio & text, fused together!

Published: 2025/12/16 5:58:25

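Super summary: Voice search is on fire! This research fuses audio and text so that searching speech gets way smarter 🌟
Sparkly gal points ✨
- Audio and text are learned together, so the model holds up against all kinds of voices and noise!
- One model handles both keyword spotting (KWS) and spoken term detection (STD, finding words inside speech). How handy is that?
- Search accuracy goes up, and new services might even pop up 💕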
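Detailed breakdown
- Background: Until now, speech retrieval models were trained with just one modality, optimized audio-audio and audio-text alignment separately, and needed a separate model for each task. That caused all sorts of trouble 🥺
- Method: Audio and text are trained together with "multimodal contrastive learning"! The same word, whether spoken or written, ends up close together in one shared embedding space, while different words get pushed apart ✨
- Results: The model reportedly searches well across different speakers and even in noisy places 🎵 It beats existing Acoustic Word Embedding baselines on word discrimination, and since one model covers both kinds of search, that's amazing 😍
- Significance (the seriously cool ♡ point): As voice search evolves, new services can be born and way more information becomes reachable! The future looks exciting 🎶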
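To make the "shared embedding space" idea a bit more concrete, here is a minimal sketch of how retrieval could work once text and audio land in the same space: a query vector is compared with every stored speech segment by cosine similarity. The 3-dimensional vectors, the segment ids, and the `rank_segments` helper are all made up for illustration; they are not outputs of the paper's model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_segments(query_embedding, segment_embeddings):
    """Return segment ids sorted by similarity to the query, best match first."""
    scores = {
        seg_id: cosine(query_embedding, emb)
        for seg_id, emb in segment_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical, hand-written embeddings of three speech segments.
segments = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.0, 1.0, 0.1],
    "clip_c": [0.1, 0.0, 0.95],
}

# Stand-ins for encoder outputs: one query typed as text, one query spoken aloud.
text_query = [1.0, 0.0, 0.1]
audio_query = [0.05, 0.9, 0.0]

text_ranking = rank_segments(text_query, segments)
audio_ranking = rank_segments(audio_query, segments)
print(text_ranking[0], audio_ranking[0])
```

Because both query types live in the same space, the same ranking function serves either one, which is the intuition behind using a single model for more than one retrieval task.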
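Real-world use ideas 💡
- On video sites, you might be able to search what people are saying by keyword! Jumping straight to the part you care about sounds great, right? 🤩
- Voice-controlled home appliances could get smarter and learn to do a lot more!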

Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting

Ramesh Gundluru / Shubham Gupta / Sri Rama Murty K

Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations and (ii) audio-audio contrastive learning, via Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.
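The two-term objective described in the abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' implementation: the audio-text term is written as a symmetric cross-entropy over a similarity matrix (the general CLIP/CLAP-style form), and the audio-audio term is a generic supervised contrastive loss used as a stand-in, since the abstract does not give the formula for the DWD loss. The temperature, the `weight` parameter, and the toy 2-dimensional embeddings are all hypothetical.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def similarity_matrix(rows, cols, temperature):
    """Cosine similarity of every row/col pair, divided by the temperature."""
    rows = [normalize(r) for r in rows]
    cols = [normalize(c) for c in cols]
    return [[sum(a * b for a, b in zip(r, c)) / temperature for c in cols]
            for r in rows]


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def audio_text_loss(audio, text, temperature=0.1):
    """Symmetric cross-entropy; audio[i] and text[i] are assumed to be the same word.
    Simplification: repeated words in a batch are still treated as negatives."""
    sim = similarity_matrix(audio, text, temperature)
    n = len(sim)
    a2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    t2a = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (a2t + t2a)


def audio_audio_loss(audio, labels, temperature=0.1):
    """Supervised contrastive stand-in for the audio-audio term (not the DWD loss)."""
    sim = similarity_matrix(audio, audio, temperature)
    n = len(audio)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        logp = log_softmax([sim[i][j] for j in others])
        for k, j in enumerate(others):
            if labels[j] == labels[i]:  # same word spoken again: a positive pair
                total -= logp[k]
                count += 1
    return total / count if count else 0.0


def joint_loss(audio, text, labels, weight=1.0):
    """Sum of the cross-modal and audio-audio terms; `weight` is an assumed knob."""
    return audio_text_loss(audio, text) + weight * audio_audio_loss(audio, labels)


# Toy batch: two utterances of word 0 and two of word 1 (hypothetical values).
audio = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
text = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

aligned = joint_loss(audio, text, labels)
misaligned = joint_loss(audio, text_swapped, labels)
print(aligned < misaligned)
```

In this toy batch the loss is lower when each utterance is paired with the text of its own word than when the texts are swapped, which is the behavior a joint objective of this shape is meant to reward.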

cs / cs.SD / cs.LG
