Ultra-short summary: Speech recognition and speaker diarization, unified with an LLM! Tech that understands multi-party conversations like meetings in seconds 🌟
● Handles overlapping (simultaneous) speech too — seriously impressive! ● Pinpoints the timing between each speaker and what they said — writing meeting minutes becomes a breeze 💖 ● With the power of an LLM (large language model), audio understanding gets blazing fast and highly accurate 🚀
Background: These days, AI that can understand multi-party conversations like meetings has become super important in the IT industry, right? But with conventional techniques, it was hard to accurately figure out who said what, and when 💦
Method: Enter "TagSpeech"! It fuses ASR (speech recognition) and diarization (speaker separation) using an LLM as the brains! It can identify who spoke, when, and what they said, all in one shot 👀✨
We present TagSpeech, a unified LLM-based framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that not only supports fine-grained timestamp prediction but also acts as a synchronization signal between semantic understanding and speaker tracking. Compared to previous works that primarily focus on speaker-attributed ASR or implicit diarization, TagSpeech addresses the challenge of fine-grained speaker-content alignment and explicitly models "who spoke what and when" in an end-to-end manner. Experiments on AMI and AliMeeting benchmarks demonstrate that our method achieves consistent improvements in Diarization Error Rate (DER) over strong end-to-end baselines, including Qwen-Omni and Gemini, particularly in handling complex speech overlaps. Moreover, TagSpeech employs a parameter-efficient training paradigm in which the LLM backbone is frozen and only lightweight projectors are trained, resulting in strong performance with low computational cost.
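To make the "who spoke what and when" target concrete, here is a minimal sketch of how a Serialized Output Training (SOT) style target string with interleaved time anchors might be constructed. The token format (`<t=...>` anchors, `<spk*>` tags), the `Segment` class, and the quantization resolution are all illustrative assumptions for this sketch, not the paper's actual tokenization.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # speaker label, e.g. "spk1"
    text: str      # transcribed words for this turn
    start: float   # segment start time (seconds)
    end: float     # segment end time (seconds)

def serialize_with_anchors(segments, anchor_res=0.1):
    """Build an SOT-style target: segments sorted by start time,
    each wrapped in quantized time-anchor tokens and prefixed with
    a speaker tag, so a decoder can jointly learn transcription,
    speaker attribution, and fine-grained timestamps."""
    tokens = []
    for seg in sorted(segments, key=lambda s: s.start):
        start_q = round(seg.start / anchor_res) * anchor_res
        end_q = round(seg.end / anchor_res) * anchor_res
        tokens.append(f"<t={start_q:.1f}>")
        tokens.append(f"<{seg.speaker}>")
        tokens.append(seg.text)
        tokens.append(f"<t={end_q:.1f}>")
    return " ".join(tokens)

# Overlapping speech is naturally representable: segments are
# ordered by start time, and the anchors expose the overlap.
segs = [
    Segment("spk1", "good morning everyone", 0.0, 1.4),
    Segment("spk2", "morning", 0.9, 1.3),  # overlapped back-channel
]
print(serialize_with_anchors(segs))
```

Note how the anchors double as synchronization points: the same time tokens that carry timestamp supervision also align the semantic stream (the words) with the speaker stream (the tags), which is the role the abstract attributes to the interleaved anchor mechanism.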