Super-short summary: a TTS technique that can control emotion word by word has arrived!
✨ Gal-style sparkle points ✨
● Controlling emotion for each individual word is basically magic 🧙♀️!
● It can speak in all kinds of voices even with very little data, which is amazing 💖
● Sounds useful for virtual YouTubers (VTubers) and voice actors too, right?
Here comes the detailed explanation!
Background: Until now, TTS (text-to-speech) could only express emotion across a whole sentence 😥. But this research aims for a TTS that can control emotion at the word level! For example, you could say the word "happy" in a super-happy voice 😍
While emotional text-to-speech (TTS) has made significant progress, most existing research remains limited to utterance-level emotional expression and fails to support word-level control. Achieving word-level expressive control poses fundamental challenges, primarily due to the complexity of modeling multi-emotion transitions and the scarcity of annotated datasets that capture intra-sentence emotional and prosodic variation. In this paper, we propose WeSCon, the first self-training framework that enables word-level control of both emotion and speaking rate in a pretrained zero-shot TTS model, without relying on datasets containing intra-sentence emotion or speed transitions. Our method introduces a transition-smoothing strategy and a dynamic speed control mechanism to guide the pretrained TTS model in performing word-level expressive synthesis through a multi-round inference process. To further simplify the inference, we incorporate a dynamic emotional attention bias mechanism and fine-tune the model via self-training, thereby activating its ability for word-level expressive control in an end-to-end manner. Experimental results show that WeSCon effectively overcomes data scarcity, achieving state-of-the-art performance in word-level emotional expression control while preserving the strong zero-shot synthesis capabilities of the original TTS model.
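To make the multi-round inference idea concrete, here is a minimal toy sketch: words carrying the same emotion and speed tags are grouped into segments (one inference round each), and adjacent segments are blended at the boundary, a heavily simplified stand-in for the paper's transition-smoothing strategy. The `WordSpan`/`group_spans`/`crossfade` names and the list-of-floats "audio" are illustrative assumptions, not WeSCon's actual interface.

```python
from dataclasses import dataclass

@dataclass
class WordSpan:
    word: str      # one word of the input text
    emotion: str   # per-word emotion label (assumed annotation format)
    speed: float   # relative speaking rate for this word

def group_spans(spans):
    """Merge consecutive words sharing the same emotion and speed into
    segments, so each segment needs only one synthesis round."""
    segments = []
    for s in spans:
        if (segments
                and segments[-1]["emotion"] == s.emotion
                and segments[-1]["speed"] == s.speed):
            segments[-1]["words"].append(s.word)
        else:
            segments.append({"words": [s.word],
                             "emotion": s.emotion,
                             "speed": s.speed})
    return segments

def crossfade(a, b, overlap=3):
    """Linearly blend the tail of segment `a` into the head of segment `b`
    over `overlap` samples -- a crude proxy for transition smoothing."""
    if overlap == 0 or not a or not b:
        return a + b
    overlap = min(overlap, len(a), len(b))
    blended = [
        a[len(a) - overlap + i] * (1 - (i + 1) / (overlap + 1))
        + b[i] * ((i + 1) / (overlap + 1))
        for i in range(overlap)
    ]
    return a[:-overlap] + blended + b[overlap:]
```

In the real system the per-segment synthesis would be the pretrained zero-shot TTS model conditioned on the segment's emotion and speed, and the end-to-end fine-tuned model eventually replaces this explicit multi-round loop entirely.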