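Super-short summary: The tech for turning text into super natural-sounding voices has leveled up! With something called SFM, inference gets faster and the audio quality goes up too 🌟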
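Gal-style sparkle points ✨
● Generation starts from a midpoint instead of from pure noise, so the compute goes into the later part of the path. So smart!
● Voice quality shoots way up! Apparently it sounds just like the real thing 🎤
● Processing is faster too, so it looks set to show up in all kinds of services, right?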
Detailed explanation
We propose Shallow Flow Matching (SFM), a novel mechanism that enhances flow matching (FM)-based text-to-speech (TTS) models within a coarse-to-fine generation paradigm. Unlike conventional FM modules, which use the coarse representations from the weak generator as conditions, SFM constructs intermediate states along the FM paths from these representations. During training, we introduce an orthogonal projection method to adaptively determine the temporal position of these states, and apply a principled construction strategy based on a single-segment piecewise flow. SFM inference starts from the intermediate state rather than pure noise, thereby focusing computation on the latter stages of the FM paths. We integrate SFM into multiple TTS models with a lightweight SFM head. Experiments demonstrate that SFM yields consistent gains in speech naturalness across both objective and subjective evaluations, and significantly accelerates inference when using adaptive-step ODE solvers. Demo and code are available at https://ydqmkkx.github.io/SFMDemo/.
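To make the idea of starting from an intermediate state concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation. It assumes a linear FM path x_t = (1 - t) * noise + t * x1, reads "orthogonal projection" as projecting a coarse vector h onto the target x1 to get a time t, and uses an oracle velocity field in place of a trained network. The names `project_time`, `intermediate_state`, `euler_solve`, and `oracle_velocity` are hypothetical.

```python
import numpy as np


def project_time(h, x1):
    """Assumed reading of the projection step: the scalar t for which
    t * x1 is the orthogonal projection of h onto x1, clipped to [0, 1]."""
    t = float(np.dot(h.ravel(), x1.ravel()) / np.dot(x1.ravel(), x1.ravel()))
    return min(max(t, 0.0), 1.0)


def intermediate_state(x1, t, rng):
    """Sample a point on the assumed linear path x_t = (1 - t) * noise + t * x1."""
    noise = rng.standard_normal(x1.shape)
    return (1.0 - t) * noise + t * x1


def euler_solve(velocity, x, t_start, n_steps):
    """Integrate dx/dt = velocity(x, t) from t_start to 1 with fixed Euler steps."""
    ts = np.linspace(t_start, 1.0, n_steps + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        x = x + (t1 - t0) * velocity(x, t0)
    return x


rng = np.random.default_rng(0)
x1 = rng.standard_normal(8)  # toy stand-in for the target acoustic features

# Toy coarse representation: 0.6 * x1 plus an error component orthogonal to x1.
v = rng.standard_normal(8)
e = v - (np.dot(v, x1) / np.dot(x1, x1)) * x1
h = 0.6 * x1 + 0.1 * e

t_m = project_time(h, x1)  # approximately 0.6 by construction


def oracle_velocity(x, t):
    """Stand-in for a trained FM network: conditional velocity toward x1."""
    return (x1 - x) / (1.0 - t)


# Start from the intermediate state at t_m instead of pure noise at t = 0.
x_start = intermediate_state(x1, t_m, rng)
x_out = euler_solve(oracle_velocity, x_start, t_m, n_steps=4)
```

In this toy setting, a solver that would take 10 equal steps over [0, 1] needs only about 4 steps of the same size to cover [0.6, 1]. The abstract reports its speedup for adaptive-step ODE solvers, which this fixed-step sketch does not model.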