UTILIZING NEURAL TRANSDUCERS FOR TWO-STAGE TEXT-TO-SPEECH VIA SEMANTIC TOKEN PREDICTION

Authors: Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Semin Kim, Joun Yeop Lee, and Nam Soo Kim

Abstract

We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages, utilizing discrete semantic tokens obtained from wav2vec2.0 embeddings. For robust and efficient alignment modeling, we employ a neural transducer, named the token transducer, for semantic token prediction, benefiting from its hard monotonic alignment constraints. Subsequently, a non-autoregressive (NAR) speech generator efficiently synthesizes waveforms from these semantic tokens. Additionally, a reference speech controls temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing each stage to focus on either semantic or acoustic modeling. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity, both objectively and subjectively. We also examine the inference speed and prosody control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.


Zero-Shot Adaptive TTS
[Case : $\textbf{s}_{ref}^{tran} = \textbf{s}_{ref}^{gen}$]

$\textbf{s}_{ref}^{tran}$ : reference speech for token transducer
$\textbf{s}_{ref}^{gen}$ : reference speech for speech generator


The models are trained on LibriTTS (train-clean-100, train-clean-360, train-other-500).
Audio samples are drawn from the test-clean and test-other subsets, ensuring no overlap with speakers in the training set.
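The speaker-disjoint condition can be checked mechanically. The sketch below is only an illustration, not the authors' evaluation code: it assumes utterance IDs follow the LibriTTS naming pattern `speaker_chapter_paragraph_utterance`, and the helper names and the example IDs are made up for demonstration.

```python
from typing import Iterable, Set


def speaker_id(utt_id: str) -> str:
    """Return the speaker field, assumed to be the part before the first underscore."""
    return utt_id.split("_", 1)[0]


def shared_speakers(train_ids: Iterable[str], test_ids: Iterable[str]) -> Set[str]:
    """Return speakers that appear in both the training and the test utterance lists."""
    train_speakers = {speaker_id(u) for u in train_ids}
    test_speakers = {speaker_id(u) for u in test_ids}
    return train_speakers & test_speakers


# Made-up IDs for illustration only (not taken from the actual splits).
train = ["1001_2001_000000_000001", "1002_2002_000001_000001"]
test = ["3001_4001_000001_000001", "3002_4002_000005_000000"]

overlap = shared_speakers(train, test)
print("speaker overlap:", overlap)
```

An empty set means every test speaker is unseen during training, which is the zero-shot condition described above.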


Text

I allude to the goddess.

Ground Truth

Reference (Prompt)


VITS

VALL-E

Proposed-lstm

Proposed-conformer

Larch, by refusing to appear, practically admitted the charges against him and did not oppose the separation.


Once across that, they would be out of his power, but it seemed impossible to cross.




Text

Bartley leaned over her shoulder, without touching her, and whispered in her ear: "You are giving me a chance?"

Ground Truth

Reference (Prompt)


VITS

VALL-E

Proposed-lstm

Proposed-conformer

Dusk came two hours before its time; thunder snarled in the sky.


'What are you laughing at?' asked Wylder, a little snappishly.




Text

This frame was on rollers, so that it could be placed directly underneath the knife.

Ground Truth

Reference (Prompt)


VITS

VALL-E

Proposed-lstm

Proposed-conformer

"And what demonstration do you offer," asked Servadac eagerly, "that it will not happen?"


It began with a thin scratch and ended in a jagged hole.



Paralinguistic Controllability
[Case : $\textbf{s}_{ref}^{tran} \neq \textbf{s}_{ref}^{gen}$]

$\textbf{s}_{ref}^{tran}$ : reference speech for token transducer
$\textbf{s}_{ref}^{gen}$ : reference speech for speech generator


$\textbf{s}_{ref}^{tran}$ controllability

We fix $\textbf{s}_{ref}^{gen}$ and allow only $\textbf{s}_{ref}^{tran}$ to control paralinguistic characteristics.
- Temporal characteristics (e.g. speech rate, prosody) of $\textbf{s}_{ref}^{tran}$ are reflected in the generated speech.
- Global characteristics (e.g. speaker, noise) of $\textbf{s}_{ref}^{gen}$ are reflected in the generated speech.
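The routing of the two references can be sketched with toy stand-ins. This is a minimal illustration under stated assumptions, not the actual models: `Reference`, `token_transducer`, `speech_generator`, and `synthesize` are hypothetical names, semantic tokens are faked as repeated words, and "speech rate" is reduced to a single number.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reference:
    """Toy stand-in for a reference utterance (hypothetical fields)."""
    speaker: str        # global characteristic
    speech_rate: float  # temporal characteristic (1.0 = nominal)


def token_transducer(text: str, ref_tran: Reference) -> List[str]:
    """Stage 1 stand-in: text -> token sequence whose length follows ref_tran."""
    # A slower reference yields more frames per word.
    frames_per_word = max(1, round(2 / ref_tran.speech_rate))
    return [word for word in text.split() for _ in range(frames_per_word)]


def speech_generator(tokens: List[str], ref_gen: Reference) -> Dict[str, object]:
    """Stage 2 stand-in: tokens -> output summary whose speaker follows ref_gen."""
    return {"speaker": ref_gen.speaker, "num_frames": len(tokens)}


def synthesize(text: str, ref_tran: Reference, ref_gen: Reference) -> Dict[str, object]:
    """Run both stages, giving each stage its own reference."""
    return speech_generator(token_transducer(text, ref_tran), ref_gen)


ref_gen = Reference(speaker="spk_A", speech_rate=1.0)  # held fixed
slow = Reference(speaker="spk_B", speech_rate=0.5)
fast = Reference(speaker="spk_C", speech_rate=2.0)

text = "she took her chair"
out_slow = synthesize(text, ref_tran=slow, ref_gen=ref_gen)
out_fast = synthesize(text, ref_tran=fast, ref_gen=ref_gen)
print(out_slow, out_fast)
```

In this toy setup, changing only `ref_tran` changes the output length while the speaker stays that of `ref_gen`, mirroring the two bullet points above.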


Text

She changed color for a moment, and looked at him, with a pretty, reluctant tenderness, as she took her chair.

$\textbf{s}_{ref}^{gen}$


$\textbf{s}_{ref}^{tran}$ 1

synthesized 1

$\textbf{s}_{ref}^{tran}$ 2

synthesized 2

$\textbf{s}_{ref}^{tran}$ 3

synthesized 3

$\textbf{s}_{ref}^{tran}$ 4

synthesized 4


Text

Their mental characteristics are likewise very distinct; chiefly as it would appear in their emotional, but partly in their intellectual faculties.

$\textbf{s}_{ref}^{gen}$


$\textbf{s}_{ref}^{tran}$ 1

synthesized 1

$\textbf{s}_{ref}^{tran}$ 2

synthesized 2

$\textbf{s}_{ref}^{tran}$ 3

synthesized 3

$\textbf{s}_{ref}^{tran}$ 4

synthesized 4


Text

As I went back to the field hospital, I overtook another man walking along.

$\textbf{s}_{ref}^{gen}$


$\textbf{s}_{ref}^{tran}$ 1

synthesized 1

$\textbf{s}_{ref}^{tran}$ 2

synthesized 2

$\textbf{s}_{ref}^{tran}$ 3

synthesized 3

$\textbf{s}_{ref}^{tran}$ 4

synthesized 4



$\textbf{s}_{ref}^{gen}$ controllability

We fix the semantic tokens generated with $\textbf{s}_{ref}^{tran}$ and allow only $\textbf{s}_{ref}^{gen}$ to control paralinguistic characteristics.
- Temporal characteristics (e.g. speech rate, prosody) of $\textbf{s}_{ref}^{tran}$ are reflected in the generated speech.
- Global characteristics (e.g. speaker, noise) of $\textbf{s}_{ref}^{gen}$ are reflected in the generated speech.


Text

Immediately beyond the lip of the ledge the hawk lifted his wings high over his back and struck downward, so that his talons went deep into the water.

$\textbf{s}_{ref}^{tran}$


$\textbf{s}_{ref}^{gen}$ 1

synthesized 1

$\textbf{s}_{ref}^{gen}$ 2

synthesized 2

$\textbf{s}_{ref}^{gen}$ 3

synthesized 3

$\textbf{s}_{ref}^{gen}$ 4

synthesized 4


Text

The fisher of the chutes, meanwhile, was swimming straight downstream for the broken water.

$\textbf{s}_{ref}^{tran}$


$\textbf{s}_{ref}^{gen}$ 1

synthesized 1

$\textbf{s}_{ref}^{gen}$ 2

synthesized 2

$\textbf{s}_{ref}^{gen}$ 3

synthesized 3

$\textbf{s}_{ref}^{gen}$ 4

synthesized 4


Text

The letter that Philip Sterling wrote to Ruth Bolton, on the evening of setting out to seek his fortune in the west, found that young lady in her own father's house in Philadelphia.

$\textbf{s}_{ref}^{tran}$


$\textbf{s}_{ref}^{gen}$ 1

synthesized 1

$\textbf{s}_{ref}^{gen}$ 2

synthesized 2

$\textbf{s}_{ref}^{gen}$ 3

synthesized 3

$\textbf{s}_{ref}^{gen}$ 4

synthesized 4