UTILIZING NEURAL TRANSDUCERS FOR TWO-STAGE TEXT-TO-SPEECH VIA SEMANTIC TOKEN PREDICTION

Zero-Shot Adaptive TTS
[Case : $\textbf{s}_{ref}^{tran} = \textbf{s}_{ref}^{gen}$]

$\textbf{s}_{ref}^{tran}$ : reference speech for token transducer
$\textbf{s}_{ref}^{gen}$ : reference speech for speech generator

The models are trained on LibriTTS (train-clean-100, train-clean-360, train-other-500).
Audio samples are drawn from the test-clean and test-other subsets, whose speakers do not overlap with those in the training set.

Text

I allude to the goddess.

Ground Truth

Reference (Prompt)

VITS

VALL-E

Proposed-lstm

Proposed-conformer

Larch, by refusing to appear, practically admitted the charges against him and did not oppose the separation.

Once across that, they would be out of his power, but it seemed impossible to cross.

Text

Bartley leaned over her shoulder, without touching her, and whispered in her ear: "You are giving me a chance?"

Ground Truth

Reference (Prompt)

VITS

VALL-E

Proposed-lstm

Proposed-conformer

Dusk came two hours before its time; thunder snarled in the sky.

'What are you laughing at?' asked Wylder, a little snappishly.

Text

This frame was on rollers, so that it could be placed directly underneath the knife.

Ground Truth

Reference (Prompt)

VITS

VALL-E

Proposed-lstm

Proposed-conformer

"And what demonstration do you offer," asked Servadac eagerly, "that it will not happen?"

It began with a thin scratch and ended in a jagged hole.

Paralinguistic Controllability
[Case : $\textbf{s}_{ref}^{tran} \neq \textbf{s}_{ref}^{gen}$]

$\textbf{s}_{ref}^{tran}$ : reference speech for token transducer
$\textbf{s}_{ref}^{gen}$ : reference speech for speech generator

$\textbf{s}_{ref}^{tran}$ controllability

We fix $\textbf{s}_{ref}^{gen}$ and allow only $\textbf{s}_{ref}^{tran}$ to control paralinguistic characteristics.
- Temporal characteristics (e.g. speech rate, prosody) of $\textbf{s}_{ref}^{tran}$ are reflected in the generated speech.
- Global characteristics (e.g. speaker, noise) of $\textbf{s}_{ref}^{gen}$ are reflected in the generated speech.
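
The setup above can be sketched as a toy two-stage pipeline. This is only an illustration of the control flow, not the models themselves: `Reference`, `token_transducer`, `speech_generator`, and all field names are hypothetical placeholders, and speech rate is reduced to a single "frames per symbol" number.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reference:
    """Toy stand-in for a reference utterance (hypothetical fields)."""
    speaker: str             # global characteristic
    frames_per_symbol: int   # temporal characteristic (crude speech rate)


def token_transducer(text: str, s_ref_tran: Reference) -> List[str]:
    """Placeholder for stage 1: text -> semantic tokens.

    The token timing follows s_ref_tran (here: symbol repetition count).
    """
    tokens: List[str] = []
    for symbol in text.replace(" ", ""):
        tokens.extend([symbol] * s_ref_tran.frames_per_symbol)
    return tokens


def speech_generator(tokens: List[str], s_ref_gen: Reference) -> Dict[str, object]:
    """Placeholder for stage 2: semantic tokens -> speech.

    Global traits follow s_ref_gen; duration is inherited from the tokens.
    """
    return {"speaker": s_ref_gen.speaker, "num_frames": len(tokens)}


def synthesize(text: str, s_ref_tran: Reference, s_ref_gen: Reference) -> Dict[str, object]:
    return speech_generator(token_transducer(text, s_ref_tran), s_ref_gen)


# s_ref_gen is fixed; only s_ref_tran varies between syntheses.
s_ref_gen = Reference(speaker="A", frames_per_symbol=3)
tran_refs = [Reference("B", 2), Reference("C", 4)]
outputs = [synthesize("hi there", ref, s_ref_gen) for ref in tran_refs]
```

In this toy version every output keeps speaker "A" from $\textbf{s}_{ref}^{gen}$, while the number of frames changes with each $\textbf{s}_{ref}^{tran}$.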

Text

She changed color for a moment, and looked at him, with a pretty, reluctant tenderness, as she took her chair.

$\textbf{s}_{ref}^{gen}$

$\textbf{s}_{ref}^{tran}$ 1

synthesized 1

$\textbf{s}_{ref}^{tran}$ 2

synthesized 2

$\textbf{s}_{ref}^{tran}$ 3

synthesized 3

$\textbf{s}_{ref}^{tran}$ 4

synthesized 4

Text

Their mental characteristics are likewise very distinct; chiefly as it would appear in their emotional, but partly in their intellectual faculties.

$\textbf{s}_{ref}^{gen}$

$\textbf{s}_{ref}^{tran}$ 1

synthesized 1

$\textbf{s}_{ref}^{tran}$ 2

synthesized 2

$\textbf{s}_{ref}^{tran}$ 3

synthesized 3

$\textbf{s}_{ref}^{tran}$ 4

synthesized 4

Text

As I went back to the field hospital, I overtook another man walking along.

$\textbf{s}_{ref}^{gen}$

$\textbf{s}_{ref}^{tran}$ 1

synthesized 1

$\textbf{s}_{ref}^{tran}$ 2

synthesized 2

$\textbf{s}_{ref}^{tran}$ 3

synthesized 3

$\textbf{s}_{ref}^{tran}$ 4

synthesized 4

$\textbf{s}_{ref}^{gen}$ controllability

We fix the semantic tokens generated from $\textbf{s}_{ref}^{tran}$ and allow only $\textbf{s}_{ref}^{gen}$ to control paralinguistic characteristics.
- Temporal characteristics (e.g. speech rate, prosody) of $\textbf{s}_{ref}^{tran}$ are reflected in the generated speech.
- Global characteristics (e.g. speaker, noise) of $\textbf{s}_{ref}^{gen}$ are reflected in the generated speech.
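
As a rough illustration of this second setting, the snippet below freezes one token sequence and reuses it with several generator references. Everything here is an assumption made for the example: the token values, the dictionary fields, and the `speech_generator` stub are hypothetical and stand in for the real second-stage model.

```python
def speech_generator(tokens, s_ref_gen):
    """Hypothetical stage-2 stub: timing comes from `tokens`,
    global traits (speaker, noise) come from s_ref_gen."""
    return {
        "speaker": s_ref_gen["speaker"],
        "noise": s_ref_gen["noise"],
        "num_frames": len(tokens),
    }


# Semantic tokens predicted once from the text and s_ref_tran, then frozen.
# The values are arbitrary toy token ids.
fixed_tokens = (17, 17, 4, 4, 4, 92, 92, 8)

# Only s_ref_gen changes between syntheses.
gen_refs = [
    {"speaker": "spk1", "noise": "clean"},
    {"speaker": "spk2", "noise": "noisy"},
    {"speaker": "spk3", "noise": "clean"},
]
outputs = [speech_generator(fixed_tokens, ref) for ref in gen_refs]
```

Because the tokens are shared, all outputs have the same length in this sketch, and only the speaker and noise fields differ.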

Text

Immediately beyond the lip of the ledge the hawk lifted his wings high over his back and struck downward, so that his talons went deep into the water.

$\textbf{s}_{ref}^{tran}$

$\textbf{s}_{ref}^{gen}$ 1

synthesized 1

$\textbf{s}_{ref}^{gen}$ 2

synthesized 2

$\textbf{s}_{ref}^{gen}$ 3

synthesized 3

$\textbf{s}_{ref}^{gen}$ 4

synthesized 4

Text

The fisher of the chutes, meanwhile, was swimming straight downstream for the broken water.

$\textbf{s}_{ref}^{tran}$

$\textbf{s}_{ref}^{gen}$ 1

synthesized 1

$\textbf{s}_{ref}^{gen}$ 2

synthesized 2

$\textbf{s}_{ref}^{gen}$ 3

synthesized 3

$\textbf{s}_{ref}^{gen}$ 4

synthesized 4

Text

The letter that Philip Sterling wrote to Ruth Bolton, on the evening of setting out to seek his fortune in the west, found that young lady in her own father's house in Philadelphia.

$\textbf{s}_{ref}^{tran}$

$\textbf{s}_{ref}^{gen}$ 1

synthesized 1

$\textbf{s}_{ref}^{gen}$ 2

synthesized 2

$\textbf{s}_{ref}^{gen}$ 3

synthesized 3

$\textbf{s}_{ref}^{gen}$ 4

synthesized 4