VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis

0. Contents

  1. Abstract
  2. Comparison models and their implementations
  3. Synthesized samples -- Comparison with other models
  4. Synthesized samples -- Different reduction factors
  5. Synthesized samples -- With vs. without causality mask
  6. Attention alignment convergence dynamics
  7. Other results

1. Abstract

This paper describes a variational auto-encoder based non-autoregressive text-to-speech (VAENAR-TTS) model. Autoregressive TTS (AR-TTS) models based on the sequence-to-sequence architecture can generate high-quality speech, but their sequential decoding process can be time-consuming. Recently, non-autoregressive TTS (NAR-TTS) models have been shown to be more efficient thanks to their parallel decoding process. However, these NAR-TTS models rely on phoneme-level durations to generate a hard alignment between the text and the spectrogram. Obtaining duration labels, either through forced alignment or knowledge distillation, is cumbersome. Furthermore, hard alignment based on phoneme expansion can degrade the naturalness of the synthesized speech. In contrast, the proposed VAENAR-TTS model is an end-to-end approach that does not require phoneme-level durations. The model contains no recurrent structures and is completely non-autoregressive in both the training and inference phases. Based on the VAE architecture, the alignment information is encoded in the latent variable, and the attention-based soft alignment between the text and the latent variable is used in the decoder to reconstruct the spectrogram. Experiments show that VAENAR-TTS achieves state-of-the-art synthesis quality, while its synthesis speed is comparable with that of other NAR-TTS models.

Source code will be released soon!
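Until the release, the sketch below illustrates the high-level flow described in the abstract: a posterior network encodes the spectrogram into a latent variable via the reparameterization trick, the latent attends to the text encoder output (the soft alignment), and a decoder reconstructs the spectrogram in parallel. All module choices and names here are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class VAENARSketch(nn.Module):
        """Minimal sketch of the VAE-based NAR-TTS flow; placeholder modules."""

        def __init__(self, n_phones, d_model=256, n_mels=80):
            super().__init__()
            self.text_encoder = nn.Embedding(n_phones, d_model)   # stand-in for the text encoder stack
            self.posterior = nn.Linear(n_mels, 2 * d_model)       # q(z|mel): mean and log-variance
            self.cross_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
            self.decoder = nn.Linear(d_model, n_mels)             # stand-in for the decoder stack

        def forward(self, phone_ids, mel):
            text = self.text_encoder(phone_ids)                   # (B, T_text, D)
            mu, logvar = self.posterior(mel).chunk(2, dim=-1)     # (B, T_frame, D) each
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
            # Soft alignment: latent frames attend to the text; no durations needed.
            aligned, attn = self.cross_attn(z, text, text)
            return self.decoder(aligned), mu, logvar, attn

Training such a model maximizes the usual ELBO: a spectrogram reconstruction term plus a KL term between the posterior and the prior over the latent; at inference the latent is drawn from the prior, so decoding stays fully parallel.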

2. Comparison models and their implementations

Listed below are the implementations used in our experiments. These models are trained and evaluated with our own dataset-split configuration, while all other settings are kept at their defaults.

Tacotron2: https://github.com/NVIDIA/tacotron2

FastSpeech2: https://github.com/ming024/FastSpeech2

BVAE-TTS (official): https://github.com/LEEYOONHYUNG/BVAE-TTS

Glow-TTS (official): https://github.com/jaywalnut310/glow-tts

The official HiFi-GAN implementation below is used as the vocoder.

HiFi-GAN (official): https://github.com/jik876/hifi-gan

3. Synthesized samples -- Comparison with other models

Listed below are the samples synthesized for the subjective evaluation.

LJ003-0305. The provision of more baths was also suggested, and the daily sweeping out of the prison.

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ009-0046. But the attempt fails; he trembles, his knees knock together, and his head droops as he enters the condemned pew.

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ005-0100. For this purpose it kept up an extensive correspondence with all parts of the kingdom, and circulated queries to be answered in detail,

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ006-0206. and publications which in these days would have been made the subject of a criminal prosecution.

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ007-0177. We trust, however, that the day is at hand when this stain will be removed from the character of the city of London,

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ013-0081. Banks and bankers continued to be victimized.

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ038-0009. When he heard police sirens, he, quote, looked up and saw the man enter the lobby, end quote.

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ041-0099. Powers believed that when Oswald arrived in Japan he acquired a girlfriend, quote,

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ043-0071. His performance for that company was satisfactory.

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

LJ047-0234. Hosty’s initial reaction on hearing that Oswald was a suspect in the assassination, was, quote, shock

BVAE-TTS FastSpeech2 Glow-TTS Tacotron 2 VAENAR-TTS (ours)

4. Synthesized samples -- Different reduction factors

Listed below are samples synthesized by models with different fixed reduction factors (RF). The evaluation results show that RF3 and RF4 are comparable, and both are much better than RF5 in terms of speech naturalness.
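A reduction factor of r means the decoder operates on groups of r consecutive spectrogram frames, shortening the sequence that must be aligned at the cost of coarser temporal granularity. The sketch below shows one common way to implement this grouping, stacking every r frames along the channel axis; this is an assumption for illustration, not necessarily how the released model does it.

    import torch

    def reduce_frames(mel: torch.Tensor, r: int) -> torch.Tensor:
        """Group consecutive spectrogram frames by reduction factor r.

        mel: (batch, n_frames, n_mels). Pads n_frames to a multiple of r,
        then stacks every r frames along the channel axis. Illustrative only.
        """
        b, t, d = mel.shape
        pad = (-t) % r                                     # frames needed to reach a multiple of r
        mel = torch.nn.functional.pad(mel, (0, 0, 0, pad))
        return mel.reshape(b, (t + pad) // r, r * d)       # (batch, ceil(n_frames/r), r*n_mels)

    # Example: with r=4, 250 frames of 80-dim mel become 63 frames of 320 dims.
    print(reduce_frames(torch.randn(1, 250, 80), r=4).shape)  # torch.Size([1, 63, 320])

With fewer target positions per utterance, the attention has fewer chances to mis-align, which matches the convergence behavior shown in Section 6, while too large an r (RF5 here) discards temporal detail and hurts naturalness.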

LJ003-0305. The provision of more baths was also suggested, and the daily sweeping out of the prison.

RF5 RF4 RF3

LJ009-0046. But the attempt fails; he trembles, his knees knock together, and his head droops as he enters the condemned pew.

RF5 RF4 RF3

LJ005-0100. For this purpose it kept up an extensive correspondence with all parts of the kingdom, and circulated queries to be answered in detail,

RF5 RF4 RF3

LJ006-0206. and publications which in these days would have been made the subject of a criminal prosecution.

RF5 RF4 RF3

LJ007-0177. We trust, however, that the day is at hand when this stain will be removed from the character of the city of London,

RF5 RF4 RF3

5. Synthesized samples -- With vs. without causality mask

Listed below are samples synthesized by VAENAR-TTS with and without the causality mask in the self-attention structures stacked on the frame-level features. Repetition issues are very common in the model without the causality mask; pay attention to the word(s) highlighted in red.
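For reference, a causality mask simply forbids each frame-level position from attending to later positions. A minimal sketch with torch.nn.MultiheadAttention follows; the mask convention (boolean, True = disallowed) is PyTorch's, and the surrounding architecture is assumed rather than taken from the released code.

    import torch

    def causality_mask(n_frames: int) -> torch.Tensor:
        """Boolean mask where True marks disallowed (future) positions,
        so position i may only attend to positions j <= i."""
        return torch.triu(torch.ones(n_frames, n_frames, dtype=torch.bool), diagonal=1)

    # Without this mask every frame can attend to every other frame,
    # which is where the repeated-word artifacts below tend to arise.
    attn = torch.nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
    x = torch.randn(1, 100, 256)                      # (batch, n_frames, d_model)
    y, _ = attn(x, x, x, attn_mask=causality_mask(100))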

LJ001-0133. One very important matter in “setting up” for fine printing is the “spacing,” that is, the lateral distance of words from one another.

W/ Causality Mask W/O Causality Mask

LJ003-0238. and they were exacted to relieve a rich corporation from paying for the maintenance of their own prison.

W/ Causality Mask W/O Causality Mask

LJ006-0006. I shall now return to the great jail of the city of London, and give a more detailed account of its condition and inner life

W/ Causality Mask W/O Causality Mask

LJ009-0208. erected on the cart, about four feet high at the head, and gradually sloping towards the horse, giving a full view of the body,

W/ Causality Mask W/O Causality Mask

LJ014-0054. a maidservant, Sarah Thomas, murdered her mistress, an aged woman, by beating out her brains with a stone.

W/ Causality Mask W/O Causality Mask

6. Attention alignment convergence dynamics

Below are the attention alignment maps obtained during the training process of three models with different fixed reduction factors: 5, 4, and 3. Note that the alignment converges faster for the model with a larger reduction factor.
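Alignment maps like the ones referenced below can be rendered from the decoder's attention weights with a few lines of matplotlib. This sketch uses assumed axis conventions, which may differ from the original figures.

    import matplotlib.pyplot as plt

    def plot_alignment(attn, path):
        """Render a text-vs-frame attention map (attn: 2-D array of
        weights, shape (n_frames, n_text)). Illustrative conventions."""
        fig, ax = plt.subplots()
        im = ax.imshow(attn, aspect="auto", origin="lower", interpolation="none")
        ax.set_xlabel("Text position")
        ax.set_ylabel("Spectrogram frame (after reduction)")
        fig.colorbar(im, ax=ax)
        fig.savefig(path)
        plt.close(fig)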

Reduction factor = 3

Reduction factor = 4

Reduction factor = 5

7. Other results

Synthesized results in Mandarin

Trained on the DataBaker dataset.

003095. 失恋的人特别喜欢往人烟罕至的角落里钻。(People nursing a broken heart especially like to hide away in deserted corners.)

003684. 天安门广场,一派春意盎然。(Tiananmen Square is brimming with the spirit of spring.)

004539. 网友“甄巍峰”说,希望规范用工制度和用工秩序。(Netizen “Zhen Weifeng” said he hopes the employment system and labor practices will be regulated.)

005134. 我的性格也是逆来顺受,从小胆小怕事,凡事都忍让。(My temperament is one of quiet resignation; since childhood I have been timid and yielding in everything.)

007732. 乘客对车上负责讲解的导乘员普遍表示满意。(Passengers were generally satisfied with the onboard attendants responsible for the commentary.)