Disentangled Speech Representation Learning for
One-Shot Cross-lingual Voice Conversion Using β-VAE

Abstract

We propose to unify one-shot voice conversion and cloning in a single model that can be optimized end-to-end. We adopt a variational auto-encoder (VAE) to disentangle speech into content and speaker representations. Instead of imposing a fixed Gaussian prior for content representation learning, we incorporate a novel learnable text-aware prior as an informative guide. In this way, the content representation can be better learned, with less entangled speaker information and more linguistic information encoded. Since the content representation can be sampled either from the speech-conditioned posterior or from the text-aware prior, voice conversion and voice cloning can be achieved, respectively, by combining the corresponding content representation with the speaker representation. We evaluate the proposed method on both one-shot voice conversion and cloning. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework.
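
To make the abstract concrete, below is a minimal, self-contained Python (PyTorch) sketch of a β-VAE with a learnable text-aware prior for content/speaker disentanglement. This is not the authors' implementation: all module and variable names (BetaVAEVC, content_enc, speaker_enc, text_prior), the GRU architectures, the 80-dimensional mel input, and the assumption of frame-aligned text ids are illustrative assumptions only.

# Minimal sketch (assumed architecture, not the authors' code): a beta-VAE
# whose content posterior is regularized toward a learnable text-aware prior
# instead of a fixed N(0, I).
import torch
import torch.nn as nn

class BetaVAEVC(nn.Module):
    def __init__(self, mel_dim=80, content_dim=64, speaker_dim=128,
                 text_vocab=100, beta=0.1):
        super().__init__()
        self.beta = beta
        # Speech-conditioned posterior q(z_c | mel): frame-level mean/log-var.
        self.content_enc = nn.GRU(mel_dim, 2 * content_dim, batch_first=True)
        # Utterance-level speaker embedding (mean-pooled over time).
        self.speaker_enc = nn.GRU(mel_dim, speaker_dim, batch_first=True)
        # Learnable text-aware prior p(z_c | text): mean/log-var predicted
        # from (assumed frame-aligned) text/phoneme ids.
        self.text_prior = nn.Sequential(
            nn.Embedding(text_vocab, 128),
            nn.Linear(128, 2 * content_dim),
        )
        self.decoder = nn.GRU(content_dim + speaker_dim, mel_dim,
                              batch_first=True)

    @staticmethod
    def reparameterize(mu, logvar):
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

    @staticmethod
    def kl(mu_q, logvar_q, mu_p, logvar_p):
        # KL(q || p) between diagonal Gaussians, summed over latent dims,
        # averaged over batch and time.
        return 0.5 * (
            logvar_p - logvar_q
            + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
            - 1.0
        ).sum(-1).mean()

    def forward(self, mel, text_ids):
        # Content posterior from speech.
        h_c, _ = self.content_enc(mel)                 # (B, T, 2*content_dim)
        mu_q, logvar_q = h_c.chunk(2, dim=-1)
        z_c = self.reparameterize(mu_q, logvar_q)
        # Speaker embedding from the same (or a reference) utterance.
        h_s, _ = self.speaker_enc(mel)
        z_s = h_s.mean(dim=1, keepdim=True).expand(-1, mel.size(1), -1)
        # Text-aware prior replaces the fixed Gaussian prior.
        mu_p, logvar_p = self.text_prior(text_ids).chunk(2, dim=-1)
        # Reconstruction plus beta-weighted KL to the text-aware prior.
        recon, _ = self.decoder(torch.cat([z_c, z_s], dim=-1))
        loss = nn.functional.l1_loss(recon, mel) \
            + self.beta * self.kl(mu_q, logvar_q, mu_p, logvar_p)
        return recon, loss

At inference time, one-shot voice conversion would take z_c from the speech-conditioned posterior of the source utterance and z_s from a single target-speaker utterance, whereas voice cloning would instead draw z_c from the text-aware prior given input text; both paths reuse the same decoder.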


Converted speech samples

Experimental conditions

  • All compared models are trained on VCTK [1]; the Mandarin source and target utterances are taken from AISHELL-3 [2].
  • All converted utterances are generated with a HiFi-GAN vocoder [3] trained on VCTK.
  • All source and target utterances listed below are unseen during training.

Compared models

  • VQMIVC: Baseline one-shot voice conversion method based on vector quantization and mutual information minimization [4].
  • AdIN-VC: Baseline one-shot voice conversion method that separates speaker and content representations with instance normalization [5] (a minimal illustration of this trick follows this list).
  • β-VAEVC: The proposed β-VAE-based voice conversion method.
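
For reference, here is a brief illustrative sketch of the instance-normalization trick underlying AdIN-VC-style baselines [5]: per-channel statistics are stripped from the content features (removing speaker-dependent global characteristics) and replaced with statistics derived from the target speaker. The function and tensor names are assumptions for illustration, not the baseline's actual code.

# Illustrative adaptive instance normalization (assumed helper, not from [5]).
import torch

def adain(content_feat, target_mean, target_std, eps=1e-5):
    """content_feat: (B, C, T); target_mean/target_std: (B, C, 1)."""
    mean = content_feat.mean(dim=2, keepdim=True)
    std = content_feat.std(dim=2, keepdim=True) + eps
    normalized = (content_feat - mean) / std        # speaker-agnostic content
    return normalized * target_std + target_mean    # re-stylized for target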

Results

English → English

Source Target VQMIVC AdIN-VC β-VAEVC
Sample 1
Sample 2
Sample 3
Sample 4

Mandarin → Mandarin

Source Target VQMIVC AdIN-VC β-VAEVC
Sample 1
Sample 2
Sample 3
Sample 4

English → Mandarin

Source Target VQMIVC AdIN-VC β-VAEVC
Sample 1
Sample 2
Sample 3
Sample 4

Mandarin → English

Source Target VQMIVC AdIN-VC β-VAEVC
Sample 1
Sample 2
Sample 3
Sample 4

References

[1] Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92), 2019.

[2] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. AISHELL-3: A Multi-Speaker Mandarin TTS Corpus. In Proc. Interspeech 2021, pp. 2756–2760.

[3] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. In Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.

[4] Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, and Helen Meng. VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion. In Proc. Interspeech 2021, pp. 1344–1348.

[5] Ju-Chieh Chou and Hung-yi Lee. One-Shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization. In Proc. Interspeech 2019, pp. 664–668.