A Unified Framework for One-Shot Voice Conversion and Cloning Using Variational Auto-Encoder

0. Contents

  1. Abstract
  2. Comparison models and their implementations
  3. Voice Conversion Samples
  4. Voice Cloning Samples

1. Abstract

We propose to unify one-shot voice conversion and cloning in a single model that can be optimized end-to-end. We adopt a variational auto-encoder (VAE) to disentangle speech into content and speaker representations. Instead of imposing a fixed Gaussian prior on the content representation, we incorporate a novel learnable text-aware prior as an informative guide. In this way, the content representation is learned with less entangled speaker information and more linguistic information. Since the content representation can be sampled from either the speech-conditioned posterior or the text-aware prior, voice conversion and cloning are achieved, respectively, by combining that content representation with the speaker representation. We evaluate the proposed method on both one-shot voice conversion and cloning. Both objective and subjective evaluations show the effectiveness of the proposed framework.
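The role of the text-aware prior can be illustrated with a small numeric sketch. It is not the authors' implementation: it assumes that both the speech-conditioned posterior and the text-aware prior are diagonal Gaussians, and every name and number in it (`kl_diag_gaussians`, `content_representation`, the 3-dimensional statistics) is hypothetical. It shows the KL term measured against a text-derived prior rather than N(0, I), and how the content representation could be drawn from the posterior for conversion or from the prior for cloning.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_diag_gaussian(mu, logvar, rng):
    """Reparameterized sample: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def content_representation(mode, posterior, prior, rng):
    """Choose the content source: posterior for conversion, prior for cloning."""
    if mode == "conversion":
        mu, logvar = posterior
    elif mode == "cloning":
        mu, logvar = prior
    else:
        raise ValueError(f"unknown mode: {mode}")
    return sample_diag_gaussian(mu, logvar, rng)


# Hypothetical (mean, log-variance) statistics for a single frame.
posterior = ([0.5, -1.0, 0.2], [-1.0, -0.5, 0.0])    # assumed output of a speech encoder
text_prior = ([0.4, -0.8, 0.0], [-0.8, -0.5, 0.1])   # assumed output of a text encoder
fixed_prior = ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])     # standard normal N(0, I)

# KL term against the learnable text-aware prior vs. the fixed prior.
kl_text = kl_diag_gaussians(*posterior, *text_prior)
kl_fixed = kl_diag_gaussians(*posterior, *fixed_prior)

rng = random.Random(0)
z_conversion = content_representation("conversion", posterior, text_prior, rng)
z_cloning = content_representation("cloning", posterior, text_prior, rng)
```

In either mode the sampled content vector would then be passed to a decoder together with the speaker representation extracted from the reference utterance; that decoder is outside the scope of this sketch.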

2. Comparison models and their implementations

The models we compared against in our experiments are listed below. We use the same test set for all of them.

SC-GlowTTS: Google Colab

CDFSE: https://github.com/thuhcsi/interspeech2022-cdfse-tts

VQMIVC: https://github.com/Wendison/VQMIVC

VAE-GP: http://arxiv.org/abs/2210.13771

We use the official pre-trained HiFi-GAN model listed below.

HiFi-GAN (official): VCTK-V1

3. Voice Conversion Samples

ID Source Reference VQMIVC SC-GlowTTS VAE-GP VAE-TP(ours)
1
2
3
4
5
6
7
8
9
10
11
12

4. Voice Cloning Samples

(1) Its creed provides for the protection of all men in their rights of worship according to the dictates of conscience.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(2) He thought he detected a pleasant smell of herbs, like the potpourri his mother had in bowls in their house.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(3) Poor Rachel! her nature recoiled from deceit, and she told, at all events, as much of the truth as she dared.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(4) She lay awake very long this night, planning how to lessen the evil influence of their Milton life on her mother.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(5) When we came up I told Miller to shut the bank door, which they had left open in their hurry.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(6) Rodolfo meanwhile having returned home, and having missed the crucifix, guessed who had taken it, but gave himself no concern about it.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(7) Only, even though love has wholly disappeared, she still claims consideration, and Althea did not wish to lose Hermon’s regard.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(8) Very much of squalor and discomfort will be endured before the last trinket or the last pretense of pecuniary decency is put away.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(9) So the Castrato began to speak him fair and say to him, O my lord, take this purse and go with me.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(10) The marquis of Worcester, a man past eighty four, was the last in England that submitted to the authority of the parliament.

Reference SC-GlowTTS CDFSE VAE-TP(ours)

(11) I will briefly describe them to you, and you shall read the account of them at your leisure in the sacred registers.

Reference SC-GlowTTS CDFSE VAE-TP(ours)