A Unified Framework for One-Shot Voice Conversion and Cloning Using Variational Auto-Encoder

0. Contents

  1. Abstract
  2. Comparison models and their implementations
  3. Voice Conversion Samples
  4. Voice Cloning Samples

1. Abstract

We propose to unify one-shot voice conversion and cloning in a single model that can be optimized end-to-end. We adopt a variational auto-encoder (VAE) to disentangle speech into the content and speaker representations. Instead of imposing a fixed Gaussian prior for the content representation learning, we incorporate a novel learnable text-aware prior as an informative guide. In this way, the content representation can be better learned with less entangled speaker information and more linguistic information encoded. Since the content representation can be either sampled from the speech-conditioned posterior or the text-aware prior, voice conversion and cloning can be respectively achieved by combining the content representation with the speaker representation. We evaluate the proposed method for both one-shot voice conversion and cloning. Both the objective and subjective evaluations show the effectiveness of the proposed framework.
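The two inference paths described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation. It assumes that both the speech-conditioned posterior and the text-aware prior are diagonal Gaussians defined over the same number of frames, and that the decoder input is a simple concatenation of content and speaker vectors. All names (`sample_gaussian`, `kl_diag_gaussians`, `combine`, `DIM`) are hypothetical, and the random arrays stand in for network outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4      # hypothetical size of the content and speaker vectors
FRAMES = 6   # hypothetical number of frames (assumed equal for prior and posterior)


def sample_gaussian(mean, log_var):
    """Draw a reparameterized sample from a diagonal Gaussian."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_diag_gaussians(mean_q, log_var_q, mean_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over all elements."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    kl = 0.5 * (log_var_p - log_var_q + (var_q + (mean_q - mean_p) ** 2) / var_p - 1.0)
    return float(kl.sum())


def combine(content, speaker):
    """Stand-in for the decoder input: repeat the speaker vector on every frame."""
    tiled = np.tile(speaker, (content.shape[0], 1))
    return np.concatenate([content, tiled], axis=1)


# Random placeholders for encoder outputs (assumption: not real model values).
post_mean = rng.standard_normal((FRAMES, DIM))    # speech-conditioned posterior
post_log_var = np.zeros((FRAMES, DIM))
prior_mean = rng.standard_normal((FRAMES, DIM))   # text-aware prior
prior_log_var = np.zeros((FRAMES, DIM))
speaker = rng.standard_normal(DIM)                # from one reference utterance

# Voice conversion: content sampled from the posterior of the source speech.
vc_input = combine(sample_gaussian(post_mean, post_log_var), speaker)

# Voice cloning: content sampled from the text-aware prior.
clone_input = combine(sample_gaussian(prior_mean, prior_log_var), speaker)

# With a learnable prior, the regularizer compares posterior and prior
# instead of comparing the posterior to a fixed standard Gaussian.
kl = kl_diag_gaussians(post_mean, post_log_var, prior_mean, prior_log_var)
```

In this sketch the only difference between conversion and cloning is which distribution supplies the content sample; the speaker vector and the combination step are shared.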

2. Comparison models and their implementations

The implementations used in our experiments are listed below. All models are evaluated on the same test set.

SC-GlowTTS: Google Colab

CDFSE: https://github.com/thuhcsi/interspeech2022-cdfse-tts

VQMIVC: https://github.com/Wendison/VQMIVC

The official HiFi-GAN pre-trained model below is used.

HiFi-GAN (official): VCTK-V1

3. Voice Conversion Samples

ID Source Reference VQMIVC SC-GlowTTS VVC VVCC(ours)
1
2
3
4
5
6
7
8
9
10
11
12
13
14

4. Voice Cloning Samples

  1. He knew now that his absence, for as long as he had to be away, would be covered up and satisfactorily accounted for.
Reference SC-GlowTTS CDFSE VVCC(ours)
  2. After all, the Mormon people regard the advent of the Buchanan army as one of the greatest material blessings ever brought to them.
Reference SC-GlowTTS CDFSE VVCC(ours)
  3. Rodolfo meanwhile having returned home, and having missed the crucifix, guessed who had taken it, but gave himself no concern about it.
Reference SC-GlowTTS CDFSE VVCC(ours)
  4. While we were planning the raid it was intended that I should be one of the party to go into the bank.
Reference SC-GlowTTS CDFSE VVCC(ours)
  5. With thee have I wandered about in the remotest, coldest worlds, like a phantom that voluntarily haunteth winter roofs and snows.
Reference SC-GlowTTS CDFSE VVCC(ours)
  6. Nay, I rather thrilled, Distrusting every light that seemed to gild The onward path, and feared to overlean A finger even.
Reference SC-GlowTTS CDFSE VVCC(ours)
  7. My new friend was a poet as I was; he was an admirer of Italian literature, while I admired the French.
Reference SC-GlowTTS CDFSE VVCC(ours)
  8. O life of this our spring! why fades the lotus of the water? Why fade these children of the spring?
Reference SC-GlowTTS CDFSE VVCC(ours)
  9. Very much of squalor and discomfort will be endured before the last trinket or the last pretense of pecuniary decency is put away.
Reference SC-GlowTTS CDFSE VVCC(ours)
  10. Thus far the universal animal was made in the divine image, but the other animals were not as yet included in him.
Reference SC-GlowTTS CDFSE VVCC(ours)
  11. Well, said Franz with a sigh, do as you please my dear viscount, for your arguments are beyond my powers of refutation.
Reference SC-GlowTTS CDFSE VVCC(ours)
  12. Forthwith the grinning Jester began shrieking with laughter, so that the bells upon his motley cap were all set a jangling.
Reference SC-GlowTTS CDFSE VVCC(ours)
  13. However loudly outward circumstances might oppose this, he now felt, with a certainty which surprised him, that this work was not his own.
Reference SC-GlowTTS CDFSE VVCC(ours)
  14. I will briefly describe them to you, and you shall read the account of them at your leisure in the sacred registers.
Reference SC-GlowTTS CDFSE VVCC(ours)
  15. One perceives, without understanding it, a hideous murmur, sounding almost like human accents, but more nearly resembling a howl than an articulate word.
Reference SC-GlowTTS CDFSE VVCC(ours)