Unifying One-Shot Voice Conversion and Cloning with Disentangled Speech Representations

Hui Lu, Xixin Wu, Haohan Guo, Songxiang Liu, Zhiyong Wu, Helen Meng

0. Contents

  1. Abstract
  2. Comparison models and their implementations
  3. Voice Conversion Samples
  4. Voice Cloning Samples

1. Abstract

We propose unifying one-shot voice conversion and cloning into a single model that can be end-to-end optimized. To achieve this, we introduce a novel extension to a speech variational auto-encoder (VAE) that disentangles speech into content and speaker representations. Instead of using a fixed Gaussian prior as in the vanilla VAE, we incorporate a learnable text-aware prior as an informative guide for learning the content representation. This results in a content representation with reduced speaker information and more accurate linguistic information. The proposed model can sample the content representation using either the posterior conditioned on speech or the text-aware prior with textual input, enabling one-shot voice conversion and cloning, respectively. Experiments show that the proposed method achieves better or comparable overall performance for one-shot voice conversion and cloning compared to state-of-the-art voice conversion and cloning methods.
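The two sampling paths described above can be illustrated with a minimal sketch. It assumes, for illustration only, that both the speech posterior and the text-aware prior are diagonal Gaussians parameterized by a mean and a log-variance; the function names (`kl_diag_gaussians`, `sample_diag_gaussian`, `sample_content`) and the mode strings are hypothetical and are not taken from the authors' implementation. The KL term shows one plausible way a learnable prior could guide the posterior during training.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions.

    Assumption: q plays the role of the speech posterior and p the
    text-aware prior; with p fixed to N(0, I) this reduces to the
    vanilla VAE regularizer.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_diag_gaussian(mu, logvar, rng):
    """Reparameterized sample: mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def sample_content(mode, posterior=None, text_prior=None, rng=None):
    """Choose where the content representation comes from.

    'conversion' samples from the posterior conditioned on speech;
    'cloning' samples from the text-aware prior conditioned on text.
    Each argument is a hypothetical (mean, log-variance) pair.
    """
    rng = rng or random.Random(0)
    if mode == "conversion":
        mu, logvar = posterior
    elif mode == "cloning":
        mu, logvar = text_prior
    else:
        raise ValueError("mode must be 'conversion' or 'cloning'")
    return sample_diag_gaussian(mu, logvar, rng)


# Toy two-dimensional parameters, chosen only to exercise the functions.
posterior = ([0.3, -1.2], [0.1, -0.5])
text_prior = ([0.0, -1.0], [0.0, 0.0])
kl = kl_diag_gaussians(*posterior, *text_prior)
z_conversion = sample_content("conversion", posterior=posterior)
z_cloning = sample_content("cloning", text_prior=text_prior)
```

In either mode the sampled content representation would then be combined with a speaker representation from the reference utterance before decoding; that step is omitted here because this page does not describe it.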

2. Comparison models and their implementations

The implementations used in our experiments are listed below. All models are evaluated on the same test set.

SC-GlowTTS: Google Colab

CDFSE: https://github.com/thuhcsi/interspeech2022-cdfse-tts

VQMIVC: https://github.com/Wendison/VQMIVC

The official pre-trained HiFi-GAN model listed below is used.

Hifi-GAN (official): VCTK-V1

3. Voice Conversion Samples

Columns: ID, Source, Reference, VQMIVC, SC-GlowTTS, VVC, VVCC(ours)
Rows: audio samples 1 to 14

4. Voice Cloning Samples

  1. He knew now that his absence, for as long as he had to be away, would be covered up and satisfactorily accounted for.
Reference SC-GlowTTS CDFSE VVCC(ours)
  2. After all, the Mormon people regard the advent of the Buchanan army as one of the greatest material blessings ever brought to them.
Reference SC-GlowTTS CDFSE VVCC(ours)
  3. Rodolfo meanwhile having returned home, and having missed the crucifix, guessed who had taken it, but gave himself no concern about it.
Reference SC-GlowTTS CDFSE VVCC(ours)
  4. While we were planning the raid it was intended that I should be one of the party to go into the bank.
Reference SC-GlowTTS CDFSE VVCC(ours)
  5. With thee have I wandered about in the remotest, coldest worlds, like a phantom that voluntarily haunteth winter roofs and snows.
Reference SC-GlowTTS CDFSE VVCC(ours)
  6. Nay, I rather thrilled, Distrusting every light that seemed to gild The onward path, and feared to overlean A finger even.
Reference SC-GlowTTS CDFSE VVCC(ours)
  7. My new friend was a poet as I was; he was an admirer of Italian literature, while I admired the French.
Reference SC-GlowTTS CDFSE VVCC(ours)
  8. O life of this our spring! why fades the lotus of the water? Why fade these children of the spring?
Reference SC-GlowTTS CDFSE VVCC(ours)
  9. Very much of squalor and discomfort will be endured before the last trinket or the last pretense of pecuniary decency is put away.
Reference SC-GlowTTS CDFSE VVCC(ours)
  10. Thus far the universal animal was made in the divine image, but the other animals were not as yet included in him.
Reference SC-GlowTTS CDFSE VVCC(ours)
  11. Well, said Franz with a sigh, do as you please my dear viscount, for your arguments are beyond my powers of refutation.
Reference SC-GlowTTS CDFSE VVCC(ours)
  12. Forthwith the grinning Jester began shrieking with laughter, so that the bells upon his motley cap were all set a jangling.
Reference SC-GlowTTS CDFSE VVCC(ours)
  13. However loudly outward circumstances might oppose this, he now felt, with a certainty which surprised him, that this work was not his own.
Reference SC-GlowTTS CDFSE VVCC(ours)
  14. I will briefly describe them to you, and you shall read the account of them at your leisure in the sacred registers.
Reference SC-GlowTTS CDFSE VVCC(ours)
  15. One perceives, without understanding it, a hideous murmur, sounding almost like human accents, but more nearly resembling a howl than an articulate word.
Reference SC-GlowTTS CDFSE VVCC(ours)