Disentangled Speech Representation Learning for One-Shot Voice Conversion and Cloning

0. Contents

  1. Abstract
  2. Samples

1. Abstract

Learning disentangled speaker and content representations is a fundamental problem in speech representation learning. Existing methods typically apply an auto-encoder to perform the disentanglement, where the key is to eliminate speaker information from the content representation. Our solution is to adopt the recently proposed $\beta$-VAE and use the weight $\beta$ on the Kullback-Leibler divergence to restrict the amount of information captured by the content representation. In addition to the evidence lower bound objective, we adopt a discriminative objective to regularize the speaker representation. We show quantitatively that with a proper $\beta$, speaker identity can be largely eliminated from the content representation while linguistic information is mostly retained. Qualitative evaluations show that one-shot voice conversion, a direct application of disentangled representation learning, can be achieved with satisfactory performance. In addition, we show that text can be further utilized to regularize the content representation, rendering it more speaker-invariant and content-related. We implement this with a text-conditioned prior over the content representation, which enables high-quality one-shot voice cloning as a byproduct. Our code will be released when the paper is published.
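The $\beta$-weighted objective described above can be sketched as follows. This is a minimal, self-contained illustration of the standard $\beta$-VAE loss with a closed-form Gaussian KL term; the reconstruction error and the encoder outputs (`mu`, `logvar`) are placeholders, and the discriminative speaker objective is omitted. It is not the actual model implementation.

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions.

    Closed form per dimension: 0.5 * (exp(logvar) + mu^2 - 1 - logvar).
    """
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

def beta_vae_loss(recon_err, mu, logvar, beta):
    """Negative ELBO with KL weight beta.

    A small beta lets the content encoder keep more information; a larger
    beta squeezes information (e.g. speaker identity) out of the content
    representation, which is the mechanism the abstract relies on.
    """
    return recon_err + beta * gaussian_kl(mu, logvar)
```

Sweeping `beta` over values like those in the samples table (1e-3 to 3e-2) trades reconstruction quality against how much information the content code can carry.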

2. Samples

[Audio sample table: each row provides the source and reference recordings together with conversions from the VQMIVC baseline, VC (β = 1e-3, 1.2e-2, 3e-2), VCT (β = 2.5e-5, 2.5e-4, 5e-4), and VCT-Cloning (β = 2.5e-4), alongside the text. The audio players are not recoverable in this text version; each utterance below appears in three rows of the original table.]

Transcripts of the sample utterances:

  1. Throughout the centuries people have explained the rainbow in various ways.
  2. Corporate banking would be based in Edinburgh.
  3. People look, but no one ever finds it.
  4. My husband was fighting for his life.
  5. The money is not really our concern.
  6. He would just say it was terrible.
  7. His condition was serious but stable.
  8. We did discuss the Lockerbie issue.