Voice Conversion
Voice conversion (VC) is a technology that modifies a source speaker's speech so that it sounds as if spoken by a target speaker, without changing the linguistic content.
Greatest papers with code
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
Mel-spectrogram augmentation for sequence to sequence voice conversion
In addition, we propose new policies (i.e., frequency warping, loudness control, and time-length control) for greater data variation.
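As a rough illustration of the three policies named above, here is a minimal NumPy sketch that perturbs a mel spectrogram; the function names, parameter ranges, and the linear-magnitude assumption are ours, not the paper's.

```python
import numpy as np

# Illustrative augmentation sketch; ranges and names are assumptions.

def loudness_control(mel, gain_db_range=(-6.0, 6.0)):
    """Scale overall loudness by a random gain (assumes linear-magnitude mel)."""
    gain_db = np.random.uniform(*gain_db_range)
    return mel * (10.0 ** (gain_db / 20.0))

def time_length_control(mel, rate_range=(0.9, 1.1)):
    """Stretch or compress the time axis via linear interpolation."""
    rate = np.random.uniform(*rate_range)
    n_mels, n_frames = mel.shape
    new_len = max(2, int(round(n_frames * rate)))
    src = np.linspace(0, n_frames - 1, new_len)   # fractional source frames
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_frames - 1)
    frac = src - lo
    return mel[:, lo] * (1 - frac) + mel[:, hi] * frac

def frequency_warping(mel, warp_range=(0.95, 1.05)):
    """Warp the mel-frequency axis by a random factor."""
    alpha = np.random.uniform(*warp_range)
    n_mels, _ = mel.shape
    src = np.clip(np.arange(n_mels) * alpha, 0, n_mels - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_mels - 1)
    frac = (src - lo)[:, None]
    return mel[lo] * (1 - frac) + mel[hi] * frac
```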
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks
In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
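A minimal PyTorch sketch of the weighted-sum objective described above, assuming an acoustic model that maps linguistic features to speech parameters; the weight `w_adv` and the binary cross-entropy formulation are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def generator_step(acoustic_model, discriminator, ling_feats, nat_params, w_adv=1.0):
    """Acoustic-model step: conventional generation loss (plain MSE here)
    plus an adversarial term for deceiving the discriminator. Sketch only."""
    gen_params = acoustic_model(ling_feats)
    mge_loss = F.mse_loss(gen_params, nat_params)      # minimum generation error
    logits = discriminator(gen_params)                 # D's belief the input is natural
    adv_loss = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))               # fool D into saying "natural"
    return mge_loss + w_adv * adv_loss

def discriminator_step(discriminator, nat_params, gen_params):
    """Train D to distinguish natural from generated speech parameters."""
    real = discriminator(nat_params)
    fake = discriminator(gen_params.detach())
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
```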
Unsupervised Speech Decomposition via Triple Information Bottleneck
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.
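The idea can be sketched as an autoencoder whose content bottleneck is deliberately narrow and whose decoder is conditioned on a speaker embedding, trained with reconstruction loss alone; the GRU layers and dimensions below are illustrative, not the actual AUTOVC architecture.

```python
import torch
import torch.nn as nn

class BottleneckVC(nn.Module):
    """AUTOVC-style sketch: narrow content bottleneck plus speaker embedding,
    trained with reconstruction loss only (no GAN). Sizes are assumptions."""
    def __init__(self, n_mels=80, spk_dim=256, bottleneck=32):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, bottleneck, batch_first=True)
        self.decoder = nn.GRU(bottleneck + spk_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, mel, spk_emb):                    # mel: (B, T, n_mels)
        content, _ = self.content_enc(mel)              # (B, T, bottleneck)
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        h, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return self.out(h)

# Training reconstructs with the source speaker's own embedding;
# conversion swaps in the target speaker's embedding at inference.
```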
Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion
End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations.
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN.
Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning
To explore this issue, we propose employing Mockingjay, a self-supervised learning-based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario.
MOSNet: Deep Learning based Objective Assessment for Voice Conversion
In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.
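A minimal sketch of such an assessor, assuming mel-spectrogram input: frame-level scores from a recurrent net are averaged into an utterance-level MOS estimate. MOSNet itself uses CNN-BLSTM variants and additional frame-level losses; everything below is illustrative.

```python
import torch
import torch.nn as nn

class MOSPredictor(nn.Module):
    """MOSNet-style sketch: frame scores averaged into an utterance MOS."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, mel):                        # mel: (B, T, n_mels)
        h, _ = self.rnn(mel)
        frame_scores = self.head(h).squeeze(-1)    # (B, T) per-frame quality
        return frame_scores.mean(dim=1)            # utterance-level estimate
```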
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Recently, voice conversion (VC) without parallel data has been successfully adapted to multi-target scenario in which a single model is trained to convert the input voice to many different speakers.
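A minimal sketch of how instance normalization can separate speaker from content: normalizing each channel over time strips utterance-level (speaker) statistics, and the target speaker's statistics are re-injected AdaIN-style; shapes and names below are assumptions.

```python
import torch

def instance_norm(x, eps=1e-5):
    """Normalize each channel over time, stripping per-utterance statistics.
    x: (B, C, T) feature maps. Sketch only."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True)
    return (x - mean) / (std + eps)

def adain(content, spk_mean, spk_std):
    """Re-inject target-speaker statistics (shape (B, C, 1)) into the
    normalized content features, AdaIN-style."""
    return spk_std * instance_norm(content) + spk_mean
```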
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.
Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks
Building a voice conversion (VC) system from non-parallel speech corpora is challenging but highly valuable in real application scenarios.
Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder
We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora.
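A toy PyTorch sketch of the frame-wise VAE formulation, assuming spectral feature vectors and a one-hot speaker code; because frames are modeled independently, training needs no aligned corpora. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralVAE(nn.Module):
    """VAE sketch for non-parallel SC: speaker-independent latent z plus a
    speaker code y; dimensions are assumptions."""
    def __init__(self, feat_dim=24, z_dim=16, n_speakers=4):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * z_dim)          # -> (mu, logvar)
        self.dec = nn.Linear(z_dim + n_speakers, feat_dim)

    def forward(self, x, spk_onehot):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        x_hat = self.dec(torch.cat([z, spk_onehot], dim=-1))
        recon = F.mse_loss(x_hat, x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

# Conversion: encode source frames, decode with the target speaker's one-hot code.
```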
Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data
We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation.
Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge
The idea is to learn a representation of speech by predicting future acoustic units.
Ranked #1 on Acoustic Unit Discovery on ZeroSpeech 2019 English
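At the core of such models is a vector-quantization layer that maps encoder outputs to discrete acoustic units; below is a generic straight-through VQ sketch, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    """Map encoder outputs to nearest codebook entries (the 'acoustic units').
    z: (N, D), codebook: (K, D). Generic sketch with a straight-through
    estimator so gradients reach the encoder."""
    d = torch.cdist(z, codebook)                 # (N, K) pairwise distances
    idx = d.argmin(dim=-1)
    q = codebook[idx]
    commit = F.mse_loss(z, q.detach())           # pull encoder toward codes
    codebook_loss = F.mse_loss(q, z.detach())    # pull codes toward encoder
    q_st = z + (q - z).detach()                  # copy gradients through to z
    return q_st, idx, commit + codebook_loss
```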
Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion
We found that the proposed encoding method offers automatic extraction of speech content from speaker style, and is sufficient to cover full linguistic content in a given language.
MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms
We propose MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice.
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion
To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2.
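One such conditional method can be sketched as modulation-based conditioning: a per-domain scale and shift applied to instance-normalized features, so a single generator serves many speaker domains. StarGAN-VC2 conditions on the source-and-target pair; for brevity this sketch uses a single domain code, and all names are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalIN(nn.Module):
    """Modulation-style conditioning sketch: learned per-domain gamma/beta
    applied to instance-normalized features."""
    def __init__(self, channels, n_domains):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.gamma = nn.Embedding(n_domains, channels)
        self.beta = nn.Embedding(n_domains, channels)

    def forward(self, x, domain):                 # x: (B, C, T), domain: (B,)
        g = self.gamma(domain).unsqueeze(-1)      # (B, C, 1)
        b = self.beta(domain).unsqueeze(-1)
        return g * self.norm(x) + b
```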
Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion
In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech.
Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders
An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational autoencoders (VAEs), to model the latent structure of speech in an unsupervised manner.
Scalable Factorized Hierarchical Variational Autoencoder Training
Deep generative models have achieved great success in unsupervised learning with the ability to capture complex nonlinear relationships between latent generating factors and observations.
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion
Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data.
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method trained under advantageous conditions, i.e., with parallel data and twice the amount of it.
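The central ingredient is the cycle-consistency loss, sketched below for two generators `G_xy` and `G_yx` (hypothetical callables mapping between the two speakers' feature spaces); full CycleGAN-VC training also adds adversarial and identity-mapping losses.

```python
import torch.nn.functional as F

def cycle_loss(G_xy, G_yx, x, y):
    """Cycle consistency for parallel-data-free VC: converting a feature
    sequence to the other speaker and back should reproduce the input."""
    forward = F.l1_loss(G_yx(G_xy(x)), x)    # x -> y -> x
    backward = F.l1_loss(G_xy(G_yx(y)), y)   # y -> x -> y
    return forward + backward
```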
Deep Residual Neural Networks for Audio Spoofing Detection
Additionally, replay attacks, where the attacker uses a loudspeaker to replay previously recorded genuine human speech, are also possible.
ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks
We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT).
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder
In this work, to overcome this problem, we propose a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling converted features back into the system, yielding cyclic reconstructed spectra that can be optimized directly.
VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture
Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another one's while preserving the linguistic content.
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data
We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content.
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder
Recently, AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity from speech content using information-constraining bottlenecks; it performs zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.
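A common way to make such a model F0-consistent is to condition it on a speaker-normalized pitch contour; the sketch below z-scores voiced log-F0 per speaker and maps the contour into the target's range. This is a generic recipe under our own assumptions, not the paper's exact procedure.

```python
import numpy as np

def normalize_f0(f0, eps=1e-8):
    """Z-score the voiced log-F0 of one speaker so the model sees a
    speaker-independent pitch contour. f0: per-frame Hz, 0 for unvoiced."""
    voiced = f0 > 0
    logf0 = np.log(f0[voiced])
    mu, sigma = logf0.mean(), logf0.std() + eps
    norm = np.zeros_like(f0)
    norm[voiced] = (logf0 - mu) / sigma
    return norm, mu, sigma

def denormalize_f0(norm, mu_tgt, sigma_tgt):
    """Map a normalized contour into the target speaker's log-F0 range.
    Uses norm != 0 as a crude voiced mask (sketch-level simplification)."""
    f0 = np.zeros_like(norm)
    voiced = norm != 0
    f0[voiced] = np.exp(norm[voiced] * sigma_tgt + mu_tgt)
    return f0
```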
Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech
This paper focuses on using voice conversion (VC) to improve the speech intelligibility of surgical patients who have had parts of their articulators removed.
ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder
Such situations can be avoided by introducing an auxiliary classifier and training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier.
Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion
Cross-lingual voice conversion (VC) is a task that aims to synthesize target voices with the same content while source and target speakers speak in different languages.
CinC-GAN for Effective F0 prediction for Whisper-to-Normal Speech Conversion
The CycleGAN-based method uses two different models, one for Mel Cepstral Coefficients (MCC) mapping, and another for F0 prediction, where F0 is highly dependent on the pre-trained model of MCC mapping.
Robust Training of Vector Quantized Bottleneck Models
We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs.
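A common remedy, sketched below under our own assumptions, is to track codeword usage with an exponential moving average and reinitialize rarely used codes from current encoder outputs; names and the threshold are illustrative.

```python
import torch

@torch.no_grad()
def reinit_dead_codes(codebook, usage_ema, encoder_outputs, threshold=1.0):
    """Replace rarely used codewords with random encoder outputs from the
    current batch, countering poor initialization and non-stationarity.
    codebook: (K, D) parameter, usage_ema: (K,), encoder_outputs: (N, D)."""
    dead = usage_ema < threshold                  # (K,) boolean mask
    n_dead = int(dead.sum())
    if n_dead > 0:
        idx = torch.randint(0, encoder_outputs.size(0), (n_dead,))
        codebook.data[dead] = encoder_outputs[idx]
        usage_ema[dead] = threshold               # reset the usage estimate
    return codebook, usage_ema
```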
Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants
The voice signal is a rich resource that discloses several possible states of a speaker, such as emotional state, confidence and stress levels, physical condition, age, gender, and personal traits.
Voice Conversion using Convolutional Neural Networks
The human auditory system is able to distinguish the vocal source of thousands of speakers, yet not much is known about what features the auditory system uses to do this.
Vocoder-free End-to-End Voice Conversion with Transformer Network
Additional pre/post-processing, such as mel filter banks (MFBs) and a vocoder, is not essential for converting real human speech to another voice.
Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition
We evaluated this speaker adaptation approach on two low-resource corpora, namely, Ainu and Mboshi.
Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
Our NN predicts MOS with a high correlation to human judgments.
STC Antispoofing Systems for the ASVspoof2019 Challenge
We enhanced the Light CNN architecture previously used by the authors for replay attack detection, which achieved high spoofing detection quality in the ASVspoof 2017 challenge.