Voice Conversion
Voice conversion (VC) is a technology that modifies a source speaker's speech so that it sounds as if spoken by a target speaker, without changing the linguistic content.
Greatest papers with code
The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
Mel-spectrogram augmentation for sequence to sequence voice conversion
In addition, we propose new policies (i.e., frequency warping, loudness control, and time-length control) for greater data variation.
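As a rough illustration of the three policies named above, here is a minimal NumPy sketch that perturbs a mel spectrogram; the function names, parameter ranges, and the linear-magnitude assumption are ours, not the paper's.

```python
import numpy as np

# Illustrative augmentation sketch; ranges and names are assumptions.

def loudness_control(mel, gain_db_range=(-6.0, 6.0)):
    """Scale overall loudness by a random gain (assumes linear-magnitude mel)."""
    gain_db = np.random.uniform(*gain_db_range)
    return mel * (10.0 ** (gain_db / 20.0))

def time_length_control(mel, rate_range=(0.9, 1.1)):
    """Stretch or compress the time axis via linear interpolation."""
    rate = np.random.uniform(*rate_range)
    n_mels, n_frames = mel.shape
    new_len = max(2, int(round(n_frames * rate)))
    src = np.linspace(0, n_frames - 1, new_len)   # fractional source frames
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_frames - 1)
    frac = src - lo
    return mel[:, lo] * (1 - frac) + mel[:, hi] * frac

def frequency_warping(mel, warp_range=(0.95, 1.05)):
    """Warp the mel-frequency axis by a random factor."""
    alpha = np.random.uniform(*warp_range)
    n_mels, _ = mel.shape
    src = np.clip(np.arange(n_mels) * alpha, 0, n_mels - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_mels - 1)
    frac = (src - lo)[:, None]
    return mel[lo] * (1 - frac) + mel[hi] * frac
```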
Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks
In the proposed framework incorporating the GANs, the discriminator is trained to distinguish natural and generated speech parameters, while the acoustic models are trained to minimize the weighted sum of the conventional minimum generation loss and an adversarial loss for deceiving the discriminator.
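A minimal PyTorch sketch of the weighted-sum objective described above, assuming an acoustic model that maps linguistic features to speech parameters; the weight `w_adv` and the binary cross-entropy formulation are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def generator_step(acoustic_model, discriminator, ling_feats, nat_params, w_adv=1.0):
    """Acoustic-model step: conventional generation loss (plain MSE here)
    plus an adversarial term for deceiving the discriminator. Sketch only."""
    gen_params = acoustic_model(ling_feats)
    mge_loss = F.mse_loss(gen_params, nat_params)      # minimum generation error
    logits = discriminator(gen_params)                 # D's belief the input is natural
    adv_loss = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))               # fool D into saying "natural"
    return mge_loss + w_adv * adv_loss

def discriminator_step(discriminator, nat_params, gen_params):
    """Train D to distinguish natural from generated speech parameters."""
    real = discriminator(nat_params)
    fake = discriminator(gen_params.detach())
    return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real)) +
            F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
```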
Unsupervised Speech Decomposition via Triple Information Bottleneck
Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm.
AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
On the other hand, CVAE training is simple but does not come with the distribution-matching property of a GAN.
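The idea can be sketched as an autoencoder whose content bottleneck is deliberately narrow and whose decoder is conditioned on a speaker embedding, trained with reconstruction loss alone; the GRU layers and dimensions below are illustrative, not the actual AUTOVC architecture.

```python
import torch
import torch.nn as nn

class BottleneckVC(nn.Module):
    """AUTOVC-style sketch: narrow content bottleneck plus speaker embedding,
    trained with reconstruction loss only (no GAN). Sizes are assumptions."""
    def __init__(self, n_mels=80, spk_dim=256, bottleneck=32):
        super().__init__()
        self.content_enc = nn.GRU(n_mels, bottleneck, batch_first=True)
        self.decoder = nn.GRU(bottleneck + spk_dim, 512, batch_first=True)
        self.out = nn.Linear(512, n_mels)

    def forward(self, mel, spk_emb):                    # mel: (B, T, n_mels)
        content, _ = self.content_enc(mel)              # (B, T, bottleneck)
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)
        h, _ = self.decoder(torch.cat([content, spk], dim=-1))
        return self.out(h)

# Training reconstructs with the source speaker's own embedding;
# conversion swaps in the target speaker's embedding at inference.
```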
Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion
End-to-end models for raw audio generation are a challenge, especially if they have to work with non-parallel data, which is a desirable setup in many situations.
StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks
This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN.
Defense for Black-box Attacks on Anti-spoofing Models by Self-Supervised Learning
To explore this issue, we propose employing Mockingjay, a self-supervised learning-based model, to protect anti-spoofing models against adversarial attacks in the black-box scenario.
MOSNet: Deep Learning based Objective Assessment for Voice Conversion
In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech.
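A minimal sketch of such an assessor, assuming mel-spectrogram input: frame-level scores from a recurrent net are averaged into an utterance-level MOS estimate. MOSNet itself uses CNN-BLSTM variants and additional frame-level losses; everything below is illustrative.

```python
import torch
import torch.nn as nn

class MOSPredictor(nn.Module):
    """MOSNet-style sketch: frame scores averaged into an utterance MOS."""
    def __init__(self, n_mels=80, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, mel):                        # mel: (B, T, n_mels)
        h, _ = self.rnn(mel)
        frame_scores = self.head(h).squeeze(-1)    # (B, T) per-frame quality
        return frame_scores.mean(dim=1)            # utterance-level estimate
```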
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
Recently, voice conversion (VC) without parallel data has been successfully adapted to multi-target scenario in which a single model is trained to convert the input voice to many different speakers.
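A minimal sketch of how instance normalization can separate speaker from content: normalizing each channel over time strips utterance-level (speaker) statistics, and the target speaker's statistics are re-injected AdaIN-style; shapes and names below are assumptions.

```python
import torch

def instance_norm(x, eps=1e-5):
    """Normalize each channel over time, stripping per-utterance statistics.
    x: (B, C, T) feature maps. Sketch only."""
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True)
    return (x - mean) / (std + eps)

def adain(content, spk_mean, spk_std):
    """Re-inject target-speaker statistics (shape (B, C, 1)) into the
    normalized content features, AdaIN-style."""
    return spk_std * instance_norm(content) + spk_mean
```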
Multi-target Voice Conversion without Parallel Data by Adversarially Learning Disentangled Audio Representations
The decoder then takes the speaker-independent latent representation and the target speaker embedding as the input to generate the voice of the target speaker with the linguistic content of the source utterance.
Voice Conversion from Unaligned Corpora using Variational Autoencoding Wasserstein Generative Adversarial Networks
Building a voice conversion (VC) system from non-parallel speech corpora is challenging but highly valuable in real application scenarios.
Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder
We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora.
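A toy PyTorch sketch of the frame-wise VAE formulation, assuming spectral feature vectors and a one-hot speaker code; because frames are modeled independently, training needs no aligned corpora. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectralVAE(nn.Module):
    """VAE sketch for non-parallel SC: speaker-independent latent z plus a
    speaker code y; dimensions are assumptions."""
    def __init__(self, feat_dim=24, z_dim=16, n_speakers=4):
        super().__init__()
        self.enc = nn.Linear(feat_dim, 2 * z_dim)          # -> (mu, logvar)
        self.dec = nn.Linear(z_dim + n_speakers, feat_dim)

    def forward(self, x, spk_onehot):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        x_hat = self.dec(torch.cat([z, spk_onehot], dim=-1))
        recon = F.mse_loss(x_hat, x)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl

# Conversion: encode source frames, decode with the target speaker's one-hot code.
```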
Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data
We propose Cotatron, a transcription-guided speech encoder for speaker-independent linguistic representation.
Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge
The idea is to learn a representation of speech by predicting future acoustic units.
Ranked #1 on Acoustic Unit Discovery on ZeroSpeech 2019 English
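At the core of such models is a vector-quantization layer that maps encoder outputs to discrete acoustic units; below is a generic straight-through VQ sketch, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z, codebook):
    """Map encoder outputs to nearest codebook entries (the 'acoustic units').
    z: (N, D), codebook: (K, D). Generic sketch with a straight-through
    estimator so gradients reach the encoder."""
    d = torch.cdist(z, codebook)                 # (N, K) pairwise distances
    idx = d.argmin(dim=-1)
    q = codebook[idx]
    commit = F.mse_loss(z, q.detach())           # pull encoder toward codes
    codebook_loss = F.mse_loss(q, z.detach())    # pull codes toward encoder
    q_st = z + (q - z).detach()                  # copy gradients through to z
    return q_st, idx, commit + codebook_loss
```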
Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion
We found that the proposed encoding method offers automatic extraction of speech content from speaker style, and is sufficient to cover full linguistic content in a given language.
MelGAN-VC: Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms
We propose MelGAN-VC, a voice conversion method that relies on non-parallel speech data and is able to convert audio signals of arbitrary length from a source voice to a target voice.
StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion
To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2.
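One such conditional method can be sketched as modulation-based conditioning: a per-domain scale and shift applied to instance-normalized features, so a single generator serves many speaker domains. StarGAN-VC2 conditions on the source-and-target pair; for brevity this sketch uses a single domain code, and all names are assumptions.

```python
import torch
import torch.nn as nn

class ConditionalIN(nn.Module):
    """Modulation-style conditioning sketch: learned per-domain gamma/beta
    applied to instance-normalized features."""
    def __init__(self, channels, n_domains):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.gamma = nn.Embedding(n_domains, channels)
        self.beta = nn.Embedding(n_domains, channels)

    def forward(self, x, domain):                 # x: (B, C, T), domain: (B,)
        g = self.gamma(domain).unsqueeze(-1)      # (B, C, 1)
        b = self.beta(domain).unsqueeze(-1)
        return g * self.norm(x) + b
```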
Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion
In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech.
Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders
An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational autoencoders (VAEs), to model the latent structure of speech in an unsupervised manner.
Scalable Factorized Hierarchical Variational Autoencoder Training
Deep generative models have achieved great success in unsupervised learning with the ability to capture complex nonlinear relationships between latent generating factors and observations.
CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion
Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data.
Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks
A subjective evaluation showed that the quality of the converted speech was comparable to that obtained with a Gaussian mixture model-based method trained under advantageous conditions, i.e., with parallel data and twice the amount of it.
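The central ingredient is the cycle-consistency loss, sketched below for two generators `G_xy` and `G_yx` (hypothetical callables mapping between the two speakers' feature spaces); full CycleGAN-VC training also adds adversarial and identity-mapping losses.

```python
import torch.nn.functional as F

def cycle_loss(G_xy, G_yx, x, y):
    """Cycle consistency for parallel-data-free VC: converting a feature
    sequence to the other speaker and back should reproduce the input."""
    forward = F.l1_loss(G_yx(G_xy(x)), x)    # x -> y -> x
    backward = F.l1_loss(G_xy(G_yx(y)), y)   # y -> x -> y
    return forward + backward
```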
Deep Residual Neural Networks for Audio Spoofing Detection
Additionally, replay attacks, where the attacker uses a loudspeaker to replay previously recorded genuine human speech, are also possible.
ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks
We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT).
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder
In this work, to overcome this problem, we propose a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling converted features back into the system, yielding cyclic reconstructed spectra that can be optimized directly.
VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture
Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another one's while preserving the linguistic content.
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data
We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content.
F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder
Recently, AutoVC, a method based on conditional autoencoders (CAEs), achieved state-of-the-art results by disentangling speaker identity from speech content using information-constraining bottlenecks; it performs zero-shot conversion by swapping in a different speaker's identity embedding to synthesize a new voice.
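A common way to make such a model F0-consistent is to condition it on a speaker-normalized pitch contour; the sketch below z-scores voiced log-F0 per speaker and maps the contour into the target's range. This is a generic recipe under our own assumptions, not the paper's exact procedure.

```python
import numpy as np

def normalize_f0(f0, eps=1e-8):
    """Z-score the voiced log-F0 of one speaker so the model sees a
    speaker-independent pitch contour. f0: per-frame Hz, 0 for unvoiced."""
    voiced = f0 > 0
    logf0 = np.log(f0[voiced])
    mu, sigma = logf0.mean(), logf0.std() + eps
    norm = np.zeros_like(f0)
    norm[voiced] = (logf0 - mu) / sigma
    return norm, mu, sigma

def denormalize_f0(norm, mu_tgt, sigma_tgt):
    """Map a normalized contour into the target speaker's log-F0 range.
    Uses norm != 0 as a crude voiced mask (sketch-level simplification)."""
    f0 = np.zeros_like(norm)
    voiced = norm != 0
    f0[voiced] = np.exp(norm[voiced] * sigma_tgt + mu_tgt)
    return f0
```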
Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech
This paper focuses on using voice conversion (VC) to improve the speech intelligibility of surgical patients who have had parts of their articulators removed.
ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder
Such situations can be avoided by introducing an auxiliary classifier and training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier.
Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion
Cross-lingual voice conversion (VC) is a task that aims to synthesize target voices with the same content while source and target speakers speak in different languages.
CinC-GAN for Effective F0 prediction for Whisper-to-Normal Speech Conversion
The CycleGAN-based method uses two different models, one for Mel Cepstral Coefficients (MCC) mapping, and another for F0 prediction, where F0 is highly dependent on the pre-trained model of MCC mapping.
Robust Training of Vector Quantized Bottleneck Models
We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs.
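A common remedy, sketched below under our own assumptions, is to track codeword usage with an exponential moving average and reinitialize rarely used codes from current encoder outputs; names and the threshold are illustrative.

```python
import torch

@torch.no_grad()
def reinit_dead_codes(codebook, usage_ema, encoder_outputs, threshold=1.0):
    """Replace rarely used codewords with random encoder outputs from the
    current batch, countering poor initialization and non-stationarity.
    codebook: (K, D) parameter, usage_ema: (K,), encoder_outputs: (N, D)."""
    dead = usage_ema < threshold                  # (K,) boolean mask
    n_dead = int(dead.sum())
    if n_dead > 0:
        idx = torch.randint(0, encoder_outputs.size(0), (n_dead,))
        codebook.data[dead] = encoder_outputs[idx]
        usage_ema[dead] = threshold               # reset the usage estimate
    return codebook, usage_ema
```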
Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants
The voice signal is a rich resource that discloses several possible states of a speaker, such as emotional state, confidence and stress levels, physical condition, age, gender, and personal traits.
Voice Conversion using Convolutional Neural Networks
The human auditory system is able to distinguish the vocal source of thousands of speakers, yet not much is known about what features the auditory system uses to do this.
Vocoder-free End-to-End Voice Conversion with Transformer Network
Additional pre/post-processing, such as mel filter banks (MFBs) and a vocoder, is not essential for converting real human speech to another voice.
Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition
We evaluated this speaker adaptation approach on two low-resource corpora, namely, Ainu and Mboshi.
Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
Our NN predicts MOS with a high correlation to human judgments.
STC Antispoofing Systems for the ASVspoof2019 Challenge
We enhanced the Light CNN architecture previously used by the authors for replay attack detection, which achieved high spoofing detection quality in the ASVspoof 2017 challenge.