Paper link: https://ieeexplore.ieee.org/document/10040000
Authors
Md Tousin Akhter • Padmanabha Banerjee • Sandipan Dhar • Nanda Dulal Jana
Voice Conversion (VC) has become a prominent subject of speech-synthesis research in recent years owing to its growing use in voice-assistance technology, automated movie dubbing, and speech-to-singing conversion, to name a few applications. VC converts one speaker's vocal style into that of another while preserving the linguistic content. The VC task is accomplished in a three-stage pipeline: speech analysis, speech feature mapping, and speech reconstruction. Generative Adversarial Network (GAN) models are now commonly used for voice feature mapping from the source to the target speaker. As these models evolve, the quality of synthesized speech improves, enabling real-world commercial and medical applications of VC. To ensure that the reproduced speech retains the characteristics of the target voice and that errors are minimized, evaluation metrics are needed to assess the quality. According to subjective and objective evaluations of the generated speech samples, the presented models can perform the voice conversion task effectively, attaining high speaker similarity and good speech quality.
The visual evaluation metric plots are available here.
The following files represent the various objective and subjective evaluation metrics that can be used in Voice Conversion.
- Mel-Cepstral Distortion (MCD)
- F0 Root Mean Square Error (F0 RMSE)
- log F0 Root Mean Square Error (log F0 RMSE)
- Modulation Spectra Distance (MSD)
- Signal-to-Noise Ratio (SNR)
- Perceptual Evaluation of Speech Quality (PESQ)
- Global Variance (GV)
- MCEP Trajectory (link)
- Modulation Spectrum (link)
- Mean MCEP (link)
- MCEP Scatter Plot (link)
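As an illustration of the first metric above, Mel-Cepstral Distortion between two time-aligned MCEP sequences can be sketched as follows. This is a minimal NumPy sketch, not the notebook's implementation: the `(frames, dims)` array shape, the exclusion of the 0th (energy) coefficient, and prior DTW alignment of the frames are all assumptions.

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_gen):
    """Frame-averaged MCD in dB between two time-aligned MCEP matrices
    of shape (frames, dims); the 0th (energy) coefficient is excluded."""
    diff = mcep_ref[:, 1:] - mcep_gen[:, 1:]
    # standard MCD constant: (10 / ln 10) * sqrt(2)
    const = (10.0 / np.log(10.0)) * np.sqrt(2.0)
    # per-frame Euclidean distance over cepstral dims, averaged over frames
    return float(np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1))))
```

Lower values indicate that the converted speech is spectrally closer to the target; identical sequences yield an MCD of 0 dB.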
Each notebook lets the user add the dataset and select the directory containing the audio (.wav) files of the corresponding speech classes. The directory paths of the original and the generated speech audio files are then given as user input, along with specifications such as markers and labels. For particular evaluation metrics, the dimensions to be visualized are also taken as user input.
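The pairing of original and generated files implied above can be sketched with a small helper. This is an assumption about the directory layout (matching filenames in the two user-supplied directories), not code from the notebooks:

```python
import glob
import os

def pair_wav_files(orig_dir, gen_dir):
    """Pair original/generated .wav files from two user-supplied directories
    by matching filenames; files whose counterpart is missing are skipped."""
    pairs = []
    for orig_path in sorted(glob.glob(os.path.join(orig_dir, "*.wav"))):
        gen_path = os.path.join(gen_dir, os.path.basename(orig_path))
        if os.path.exists(gen_path):
            pairs.append((orig_path, gen_path))
    return pairs
```

Each returned pair can then be loaded (e.g. with `librosa.load`) and passed to the chosen evaluation metric.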
You can either use the google_drive_downloader library to load the zip files of the dataset into the workspace, or directly mount your Google Drive and access it through Colab.
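The two loading options can be sketched as follows. The Drive file id and paths are placeholders for your own dataset, and the Colab-specific lines are shown as comments since they only run inside a Colab session:

```python
import zipfile

def extract_dataset(zip_path, dest_dir):
    """Unpack a downloaded dataset archive into the workspace."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)

# Option 1: fetch a shared zip with google_drive_downloader
# ("FILE_ID" is a placeholder for your own Drive file id):
# from google_drive_downloader import GoogleDriveDownloader as gdd
# gdd.download_file_from_google_drive(file_id="FILE_ID",
#                                     dest_path="./dataset.zip",
#                                     unzip=True)

# Option 2: mount your Drive inside Colab and read the files directly:
# from google.colab import drive
# drive.mount("/content/drive")
# extract_dataset("/content/drive/MyDrive/dataset.zip", "./data")
```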