https://github.com/magenta/ddsp
https://github.com/YatingMusic/ddsp-singing-vocoders
We recommend first installing PyTorch from the official website, then running:
pip install -r requirements.txt
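For example, a generic install might look like the line below; the exact command depends on your platform and CUDA version, so prefer the selector on pytorch.org:
pip install torch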
Put all the training data (audio clips in .wav format) in the following directory:
data/train/audio
Put all the validation data (audio clips in .wav format) in the following directory:
data/val/audio
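The clips should match the sampling rate set in the configuration file (44.1 kHz by default). As a minimal, hypothetical helper (not part of this repository), you could resample and copy your clips with librosa and soundfile; the source directory and target rate below are assumptions:

import os
import glob
import librosa
import soundfile as sf

SRC_DIR = "raw_audio"          # assumption: wherever your original clips live
DST_DIR = "data/train/audio"   # training directory expected by preprocess.py
TARGET_SR = 44100              # should match the sampling rate in your config

os.makedirs(DST_DIR, exist_ok=True)
for path in glob.glob(os.path.join(SRC_DIR, "*.wav")):
    # load as mono and resample to the target rate
    audio, _ = librosa.load(path, sr=TARGET_SR, mono=True)
    sf.write(os.path.join(DST_DIR, os.path.basename(path)), audio, TARGET_SR)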
Then run
python preprocess.py -c configs/full.yaml
for a model using hybrid additive and subtractive synthesis, or run
python preprocess.py -c configs/sins.yaml
for a model using additive synthesis only, or run
python preprocess.py -c configs/sawsub.yaml
for a model using subtractive synthesis only.
You can modify the configuration file configs/<model_name>.yaml
before preprocessing. The default configuration assumes 44.1 kHz audio, a training set of about a few hours, and a GTX 1660 graphics card.
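If you want to see the available options before editing, a minimal sketch using PyYAML (the file path is one of the configs above; the key names vary per model):

import yaml

with open("configs/full.yaml") as f:
    config = yaml.safe_load(f)

# dump the whole configuration tree to see which keys can be changed
print(yaml.dump(config, default_flow_style=False))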
# train a full model as an example
python train.py -c configs/full.yaml
The command line for training other models is similar.
You can safely interrupt training; running the same command line again will resume it.
You can also fine-tune the model: interrupt training, re-preprocess with the new dataset or change the training parameters (batch size, learning rate, etc.), and then run the same command line.
# check the training status using tensorboard
tensorboard --logdir=exp
# Copy-synthesising test
# wav -> mel, f0 -> wav
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <keychange (semitones)>
# Pitch-shifting test
# wav -> mel, f0 -> mel (unchanged), f0 (shifted) -> wav
python main.py -i <input.wav> -m <model_file.pt> -o <output.wav> -k <key(semitones)>
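For example, a hypothetical pitch-shifting call that raises the pitch by two semitones (the checkpoint path is only an illustration; use the one produced by your own training run):
python main.py -i input.wav -m exp/full/model_best.pt -o output_up2.wav -k 2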
It is recommended to try the "Full" model first, which generally has a low multi-scale STFT loss and relatively good quality when applying a pitch shift.
However, this loss sometimes cannot reflect the subjective sense of hearing.
If the "Full" model does not work well, it is recommended to switch to the "Sins" model.
The "Sins" model works also well when applying copy synthesis, but it changes the formant when applying a pitch shift, which changes the timbre.
The "SawSub" model is not recommended due to artifacts in unvoiced phonemes, although it probably has the best formant invariance in pitch-shifting cases.
For a seen speaker, the sound quality of a well-trained DDSP vocoder is better than that of the WORLD or Griffin-Lim vocoders, and it can compete with GAN-based vocoders when the total amount of data is relatively small. For a large amount of data, however, the upper limit of sound quality will be lower than that of generative-model-based vocoders.
For unseen speakers, the performance may be unsatisfactory.