[TTS] Clean FastPitch_Finetuning.ipynb notebook (#3698)
* clean FastPitch_Finetuning.ipynb notebook

Signed-off-by: Oktai Tatanov <[email protected]>

* remove unnecessary code

Signed-off-by: Oktai Tatanov <[email protected]>

* update README

Signed-off-by: Oktai Tatanov <[email protected]>
Oktai15 authored and fayejf committed Mar 2, 2022
1 parent 7355909 commit c6718e8
Showing 2 changed files with 47 additions and 58 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -74,7 +74,7 @@ Key Features
* `NGC collection of pre-trained NLP models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_nlp>`_
* `Speech synthesis (TTS) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tts/intro.html#>`_
* Spectrogram generation: Tacotron2, GlowTTS, TalkNet, FastPitch, FastSpeech2, Mixer-TTS, Mixer-TTS-X
* Vocoders: WaveGlow, SqueezeWave, UniGlow, MelGAN, HiFiGAN
* Vocoders: WaveGlow, SqueezeWave, UniGlow, MelGAN, HiFiGAN, UnivNet
* End-to-end speech generation: FastPitch_HifiGan_E2E, FastSpeech2_HifiGan_E2E
* `NGC collection of pre-trained TTS models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_tts>`_
* `Tools <https://github.com/NVIDIA/NeMo/tree/main/tools>`_
103 changes: 46 additions & 57 deletions tutorials/tts/FastPitch_Finetuning.ipynb
@@ -61,7 +61,7 @@
"# # If you're using Google Colab and not running locally, uncomment and run this cell.\n",
"# !apt-get install sox libsndfile1 ffmpeg\n",
"# !pip install wget unidecode\n",
"# !python -m pip install git+https://github.com/NeMo/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
"# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
]
},
{
@@ -212,7 +212,7 @@
"id": "5415162b",
"metadata": {},
"source": [
"We also need some additional files (see `MixerTTS_FastPitch_Training.ipynb` tutorial for more details) for training. Let's download it too."
"We also need some additional files (see `FastPitch_MixerTTS_Training.ipynb` tutorial for more details) for training. Let's download it too."
]
},
{
@@ -290,7 +290,7 @@
"source": [
"Let's take a closer look at the training command:\n",
"\n",
"* `python fastpitch_finetune.py --config-name=fastpitch_align_v1.05.yaml`\n",
"* `--config-name=fastpitch_align_v1.05.yaml`\n",
" * --config-name tells the script what config to use.\n",
"\n",
"* `train_dataset=./6097_manifest_train_dur_5_mins_local.json \n",
@@ -321,7 +321,7 @@
"\n",
"* `model.pitch_mean=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512`\n",
" * For the new speaker, we need to define new pitch hyperparameters for better audio quality.\n",
" * These parameters work for speaker 6097 from the HiFiTTS dataset.\n",
" * These parameters work for speaker 6097 from the Hi-Fi TTS dataset.\n",
" * For speaker 92, we suggest `model.pitch_mean=214.5 model.pitch_std=30.9 model.pitch_fmin=80 model.pitch_fmax=512`.\n",
" * fmin and fmax are hyperparameters to librosa's pyin function. We recommend tweaking these per speaker.\n",
" * After fmin and fmax are defined, pitch mean and std can be easily extracted.\n",
@@ -384,25 +384,24 @@
},
"outputs": [],
"source": [
"def infer(spec_gen_model, vocoder_model, str_input, speaker = None):\n",
"def infer(spec_gen_model, vocoder_model, str_input, speaker=None):\n",
" \"\"\"\n",
" Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.\n",
" \n",
" Arguments:\n",
" spec_gen_model -- Instance of FastPitch model\n",
" vocoder_model -- Instance of a vocoder model (HiFiGAN in our case)\n",
" str_input -- Text input for the synthesis\n",
" speaker -- Speaker number (in the case of a multi-speaker model -- in the mixing case)\n",
" Args:\n",
" spec_gen_model: Spectrogram generator model (FastPitch in our case)\n",
" vocoder_model: Vocoder model (HiFiGAN in our case)\n",
" str_input: Text input for the synthesis\n",
" speaker: Speaker ID\n",
" \n",
" Returns:\n",
" spectrogram, waveform of the synthesized audio.\n",
" spectrogram and waveform of the synthesized audio.\n",
" \"\"\"\n",
" parser_model = spec_gen_model\n",
" with torch.no_grad():\n",
" parsed = parser_model.parse(str_input)\n",
" parsed = spec_gen_model.parse(str_input)\n",
" if speaker is not None:\n",
" speaker = torch.tensor([speaker]).long().cuda()\n",
" spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker = speaker)\n",
" speaker = torch.tensor([speaker]).long().to(device=spec_gen_model.device)\n",
" spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker=speaker)\n",
" audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)\n",
" \n",
" if spectrogram is not None:\n",
@@ -414,37 +413,29 @@
" audio = audio.to('cpu').numpy()\n",
" return spectrogram, audio\n",
"\n",
"def get_best_ckpt(experiment_base_dir, new_speaker_id, duration_mins, mixing_enabled, original_speaker_id):\n",
" \"\"\"\n",
" Gives the model checkpoint paths of an experiment we ran. \n",
"def get_best_ckpt_from_last_run(\n",
" base_dir, \n",
" new_speaker_id, \n",
" duration_mins, \n",
" mixing_enabled, \n",
" original_speaker_id, \n",
" model_name=\"FastPitch\"\n",
" ): \n",
" mixing = \"no_mixing\" if not mixing_enabled else \"mixing\"\n",
" \n",
" Arguments:\n",
" experiment_base_dir -- Base experiment directory (specified on top of this notebook as exp_base_dir)\n",
" new_speaker_id -- Speaker id of new HiFiTTS speaker we finetuned FastPitch on\n",
" duration_mins -- total minutes of the new speaker data\n",
" mixing_enabled -- True or False depending on whether we want to mix the original speaker data or not\n",
" original_speaker_id -- speaker id of the original HiFiTTS speaker\n",
" d = f\"{original_speaker_id}_to_{new_speaker_id}_{mixing}_{duration_mins}_mins\"\n",
" \n",
" Returns:\n",
" List of all checkpoint paths sorted by validation error, Last checkpoint path\n",
" \"\"\"\n",
" if not mixing_enabled:\n",
" exp_dir = \"{}/{}_to_{}_no_mixing_{}_mins\".format(experiment_base_dir, original_speaker_id, new_speaker_id, duration_mins)\n",
" else:\n",
" exp_dir = \"{}/{}_to_{}_mixing_{}_mins\".format(experiment_base_dir, original_speaker_id, new_speaker_id, duration_mins)\n",
" exp_dirs = list([i for i in (Path(base_dir) / d / model_name).iterdir() if i.is_dir()])\n",
" last_exp_dir = sorted(exp_dirs)[-1]\n",
" \n",
" last_checkpoint_dir = last_exp_dir / \"checkpoints\"\n",
" \n",
" ckpt_candidates = []\n",
" last_ckpt = None\n",
" for root, dirs, files in os.walk(exp_dir):\n",
" for file in files:\n",
" if file.endswith(\".ckpt\"):\n",
" val_error = float(file.split(\"v_loss=\")[1].split(\"-epoch\")[0])\n",
" if \"last\" in file:\n",
" last_ckpt = os.path.join(root, file)\n",
" ckpt_candidates.append( (val_error, os.path.join(root, file)))\n",
" ckpt_candidates.sort()\n",
" last_ckpt = list(last_checkpoint_dir.glob('*-last.ckpt'))\n",
"\n",
" if len(last_ckpt) == 0:\n",
" raise ValueError(f\"There is no last checkpoint in {last_checkpoint_dir}.\")\n",
" \n",
" return ckpt_candidates, last_ckpt"
" return str(last_ckpt[0])"
]
},
{
@@ -454,7 +445,7 @@
"id": "0153bd5a"
},
"source": [
"Specify the speaker id, duration mins and mixing variable to find the relevant checkpoint from the exp_base_dir and compare the synthesized audio with validation samples of the new speaker."
"Specify the speaker id, duration mins and mixing variable to find the relevant checkpoint and compare the synthesized audio with validation samples of the new speaker."
]
},
{
@@ -472,34 +463,32 @@
"mixing = False\n",
"original_speaker_id = \"ljspeech\"\n",
"\n",
"_ ,last_ckpt = get_best_ckpt(\"./\", new_speaker_id, duration_mins, mixing, original_speaker_id)\n",
"last_ckpt = get_best_ckpt_from_last_run(\"./\", new_speaker_id, duration_mins, mixing, original_speaker_id)\n",
"print(last_ckpt)\n",
"\n",
"spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)\n",
"spec_model.eval().cuda()\n",
"_speaker=None\n",
"\n",
"speaker_id = None\n",
"if mixing:\n",
" _speaker = 1\n",
" speaker_id = 1\n",
"\n",
"num_val = 2\n",
"\n",
"manifest_path = os.path.join(\"./\", \"{}_manifest_dev_ns_all_local.json\".format(new_speaker_id))\n",
"val_records = []\n",
"with open(manifest_path, \"r\") as f:\n",
"with open(f\"{new_speaker_id}_manifest_dev_ns_all_local.json\", \"r\") as f:\n",
" for i, line in enumerate(f):\n",
" val_records.append( json.loads(line) )\n",
" val_records.append(json.loads(line))\n",
" if len(val_records) >= num_val:\n",
" break\n",
" \n",
"for val_record in val_records:\n",
" print (\"Real validation audio\")\n",
" print(\"Real validation audio\")\n",
" ipd.display(ipd.Audio(val_record['audio_filepath'], rate=22050))\n",
" print (\"SYNTHESIZED FOR -- Speaker: {} | Dataset size: {} mins | Mixing:{} | Text: {}\".format(new_speaker_id, duration_mins, mixing, val_record['text']))\n",
" spec, audio = infer(spec_model, vocoder, val_record['text'], speaker = _speaker)\n",
" print(f\"SYNTHESIZED FOR -- Speaker: {new_speaker_id} | Dataset size: {duration_mins} mins | Mixing:{mixing} | Text: {val_record['text']}\")\n",
" spec, audio = infer(spec_model, vocoder, val_record['text'], speaker=speaker_id)\n",
" ipd.display(ipd.Audio(audio, rate=22050))\n",
" %matplotlib inline\n",
" #if spec is not None:\n",
" imshow(spec, origin=\"lower\", aspect = \"auto\")\n",
" imshow(spec, origin=\"lower\", aspect=\"auto\")\n",
" plt.show()"
]
},
@@ -595,8 +584,8 @@
"\n",
"`python examples/tts/hifigan_finetune.py --config-name=hifigan.yaml model.train_ds.dataloader_params.batch_size=32 model.max_steps=1000 ~model.optim.sched model.optim.lr=0.0001 train_dataset=./hifigan_train_ft.json validation_datasets=./hifigan_val_ft.json exp_manager.exp_dir=hifigan_ft +init_from_nemo_model=tts_hifigan.nemo trainer.check_val_every_n_epoch=10 model/train_ds=train_ds_finetune model/validation_ds=val_ds_finetune`\n",
"\n",
"### Improving TTS by Adding More Data\n",
"We can add more data in two ways. they can be combined for the best effect:\n",
"### Adding more data\n",
"We can add more data in two ways. They can be combined for the best effect:\n",
"\n",
"* Add more training data from the new speaker\n",
"\n",