[TTS] Clean FastPitch_Finetuning.ipynb notebook (#3698)
* clean FastPitch_Finetuning.ipynb notebook

Signed-off-by: Oktai Tatanov <[email protected]>

* remove unnecessary code

Signed-off-by: Oktai Tatanov <[email protected]>

* update README

Signed-off-by: Oktai Tatanov <[email protected]>
Oktai15 authored and fayejf committed Mar 2, 2022
1 parent 7355909 commit c6718e8
Showing 2 changed files with 47 additions and 58 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -74,7 +74,7 @@ Key Features
* `NGC collection of pre-trained NLP models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_nlp>`_
* `Speech synthesis (TTS) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tts/intro.html#>`_
* Spectrogram generation: Tacotron2, GlowTTS, TalkNet, FastPitch, FastSpeech2, Mixer-TTS, Mixer-TTS-X
* Vocoders: WaveGlow, SqueezeWave, UniGlow, MelGAN, HiFiGAN
* Vocoders: WaveGlow, SqueezeWave, UniGlow, MelGAN, HiFiGAN, UnivNet
* End-to-end speech generation: FastPitch_HifiGan_E2E, FastSpeech2_HifiGan_E2E
* `NGC collection of pre-trained TTS models. <https://ngc.nvidia.com/catalog/collections/nvidia:nemo_tts>`_
* `Tools <https://github.com/NVIDIA/NeMo/tree/main/tools>`_
103 changes: 46 additions & 57 deletions tutorials/tts/FastPitch_Finetuning.ipynb
@@ -61,7 +61,7 @@
"# # If you're using Google Colab and not running locally, uncomment and run this cell.\n",
"# !apt-get install sox libsndfile1 ffmpeg\n",
"# !pip install wget unidecode\n",
"# !python -m pip install git+https://github.com/NeMo/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
"# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
]
},
{
@@ -212,7 +212,7 @@
"id": "5415162b",
"metadata": {},
"source": [
"We also need some additional files (see `MixerTTS_FastPitch_Training.ipynb` tutorial for more details) for training. Let's download it too."
"We also need some additional files (see `FastPitch_MixerTTS_Training.ipynb` tutorial for more details) for training. Let's download it too."
]
},
{
@@ -290,7 +290,7 @@
"source": [
"Let's take a closer look at the training command:\n",
"\n",
"* `python fastpitch_finetune.py --config-name=fastpitch_align_v1.05.yaml`\n",
"* `--config-name=fastpitch_align_v1.05.yaml`\n",
" * --config-name tells the script what config to use.\n",
"\n",
"* `train_dataset=./6097_manifest_train_dur_5_mins_local.json \n",
@@ -321,7 +321,7 @@
"\n",
"* `model.pitch_mean=121.9 model.pitch_std=23.1 model.pitch_fmin=30 model.pitch_fmax=512`\n",
" * For the new speaker, we need to define new pitch hyperparameters for better audio quality.\n",
" * These parameters work for speaker 6097 from the HiFiTTS dataset.\n",
" * These parameters work for speaker 6097 from the Hi-Fi TTS dataset.\n",
" * For speaker 92, we suggest `model.pitch_mean=214.5 model.pitch_std=30.9 model.pitch_fmin=80 model.pitch_fmax=512`.\n",
" * fmin and fmax are hyperparameters to librosa's pyin function. We recommend tweaking these per speaker.\n",
" * After fmin and fmax are defined, pitch mean and std can be easily extracted.\n",
@@ -384,25 +384,24 @@
},
"outputs": [],
"source": [
"def infer(spec_gen_model, vocoder_model, str_input, speaker = None):\n",
"def infer(spec_gen_model, vocoder_model, str_input, speaker=None):\n",
" \"\"\"\n",
" Synthesizes spectrogram and audio from a text string given a spectrogram synthesis and vocoder model.\n",
" \n",
" Arguments:\n",
" spec_gen_model -- Instance of FastPitch model\n",
" vocoder_model -- Instance of a vocoder model (HiFiGAN in our case)\n",
" str_input -- Text input for the synthesis\n",
" speaker -- Speaker number (in the case of a multi-speaker model -- in the mixing case)\n",
" Args:\n",
" spec_gen_model: Spectrogram generator model (FastPitch in our case)\n",
" vocoder_model: Vocoder model (HiFiGAN in our case)\n",
" str_input: Text input for the synthesis\n",
" speaker: Speaker ID\n",
" \n",
" Returns:\n",
" spectrogram, waveform of the synthesized audio.\n",
" spectrogram and waveform of the synthesized audio.\n",
" \"\"\"\n",
" parser_model = spec_gen_model\n",
" with torch.no_grad():\n",
" parsed = parser_model.parse(str_input)\n",
" parsed = spec_gen_model.parse(str_input)\n",
" if speaker is not None:\n",
" speaker = torch.tensor([speaker]).long().cuda()\n",
" spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker = speaker)\n",
" speaker = torch.tensor([speaker]).long().to(device=spec_gen_model.device)\n",
" spectrogram = spec_gen_model.generate_spectrogram(tokens=parsed, speaker=speaker)\n",
" audio = vocoder_model.convert_spectrogram_to_audio(spec=spectrogram)\n",
" \n",
" if spectrogram is not None:\n",
@@ -414,37 +413,29 @@
" audio = audio.to('cpu').numpy()\n",
" return spectrogram, audio\n",
"\n",
"def get_best_ckpt(experiment_base_dir, new_speaker_id, duration_mins, mixing_enabled, original_speaker_id):\n",
" \"\"\"\n",
" Gives the model checkpoint paths of an experiment we ran. \n",
"def get_best_ckpt_from_last_run(\n",
" base_dir, \n",
" new_speaker_id, \n",
" duration_mins, \n",
" mixing_enabled, \n",
" original_speaker_id, \n",
" model_name=\"FastPitch\"\n",
" ): \n",
" mixing = \"no_mixing\" if not mixing_enabled else \"mixing\"\n",
" \n",
" Arguments:\n",
" experiment_base_dir -- Base experiment directory (specified on top of this notebook as exp_base_dir)\n",
" new_speaker_id -- Speaker id of new HiFiTTS speaker we finetuned FastPitch on\n",
" duration_mins -- total minutes of the new speaker data\n",
" mixing_enabled -- True or False depending on whether we want to mix the original speaker data or not\n",
" original_speaker_id -- speaker id of the original HiFiTTS speaker\n",
" d = f\"{original_speaker_id}_to_{new_speaker_id}_{mixing}_{duration_mins}_mins\"\n",
" \n",
" Returns:\n",
" List of all checkpoint paths sorted by validation error, Last checkpoint path\n",
" \"\"\"\n",
" if not mixing_enabled:\n",
" exp_dir = \"{}/{}_to_{}_no_mixing_{}_mins\".format(experiment_base_dir, original_speaker_id, new_speaker_id, duration_mins)\n",
" else:\n",
" exp_dir = \"{}/{}_to_{}_mixing_{}_mins\".format(experiment_base_dir, original_speaker_id, new_speaker_id, duration_mins)\n",
" exp_dirs = list([i for i in (Path(base_dir) / d / model_name).iterdir() if i.is_dir()])\n",
" last_exp_dir = sorted(exp_dirs)[-1]\n",
" \n",
" last_checkpoint_dir = last_exp_dir / \"checkpoints\"\n",
" \n",
" ckpt_candidates = []\n",
" last_ckpt = None\n",
" for root, dirs, files in os.walk(exp_dir):\n",
" for file in files:\n",
" if file.endswith(\".ckpt\"):\n",
" val_error = float(file.split(\"v_loss=\")[1].split(\"-epoch\")[0])\n",
" if \"last\" in file:\n",
" last_ckpt = os.path.join(root, file)\n",
" ckpt_candidates.append( (val_error, os.path.join(root, file)))\n",
" ckpt_candidates.sort()\n",
" last_ckpt = list(last_checkpoint_dir.glob('*-last.ckpt'))\n",
"\n",
" if len(last_ckpt) == 0:\n",
" raise ValueError(f\"There is no last checkpoint in {last_checkpoint_dir}.\")\n",
" \n",
" return ckpt_candidates, last_ckpt"
" return str(last_ckpt[0])"
]
},
{
@@ -454,7 +445,7 @@
"id": "0153bd5a"
},
"source": [
"Specify the speaker id, duration mins and mixing variable to find the relevant checkpoint from the exp_base_dir and compare the synthesized audio with validation samples of the new speaker."
"Specify the speaker id, duration mins and mixing variable to find the relevant checkpoint and compare the synthesized audio with validation samples of the new speaker."
]
},
{
@@ -472,34 +463,32 @@
"mixing = False\n",
"original_speaker_id = \"ljspeech\"\n",
"\n",
"_ ,last_ckpt = get_best_ckpt(\"./\", new_speaker_id, duration_mins, mixing, original_speaker_id)\n",
"last_ckpt = get_best_ckpt_from_last_run(\"./\", new_speaker_id, duration_mins, mixing, original_speaker_id)\n",
"print(last_ckpt)\n",
"\n",
"spec_model = FastPitchModel.load_from_checkpoint(last_ckpt)\n",
"spec_model.eval().cuda()\n",
"_speaker=None\n",
"\n",
"speaker_id = None\n",
"if mixing:\n",
" _speaker = 1\n",
" speaker_id = 1\n",
"\n",
"num_val = 2\n",
"\n",
"manifest_path = os.path.join(\"./\", \"{}_manifest_dev_ns_all_local.json\".format(new_speaker_id))\n",
"val_records = []\n",
"with open(manifest_path, \"r\") as f:\n",
"with open(f\"{new_speaker_id}_manifest_dev_ns_all_local.json\", \"r\") as f:\n",
" for i, line in enumerate(f):\n",
" val_records.append( json.loads(line) )\n",
" val_records.append(json.loads(line))\n",
" if len(val_records) >= num_val:\n",
" break\n",
" \n",
"for val_record in val_records:\n",
" print (\"Real validation audio\")\n",
" print(\"Real validation audio\")\n",
" ipd.display(ipd.Audio(val_record['audio_filepath'], rate=22050))\n",
" print (\"SYNTHESIZED FOR -- Speaker: {} | Dataset size: {} mins | Mixing:{} | Text: {}\".format(new_speaker_id, duration_mins, mixing, val_record['text']))\n",
" spec, audio = infer(spec_model, vocoder, val_record['text'], speaker = _speaker)\n",
" print(f\"SYNTHESIZED FOR -- Speaker: {new_speaker_id} | Dataset size: {duration_mins} mins | Mixing:{mixing} | Text: {val_record['text']}\")\n",
" spec, audio = infer(spec_model, vocoder, val_record['text'], speaker=speaker_id)\n",
" ipd.display(ipd.Audio(audio, rate=22050))\n",
" %matplotlib inline\n",
" #if spec is not None:\n",
" imshow(spec, origin=\"lower\", aspect = \"auto\")\n",
" imshow(spec, origin=\"lower\", aspect=\"auto\")\n",
" plt.show()"
]
},
@@ -595,8 +584,8 @@
"\n",
"`python examples/tts/hifigan_finetune.py --config-name=hifigan.yaml model.train_ds.dataloader_params.batch_size=32 model.max_steps=1000 ~model.optim.sched model.optim.lr=0.0001 train_dataset=./hifigan_train_ft.json validation_datasets=./hifigan_val_ft.json exp_manager.exp_dir=hifigan_ft +init_from_nemo_model=tts_hifigan.nemo trainer.check_val_every_n_epoch=10 model/train_ds=train_ds_finetune model/validation_ds=val_ds_finetune`\n",
"\n",
"### Improving TTS by Adding More Data\n",
"We can add more data in two ways. they can be combined for the best effect:\n",
"### Adding more data\n",
"We can add more data in two ways. They can be combined for the best effect:\n",
"\n",
"* Add more training data from the new speaker\n",
"\n",