[TTS] Update TTS tutorials, Simplification of testing Mixer-TTS and FastPitch (#3680)

* update notebooks
* small fix in FastPitch_Finetuning.ipynb
* update notebooks
* fix in Inference_ModelSelect.ipynb
* fix librosa
* fix style
* update jenkinsfile, remove unnecessary line in fastpitch

Signed-off-by: Oktai Tatanov <[email protected]>
NVIDIA · Feb 16, 2022 · 7231aca
1 parent dc2ae7f
Showing 9 changed files with 170 additions and 99 deletions.
diff --git a/Jenkinsfile b/Jenkinsfile
@@ -2304,7 +2304,9 @@ pipeline {
             model.input_fft.n_layer=2 \
             model.output_fft.d_inner=384 \
             model.output_fft.n_layer=2 \
-            ~trainer.check_val_every_n_epoch'
+            ~trainer.check_val_every_n_epoch \
+            ~model.text_normalizer \
+            ~model.text_normalizer_call_kwargs'
           }
         }
         stage('Mixer-TTS') {
@@ -2320,7 +2322,9 @@ pipeline {
             model.train_ds.dataloader_params.num_workers=1 \
             model.validation_ds.dataloader_params.batch_size=4 \
             model.validation_ds.dataloader_params.num_workers=1 \
-            ~trainer.check_val_every_n_epoch'
+            ~trainer.check_val_every_n_epoch \
+            ~model.text_normalizer \
+            ~model.text_normalizer_call_kwargs'
           }
         }
         stage('Hifigan') {
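
In Hydra's override grammar, a `~` prefix deletes a key from the composed config, so the two added overrides drop `model.text_normalizer` and `model.text_normalizer_call_kwargs` for these CI runs, presumably so the tests do not have to build a text normalizer. A minimal sketch of that effect on a plain nested dict follows; the `delete_key` helper and the sample config values are hypothetical, and this is not how Hydra implements deletion:

```python
from copy import deepcopy


def delete_key(cfg: dict, dotted_key: str) -> dict:
    """Return a copy of cfg with the dotted key removed (mimics a `~key` override)."""
    result = deepcopy(cfg)
    *parents, leaf = dotted_key.split(".")
    node = result
    for name in parents:
        node = node[name]
    del node[leaf]  # raises KeyError in this sketch if the key is absent
    return result


# Hypothetical config; only the key names come from the Jenkinsfile hunk.
cfg = {
    "trainer": {"check_val_every_n_epoch": 1, "max_steps": 20},
    "model": {
        "text_normalizer": {"lang": "en"},
        "text_normalizer_call_kwargs": {"verbose": False},
        "n_mel_channels": 80,
    },
}

for key in (
    "trainer.check_val_every_n_epoch",
    "model.text_normalizer",
    "model.text_normalizer_call_kwargs",
):
    cfg = delete_key(cfg, key)

print(cfg)
```

After the three deletions only the untouched keys remain, which matches what the training script would see once the overrides are applied.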

diff --git a/nemo/collections/tts/models/fastpitch.py b/nemo/collections/tts/models/fastpitch.py
@@ -197,8 +197,6 @@ def parser(self):
     def parse(self, str_input: str, normalize=True) -> torch.tensor:
         if self.training:
             logging.warning("parse() is meant to be called in eval mode.")
-        if str_input[-1] not in [".", "!", "?"]:
-            str_input = str_input + "."
 
         if normalize and self.text_normalizer_call is not None:
             str_input = self.text_normalizer_call(str_input, **self.text_normalizer_call_kwargs)
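
With this hunk, `parse()` no longer appends a period to input that lacks terminal punctuation. Callers that relied on the old behavior can apply it themselves before calling `parse()`. A minimal sketch, assuming a hypothetical helper named `ensure_terminal_punctuation` that is not part of NeMo; unlike the removed lines, it also tolerates an empty string, for which `str_input[-1]` would raise `IndexError`:

```python
def ensure_terminal_punctuation(text: str, marks: str = ".!?") -> str:
    """Append a period when text does not already end with '.', '!' or '?'."""
    if not text:
        # The removed code indexed text[-1] directly, which fails on "".
        return text
    if text[-1] not in marks:
        return text + "."
    return text


print(ensure_terminal_punctuation("Hello world"))
print(ensure_terminal_punctuation("Is this on?"))
```

A call site would then look like `model.parse(ensure_terminal_punctuation(raw_text))`, where `model` stands for any loaded FastPitch instance.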

diff --git a/nemo/collections/tts/torch/data.py b/nemo/collections/tts/torch/data.py
@@ -736,7 +736,7 @@ def __init__(
             json. Each line should contain the following:
                 "audio_filepath": <PATH_TO_WAV>,
                 "duration": <Duration of audio clip in seconds> (Optional),
-                "mel_filepath": <PATH_TO_LOG_MEL_PT> (Optional)
+                "mel_filepath": <PATH_TO_LOG_MEL> (Optional, can be in .npy (numpy.save) or .pt (torch.save) format)
             sample_rate (int): The sample rate of the audio. Or the sample rate that we will resample all files to.
             n_segments (int): The length of audio in samples to load. For example, given a sample rate of 16kHz, and
                 n_segments=16000, a random 1 second section of audio from the clip will be loaded. The section will
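
The updated docstring says `mel_filepath` may point to either a `.npy` file written with `numpy.save` or a `.pt` file written with `torch.save`. One way to handle both is to dispatch on the file suffix. The sketch below is an illustration under that assumption, not the dataset's actual loading code; `mel_format` and `load_mel` are hypothetical names:

```python
from pathlib import Path


def mel_format(mel_filepath: str) -> str:
    """Classify a mel-spectrogram file by its suffix."""
    suffix = Path(mel_filepath).suffix.lower()
    if suffix == ".npy":
        return "numpy"
    if suffix == ".pt":
        return "torch"
    raise ValueError(f"Unsupported mel file extension: {suffix!r}")


def load_mel(mel_filepath: str):
    """Load a saved mel spectrogram with the library that matches its format."""
    fmt = mel_format(mel_filepath)
    if fmt == "numpy":
        import numpy as np  # imported lazily so the sketch runs without NumPy

        return np.load(mel_filepath)  # returns a numpy.ndarray
    import torch  # imported lazily so the sketch runs without PyTorch

    return torch.load(mel_filepath)  # returns whatever object was saved


print(mel_format("mels/LJ001-0001.npy"))
print(mel_format("mels/LJ001-0001.pt"))
```

Note that the two branches return different types, so downstream code that expects a tensor would still need to convert the NumPy array.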

diff --git a/tutorials/tts/FastPitch_Finetuning.ipynb b/tutorials/tts/FastPitch_Finetuning.ipynb
diff --git a/tutorials/tts/FastPitch_MixerTTS_Training.ipynb b/tutorials/tts/FastPitch_MixerTTS_Training.ipynb
@@ -50,10 +50,10 @@
     "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
     "4. Run this cell to set up dependencies# .\n",
     "\"\"\"\n",
+    "BRANCH = 'main'\n",
     "# # If you're using Colab and not running locally, uncomment and run this cell.\n",
     "# !apt-get install sox libsndfile1 ffmpeg\n",
     "# !pip install wget unidecode\n",
-    "# BRANCH = 'main'\n",
     "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
    ]
   },
@@ -91,7 +91,7 @@
     "\n",
     "FastPitch is non-autoregressive model for mel-spectrogram generation based on FastSpeech, conditioned on fundamental frequency contours. For more details about model, please refer to the original [paper](https://arxiv.org/abs/2006.06873). NeMo re-implementation of FastPitch additionally uses unsupervised speech-text [aligner](https://arxiv.org/abs/2108.10447) which was originally implemented in [FastPitch 1.1](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch).\n",
     "\n",
-    "### MixerTTS\n",
+    "### Mixer-TTS\n",
     "\n",
     "Mixer-TTS is another non-autoregressive model for mel-spectrogram generation. It is structurally similar to FastPitch: duration prediction, pitch prediction, unsupervised TTS alignment framework, but the main difference is that Mixer-TTS is based on the [MLP-Mixer](https://arxiv.org/abs/2105.01601) architecture adapted for speech synthesis.\n",
     "\n",
@@ -226,9 +226,9 @@
     "\n",
     "# additional files\n",
     "!mkdir -p tts_dataset_files && cd tts_dataset_files \\\n",
-    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/cmudict-0.7b_nv22.01 \\\n",
-    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tts_dataset_files/heteronyms-030921 \\\n",
-    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/nemo_text_processing/text_normalization/en/data/whitelist_lj_speech.tsv \\\n",
+    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/cmudict-0.7b_nv22.01 \\\n",
+    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tts_dataset_files/heteronyms-030921 \\\n",
+    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/nemo_text_processing/text_normalization/en/data/whitelist_lj_speech.tsv \\\n",
     "&& cd .."
    ]
   },
@@ -251,10 +251,10 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/tts/fastpitch.py\n",
+    "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/fastpitch.py\n",
     "\n",
     "!mkdir -p conf && cd conf \\\n",
-    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/tts/conf/fastpitch_align_v1.05.yaml \\\n",
+    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/fastpitch_align_v1.05.yaml \\\n",
     "&& cd .."
    ]
   },
@@ -392,10 +392,10 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/tts/mixer_tts.py\n",
+    "!wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/mixer_tts.py\n",
     "\n",
     "!mkdir -p conf && cd conf \\\n",
-    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/examples/tts/conf/mixer-tts.yaml \\\n",
+    "&& wget https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/tts/conf/mixer-tts.yaml \\\n",
     "&& cd .."
    ]
   },
@@ -533,7 +533,7 @@
    "id": "2d9745fc",
    "metadata": {},
    "source": [
-    "### MixerTTS\n",
+    "### Mixer-TTS\n",
     "\n",
     "Now we are ready for training our model! Let's try to train Mixer-TTS.\n",
     "\n",
@@ -601,7 +601,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.12"
+   "version": "3.8.6"
   }
  },
  "nbformat": 4,
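
In IPython, `$BRANCH` inside a `!` shell line is expanded from the Python variable of the same name, so `BRANCH` must be assigned in an executed line before any of the `wget` cells run. Moving the assignment out of the commented-out Colab block is what makes the new `$BRANCH` URLs work for local runs as well. A minimal sketch of the URLs those cells end up requesting; `raw_url` is a hypothetical helper, and the paths are taken from the hunks above:

```python
BRANCH = "main"


def raw_url(repo_path: str, branch: str = BRANCH) -> str:
    """Build the raw.githubusercontent.com URL for a file in the NeMo repo."""
    return f"https://raw.githubusercontent.com/NVIDIA/NeMo/{branch}/{repo_path}"


paths = [
    "examples/tts/fastpitch.py",
    "examples/tts/conf/fastpitch_align_v1.05.yaml",
    "examples/tts/mixer_tts.py",
    "examples/tts/conf/mixer-tts.yaml",
]

for path in paths:
    print(raw_url(path))
```

Changing `BRANCH` in one place then redirects every download to the matching release branch instead of `main`.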

diff --git a/tutorials/tts/Inference_DurationPitchControl.ipynb b/tutorials/tts/Inference_DurationPitchControl.ipynb
@@ -46,11 +46,11 @@
     "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
     "4. Run this cell to set up dependencies.\n",
     "\"\"\"\n",
+    "BRANCH = 'main'\n",
     "# # If you're using Google Colab and not running locally, uncomment and run this cell.\n",
     "# !apt-get install sox libsndfile1 ffmpeg\n",
     "# !pip install wget unidecode\n",
-    "# BRANCH = 'main'\n",
-    "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]"
+    "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
    ]
   },
   {
@@ -504,7 +504,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.10"
+   "version": "3.8.6"
   }
  },
  "nbformat": 4,

diff --git a/tutorials/tts/Inference_ModelSelect.ipynb b/tutorials/tts/Inference_ModelSelect.ipynb
@@ -46,11 +46,11 @@
     "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
     "4. Run this cell to set up dependencies.\n",
     "\"\"\"\n",
+    "BRANCH = 'main'\n",
     "# # If you're using Google Colab and not running locally, uncomment and run this cell.\n",
     "# !apt-get install sox libsndfile1 ffmpeg\n",
     "# !pip install wget unidecode\n",
-    "# BRANCH = 'main'\n",
-    "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]"
+    "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
    ]
   },
   {
@@ -410,4 +410,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 4
-}
+}
diff --git a/tutorials/tts/Tacotron2_Training.ipynb b/tutorials/tts/Tacotron2_Training.ipynb
@@ -58,7 +58,7 @@
     "# # If you're using Colab and not running locally, uncomment and run this cell.\n",
     "# !apt-get install sox libsndfile1 ffmpeg\n",
     "# !pip install wget unidecode\n",
-    "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[tts]"
+    "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
    ]
   },
   {
@@ -316,7 +316,7 @@
    "provenance": []
   },
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -330,7 +330,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.5"
+   "version": "3.8.6"
   }
  },
  "nbformat": 4,

diff --git a/tutorials/tts/TalkNet_Training.ipynb b/tutorials/tts/TalkNet_Training.ipynb
@@ -50,10 +50,10 @@
     "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n",
     "4. Run this cell to set up dependencies# .\n",
     "\"\"\"\n",
+    "BRANCH = 'main'\n",
     "# # If you're using Colab and not running locally, uncomment and run this cell.\n",
     "# !apt-get install sox libsndfile1 ffmpeg\n",
     "# !pip install wget unidecode pysptk\n",
-    "# BRANCH = 'main'\n",
     "# !python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]"
    ]
   },
@@ -496,7 +496,7 @@
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3",
+   "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
   },
@@ -510,7 +510,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.11"
+   "version": "3.8.6"
   }
  },
  "nbformat": 4,