Merge branch 'main' into t5_lm_adaptation

NVIDIA · Feb 14, 2022 · 9d4c3aa
2 parents 007b9e2 + aeeb0d2
commit 9d4c3aa

Showing 4 changed files with 18 additions and 0 deletions.
diff --git a/README.rst b/README.rst
@@ -49,6 +49,7 @@ Key Features
         * Supports CTC and Transducer/RNNT losses/decoders
         * Beam Search decoding
         * `Language Modelling for ASR <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
+        * Streaming and Buffered ASR (CTC/Transducer) - `Chunked Inference Examples <https://github.com/NVIDIA/NeMo/tree/main/examples/asr/asr_chunked_inference>`_
     * `Speech Classification and Speech Command Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html>`_: MatchboxNet (Command Recognition)
     * `Voice activity Detection (VAD) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html#marblenet-vad>`_: MarbleNet
     * `Speaker Recognition <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/intro.html>`_: TitaNet, ECAPA_TDNN, SpeakerNet

diff --git a/docs/source/starthere/tutorials.rst b/docs/source/starthere/tutorials.rst
@@ -82,6 +82,12 @@ To run a tutorial:
    * - ASR
      - Streaming inference for ASR
      - `Streaming inference <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Streaming_ASR.ipynb>`_
+   * - ASR
+     - Buffered Transducer inference for ASR
+     - `Buffered Transducer inference <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Buffered_Transducer_Inference.ipynb>`_
+   * - ASR
+     - Buffered Transducer inference with LCS Merge Algorithm
+     - `Buffered Transducer inference with LCS Merge <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Buffered_Transducer_Inference_with_LCS_Merge.ipynb>`_
    * - ASR
      - Self-supervised pre-training for ASR
      - `Self-supervised Pre-training for ASR <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/asr/Self_Supervised_Pre_Training.ipynb>`_

diff --git a/examples/asr/asr_chunked_inference/README.md b/examples/asr/asr_chunked_inference/README.md
@@ -0,0 +1,11 @@
+# Streaming / Buffered ASR
+
+Contained within this directory are scripts to perform streaming or buffered inference of audio files using CTC / Transducer ASR models.
+
+## Difference between streaming and buffered ASR
+
+While we primarily showcase the defaults of these models in buffering mode, note that the main differences between streaming ASR and buffered ASR are the chunk size and the total context buffer size.
+
+If you reduce your chunk size, the latency for your first prediction is reduced, and the model appears to predict the text with a shorter delay. However, since each chunk carries less information, the word error rate (WER) is higher.
+
+Conversely, if you increase your chunk size, the delay between the spoken sentence and its transcription increases (this is buffered ASR). Although latency is higher, the transcripts are more accurate because the model has more context to transcribe the text.
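
Note (not part of the commit): the chunk size versus context buffer trade-off described in the new README can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code in the NeMo scripts: `buffered_chunks` and `first_prediction_delay_s` are hypothetical helper names, and the buffer here holds only left (past) context, which may differ from how the actual scripts place the chunk inside the buffer.

```python
from typing import Iterator, List, Tuple


def buffered_chunks(
    samples: List[float], chunk_size: int, buffer_size: int
) -> Iterator[Tuple[int, List[float]]]:
    """Yield (chunk_start, buffer) pairs for a sequence of audio samples.

    Hypothetical helper: each buffer ends with the newest chunk and is
    padded on the left with up to (buffer_size - chunk_size) past samples.
    """
    if chunk_size <= 0 or buffer_size < chunk_size:
        raise ValueError("need 0 < chunk_size <= buffer_size")
    for start in range(0, len(samples), chunk_size):
        end = min(start + chunk_size, len(samples))
        # Reach back into past audio so the model sees more context
        # than the new chunk alone.
        buffer_start = max(0, end - buffer_size)
        yield start, samples[buffer_start:end]


def first_prediction_delay_s(chunk_size: int, sample_rate: int) -> float:
    """Seconds of audio that must arrive before the first chunk is complete."""
    return chunk_size / sample_rate


if __name__ == "__main__":
    audio = [float(i) for i in range(10)]  # stand-in for real samples
    for start, buf in buffered_chunks(audio, chunk_size=4, buffer_size=6):
        print(start, len(buf))
    # Assumed 16 kHz audio: smaller chunks mean a shorter wait.
    print(first_prediction_delay_s(3200, 16000))
    print(first_prediction_delay_s(32000, 16000))
```

In this sketch, shrinking `chunk_size` lowers the wait before the first prediction, while `buffer_size` controls how much surrounding audio accompanies each chunk; the accuracy effect itself depends on the model and is not simulated here.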
diff --git a/...ence/ctc/speech_to_text_buffered_infer.py → .../ctc/speech_to_text_buffered_infer_ctc.py b/...ence/ctc/speech_to_text_buffered_infer.py → .../ctc/speech_to_text_buffered_infer_ctc.py