[Fix] Hanging for Fully Randomized Bucketing (#4348) · titu1994/NeMo@4c11d61

[Fix] Hanging for Fully Randomized Bucketing (NVIDIA#4348)

* Update container to 22.05 (NVIDIA#4329)

* update container to 22.05

Signed-off-by: ericharper <[email protected]>

* try adding safe directory

Signed-off-by: ericharper <[email protected]>

* try env var

Signed-off-by: ericharper <[email protected]>

* printenv

Signed-off-by: ericharper <[email protected]>

* try GIT_BRANCH

Signed-off-by: ericharper <[email protected]>

* typo

Signed-off-by: ericharper <[email protected]>

* remove debug statements

Signed-off-by: ericharper <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>

* Merge r1.9.0 main (NVIDIA#4331)

* update branch

Signed-off-by: ericharper <[email protected]>

* update package info

Signed-off-by: ericharper <[email protected]>

* cleaned up TN/ ITN doc (NVIDIA#4119)

* cleaned up TN/ ITN doc

Signed-off-by: Yang Zhang <[email protected]>

* fix typo

Signed-off-by: Yang Zhang <[email protected]>

* fix image

Signed-off-by: Yang Zhang <[email protected]>

* fix image

Signed-off-by: Yang Zhang <[email protected]>

* Draft: Fix restoring from checkpoint for case when `model.common_dataset_parameters.label_vocab_dir` is provided (NVIDIA#4136)

* Fix restoring from checkpoint with label vocab dir

Signed-off-by: PeganovAnton <[email protected]>

* Add tests for various ways to pass label ids to model

Signed-off-by: PeganovAnton <[email protected]>

* Fix typo

Signed-off-by: PeganovAnton <[email protected]>

* Fix typo

Signed-off-by: PeganovAnton <[email protected]>

* Do not create tmp directory

Signed-off-by: PeganovAnton <[email protected]>

* Fix parameter name

Signed-off-by: PeganovAnton <[email protected]>

* finish cherry-pick op

Signed-off-by: PeganovAnton <[email protected]>

* Fix labels errors

Signed-off-by: PeganovAnton <[email protected]>

* Remove duplicate stage

Signed-off-by: PeganovAnton <[email protected]>

* Change target branch

Signed-off-by: PeganovAnton <[email protected]>

* fix doc (NVIDIA#4146)

Signed-off-by: Yang Zhang <[email protected]>

* Tacotron2 retrain (NVIDIA#4103)

* fix yaml

Signed-off-by: treacker <[email protected]>

* Fix for new TTSDataset class

Signed-off-by: treacker <[email protected]>

* added wandb logging

Signed-off-by: treacker <[email protected]>

* added wandb logging

Signed-off-by: treacker <[email protected]>

* fix numpy version

Signed-off-by: treacker <[email protected]>

* fix numpy version

Signed-off-by: treacker <[email protected]>

* inference fix

Signed-off-by: treacker <[email protected]>

* removed old code

Signed-off-by: treacker <[email protected]>

* updated parser logic

Signed-off-by: treacker <[email protected]>

* reverted version update

Signed-off-by: treacker <[email protected]>

* refactored parser logic

Signed-off-by: treacker <[email protected]>

* Updated Jenkinsfile

Signed-off-by: treacker <[email protected]>

* Refactored tutorial for Tacotron2

Signed-off-by: treacker <[email protected]>

* Made backward compatibility

Signed-off-by: treacker <[email protected]>

* Made backward compatibility

Signed-off-by: treacker <[email protected]>

* Update Jenkinsfile

Signed-off-by: treacker <[email protected]>

* Update tacotron.yaml

Signed-off-by: treacker <[email protected]>

* Refactoring

Signed-off-by: treacker <[email protected]>

* cleaned up TN/ ITN doc (NVIDIA#4119)

* cleaned up TN/ ITN doc

Signed-off-by: Yang Zhang <[email protected]>

* fix typo

Signed-off-by: Yang Zhang <[email protected]>

* fix image

Signed-off-by: Yang Zhang <[email protected]>

* fix image

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: treacker <[email protected]>

* Check implicit grad acc in GLUE dataset building (NVIDIA#4123)

* Check implicit grad acc in GLUE dataset building

Signed-off-by: MaximumEntropy <[email protected]>

* Fix jenkins test for GLUE/XNLI

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: treacker <[email protected]>

* Refactoring

Signed-off-by: treacker <[email protected]>

* Refactoring

Signed-off-by: treacker <[email protected]>

* Fixed jenkins

Signed-off-by: treacker <[email protected]>

* Refactoring

Signed-off-by: treacker <[email protected]>

* Refactoring

Signed-off-by: treacker <[email protected]>

* Refactoring

Signed-off-by: treacker <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

* Multiprocess improvements (NVIDIA#4127)

* initial commit

Signed-off-by: nithinraok <[email protected]>

* start fix

Signed-off-by: nithinraok <[email protected]>

* improve multiprocessing speed while creating speaker dataset

Signed-off-by: nithinraok <[email protected]>

* updated scp to filelist

Signed-off-by: nithinraok <[email protected]>

* notebooks' link, typo and import  fix  (NVIDIA#4158)

* redo missing pr 4007

Signed-off-by: fayejf <[email protected]>

* remove extremely unreliable links

Signed-off-by: fayejf <[email protected]>

* update speaker docs (NVIDIA#4164)

* update speaker docs

Signed-off-by: nithinraok <[email protected]>

* chunks -> segments

Signed-off-by: nithinraok <[email protected]>

* Khz -> kHz

Signed-off-by: nithinraok <[email protected]>

* small fix (NVIDIA#4180)

Signed-off-by: fayejf <[email protected]>

* fix the server key value problem (NVIDIA#4196)

Signed-off-by: Yi Dong <[email protected]>

* Fix/punctuation/trainer required for setting test data (NVIDIA#4199)

* Draft of fix

Signed-off-by: PeganovAnton <[email protected]>

* Add warnings and replace global_step with current_epoch

Signed-off-by: PeganovAnton <[email protected]>

* Small improvements to warnings

Signed-off-by: PeganovAnton <[email protected]>

* Error and warning messages improvements

Signed-off-by: PeganovAnton <[email protected]>

* Replace self.trainer with self._trainer

Signed-off-by: PeganovAnton <[email protected]>

* Update ContextNet version (NVIDIA#4207)

Signed-off-by: smajumdar <[email protected]>

* fix bugs for dialogue tutorial (NVIDIA#4211)

Signed-off-by: Zhilin Wang <[email protected]>

* Dialogue tutorial fix (NVIDIA#4214)

* fix bugs for dialogue tutorial

Signed-off-by: Zhilin Wang <[email protected]>

* update path for convert_datasets.py due to conflict PR

Signed-off-by: Zhilin Wang <[email protected]>

* Add docs for Thutmose Tagger (NVIDIA#4173)

* Add docs for Thutmose Tagger

Signed-off-by: Alexandra Antonova <[email protected]>

* add level in docs

Signed-off-by: Alexandra Antonova <[email protected]>

* delete folder to avoid error with running when folder exists from previous run

Signed-off-by: Alexandra Antonova <[email protected]>

Co-authored-by: Alexandra Antonova <[email protected]>
Co-authored-by: ekmb <[email protected]>

* Dialogue tutorial fix (NVIDIA#4218)

* fix bugs for dialogue tutorial

Signed-off-by: Zhilin Wang <[email protected]>

* update path for convert_datasets.py due to conflict PR

Signed-off-by: Zhilin Wang <[email protected]>

* restore previously deleted files

Signed-off-by: Zhilin Wang <[email protected]>

* style fix

Signed-off-by: Zhilin Wang <[email protected]>

* Dialogue tutorial fix (NVIDIA#4221)

* fix bugs for dialogue tutorial

Signed-off-by: Zhilin Wang <[email protected]>

* update path for convert_datasets.py due to conflict PR

Signed-off-by: Zhilin Wang <[email protected]>

* restore previously deleted files

Signed-off-by: Zhilin Wang <[email protected]>

* style fix

Signed-off-by: Zhilin Wang <[email protected]>

* update tutorial

Signed-off-by: Zhilin Wang <[email protected]>

* fix syntax error in ipynb-file (NVIDIA#4228)

Signed-off-by: Alexandra Antonova <[email protected]>

Co-authored-by: Alexandra Antonova <[email protected]>

* fix json serialize (NVIDIA#4235)

Signed-off-by: Yi Dong <[email protected]>

* Prompt Learning Typo Fixes (NVIDIA#4238)

* Prompt tuning notebook typo fixes

Signed-off-by: Virginia Adams <[email protected]>

* Update tutorials.rst

* Update prompt_learning.rst

* Update prompt_learning.rst

* fixing bug 3642622 (NVIDIA#4250)

* fixing bug 3642622

Signed-off-by: Ghasem Pasandi <[email protected]>

* fixing bug 3642622

Signed-off-by: Ghasem Pasandi <[email protected]>

Co-authored-by: Ghasem Pasandi <[email protected]>

* fix broken link in the tutorial (NVIDIA#4257)

Signed-off-by: Alexandra Antonova <[email protected]>

Co-authored-by: Alexandra Antonova <[email protected]>

* Typo fix, branch change, better download message (NVIDIA#4262)

Signed-off-by: Virginia Adams <[email protected]>

* Raise error if bicleaner is not installed in NMT Data preprocessing notebook (NVIDIA#4264)

* Raise error if bicleaner is not installed

Signed-off-by: MaximumEntropy <[email protected]>

* Clear cells

Signed-off-by: MaximumEntropy <[email protected]>

* Fix missing validation dataset, whitelist certain keywords for datasets (NVIDIA#4269)

* Fix missing validation dataset, whitelist certain keywords for datasets

Signed-off-by: smajumdar <[email protected]>

* Fix missing validation dataset, whitelist certain keywords for datasets

Signed-off-by: smajumdar <[email protected]>

* Update asr configs with num_workers and pin_memory (NVIDIA#4270)

Signed-off-by: smajumdar <[email protected]>

* Fix epoch end (NVIDIA#4265)

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Eric Harper <[email protected]>

* Set Save on train end to false (NVIDIA#4274)

* Set Save on train end to false

Signed-off-by: Virginia Adams <[email protected]>

* Update prompt_learning.rst

* Update prompt_learning.rst

* Update YAML (NVIDIA#4261)

Signed-off-by: MaximumEntropy <[email protected]>

* Updated config to fix CI test OOM error (NVIDIA#4279)

* Updated config to fix CI test issue

Signed-off-by: Virginia Adams <[email protected]>

* Increased num workers

Signed-off-by: Virginia Adams <[email protected]>

* verbose k2 install, skip if failed (NVIDIA#4289)

Signed-off-by: Aleksandr Laptev <[email protected]>

Co-authored-by: Aleksandr Laptev <[email protected]>

* Changed total virtual prompt tokens (NVIDIA#4295)

* Changed total virtual prompt tokens

Signed-off-by: Virginia Adams <[email protected]>

* put number of workers back

Signed-off-by: Virginia Adams <[email protected]>

* upper bound lightning

Signed-off-by: ericharper <[email protected]>

* update branch

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* remove duplicate test

Signed-off-by: ericharper <[email protected]>

* fix tn test cases

Signed-off-by: ericharper <[email protected]>

* add another safe.directory

Signed-off-by: ericharper <[email protected]>

* typo

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: PeganovAnton <[email protected]>
Co-authored-by: treacker <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: fayejf <[email protected]>
Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Zhilin Wang <[email protected]>
Co-authored-by: bene-ges <[email protected]>
Co-authored-by: Alexandra Antonova <[email protected]>
Co-authored-by: ekmb <[email protected]>
Co-authored-by: Virginia Adams <[email protected]>
Co-authored-by: Ghasem <[email protected]>
Co-authored-by: Ghasem Pasandi <[email protected]>
Co-authored-by: Aleksandr Laptev <[email protected]>
Co-authored-by: Aleksandr Laptev <[email protected]>
Signed-off-by: stevehuang52 <[email protected]>

* fix full_randn bucket hang

Signed-off-by: stevehuang52 <[email protected]>

* remove unused variables

Signed-off-by: stevehuang52 <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Yang Zhang <[email protected]>
Co-authored-by: PeganovAnton <[email protected]>
Co-authored-by: treacker <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: fayejf <[email protected]>
Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: Zhilin Wang <[email protected]>
Co-authored-by: bene-ges <[email protected]>
Co-authored-by: Alexandra Antonova <[email protected]>
Co-authored-by: ekmb <[email protected]>
Co-authored-by: Virginia Adams <[email protected]>
Co-authored-by: Ghasem <[email protected]>
Co-authored-by: Ghasem Pasandi <[email protected]>
Co-authored-by: Aleksandr Laptev <[email protected]>
Co-authored-by: Aleksandr Laptev <[email protected]>

19 people committed Jun 21, 2022

1 parent a64b469 commit 4c11d61

nemo/collections/asr/modules/rnnt.py

@@ -845,8 +845,6 @@ def forward(
                     )
                 losses = []
-                wer_numer_list = []
-                wer_denom_list = []
                 batch_size = int(encoder_outputs.size(0))  # actual batch size
                 # Iterate over batch using fused_batch_size steps
@@ -914,31 +912,14 @@ def forward(
                     else:
                         losses = None
-                    # Compute WER for sub batch
+                    # Update WER for sub batch
                     if compute_wer:
                         sub_enc = sub_enc.transpose(1, 2)  # [B, T, D] -> [B, D, T]
                         sub_enc = sub_enc.detach()
                         sub_transcripts = sub_transcripts.detach()
-                        original_log_prediction = self.wer.log_prediction
-                        if original_log_prediction and batch_idx == 0:
-                            self.wer.log_prediction = True
-                        else:
-                            self.wer.log_prediction = False
-                        # Compute the wer (with logging for just 1st sub-batch)
+                        # Update WER on each process without syncing
                         self.wer.update(sub_enc, sub_enc_lens, sub_transcripts, sub_transcript_lens)
-                        wer, wer_num, wer_denom = self.wer.compute()
-                        self.wer.reset()
-                        wer_numer_list.append(wer_num)
-                        wer_denom_list.append(wer_denom)
-                        # Reset logging default
-                        self.wer.log_prediction = original_log_prediction
-                    else:
-                        wer = None
                     del sub_enc, sub_transcripts, sub_enc_lens, sub_transcript_lens
@@ -951,12 +932,11 @@ def forward(
                 # Collect sub batch wer results
                 if compute_wer:
-                    wer_num = torch.tensor(wer_numer_list, dtype=torch.long)
-                    wer_denom = torch.tensor(wer_denom_list, dtype=torch.long)
-                    wer_num = wer_num.sum()  # global sum of correct words/chars
-                    wer_denom = wer_denom.sum()  # global sum of all words/chars
+                    # Sync and all_reduce on all processes, compute global WER
+                    wer, wer_num, wer_denom = self.wer.compute()
+                    self.wer.reset()
                 else:
+                    wer = None
                     wer_num = None
                     wer_denom = None
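
The pattern this diff moves to (call `update()` once per sub-batch, then `compute()` and `reset()` once after the loop) can be sketched in plain Python. This is a minimal illustration, not NeMo's WER metric: `RunningWER`, `edit_distance`, and `wer_over_sub_batches` are hypothetical names, and no distributed sync is performed. The diff comments indicate that `compute()` is where the sync and all-reduce happen; the sketch assumes that is why the number of `compute()` calls should not depend on how many sub-batches a process sees, and it counts those calls to make the point visible.

```python
from typing import List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class RunningWER:
    """Hypothetical stand-in for a metric exposing update/compute/reset."""

    def __init__(self) -> None:
        self.errors = 0          # accumulated word errors (numerator)
        self.words = 0           # accumulated reference words (denominator)
        self.compute_calls = 0   # in a real distributed metric, each call would sync

    def update(self, hyps: List[str], refs: List[str]) -> None:
        # Local accumulation only; nothing here would need cross-process communication.
        for hyp, ref in zip(hyps, refs):
            ref_words = ref.split()
            self.errors += edit_distance(ref_words, hyp.split())
            self.words += len(ref_words)

    def compute(self) -> Tuple[float, int, int]:
        self.compute_calls += 1
        wer = self.errors / self.words if self.words else 0.0
        return wer, self.errors, self.words

    def reset(self) -> None:
        self.errors = 0
        self.words = 0


def wer_over_sub_batches(
    metric: RunningWER, hyps: List[str], refs: List[str], fused_batch_size: int
) -> Tuple[float, int, int]:
    """Update per sub-batch, then compute and reset exactly once."""
    for start in range(0, len(hyps), fused_batch_size):
        end = start + fused_batch_size
        metric.update(hyps[start:end], refs[start:end])
    result = metric.compute()  # single call, regardless of the number of sub-batches
    metric.reset()
    return result


hyps = ["a b c", "a x c", "d e", "f"]
refs = ["a b c", "a b c", "d e f", "g"]
metric = RunningWER()
wer, wer_num, wer_denom = wer_over_sub_batches(metric, hyps, refs, fused_batch_size=3)
```

Whatever `fused_batch_size` is used, `compute()` runs once per call of `wer_over_sub_batches`, and the accumulated numerator and denominator match what summing per-sub-batch results would give.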