Better NaN/inf loss handling for O0 (skip step across workers) #637

redoctopus · 2020-05-14T17:19:55Z

actions.py now synchronizes across workers to check for a NaN/inf loss. If any worker sees one, it terminates all workers when stop_on_nan_loss is set. Otherwise, it ignores the bad loss and lets apex handle it when the amp optimization level is O1 or higher, and skips the step on all workers when the level is O0.
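The three-way behavior described above can be sketched as follows. This is a minimal illustration, not the code in actions.py: the names `Action`, `local_bad_flag`, `reduce_max`, and `decide` are hypothetical, and the cross-worker sync is simulated with a plain `max` over a list so the example runs in a single process. In an actual PyTorch job the reduction would presumably be a call such as `torch.distributed.all_reduce(flag_tensor, op=torch.distributed.ReduceOp.MAX)`, in line with the later commit in this PR that switches the check to MAX.

```python
import math
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the outcomes described in the PR."""
    STEP = "step"                  # loss is finite everywhere: normal step
    TERMINATE = "terminate"        # stop_on_nan_loss is set: stop all workers
    DEFER_TO_AMP = "defer_to_amp"  # O1 or higher: let apex deal with it
    SKIP_STEP = "skip_step"        # O0: skip the step on every worker


def local_bad_flag(loss: float) -> int:
    """Return 1 if this worker's loss is NaN or inf, else 0."""
    return 0 if math.isfinite(loss) else 1


def reduce_max(flags):
    """Single-process stand-in for an all_reduce with the MAX op.

    After a MAX reduction every worker holds 1 if any worker raised
    its flag, so all workers reach the same decision.
    """
    return max(flags)


def decide(per_worker_losses, stop_on_nan_loss: bool, amp_opt_level: str) -> Action:
    """Pick the action all workers take for this step."""
    any_bad = reduce_max([local_bad_flag(loss) for loss in per_worker_losses])
    if not any_bad:
        return Action.STEP
    if stop_on_nan_loss:
        return Action.TERMINATE
    if amp_opt_level in ("O1", "O2", "O3"):
        return Action.DEFER_TO_AMP
    return Action.SKIP_STEP


if __name__ == "__main__":
    # One of two workers reports an inf loss at optimization level O0.
    print(decide([0.42, float("inf")], stop_on_nan_loss=False, amp_opt_level="O0"))
```

Reducing a 0/1 flag rather than the loss itself keeps the check independent of how NaN and inf values combine arithmetically across workers.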

Signed-off-by: Jocelyn Huang <[email protected]>

redoctopus added 2 commits May 14, 2020 10:13

Better NaN/inf loss handling for O0 (skip step across workers)

3215a4c

Signed-off-by: Jocelyn Huang <[email protected]>

Add entry to changelog

ecfbcde

Signed-off-by: Jocelyn Huang <[email protected]>

redoctopus requested a review from okuchaiev May 15, 2020 17:04

blisc previously requested changes May 15, 2020

nemo/backends/pytorch/actions.py (outdated, resolved)

Change NaN/inf all_reduce check to use MAX instead of default SUM

0fe97da

Signed-off-by: Jocelyn Huang <[email protected]>

okuchaiev approved these changes May 16, 2020

redoctopus merged commit 97679e8 into NVIDIA:master May 18, 2020

redoctopus deleted the nan_inf branch May 18, 2020 18:07

redoctopus mentioned this pull request May 18, 2020

Loss is NaN or inf when finetuning Quartznet model #613

Closed

caohongnga mentioned this pull request Dec 3, 2020

[Question] how to sold Jasper ASR Loss: inf or Gradient overflow issue? #1513

Closed

dcurran90 pushed a commit to dcurran90/NeMo that referenced this pull request Oct 15, 2024

Add Containerfile to custom dictionary (NVIDIA#637)

70ea8e0

Signed-off-by: Maciej Szulik <[email protected]>
