Better NaN/inf loss handling for O0 (skip step across workers) #637

Merged
redoctopus merged 3 commits into NVIDIA:master from the nan_inf branch on May 18, 2020

Conversation

redoctopus
Collaborator

actions.py now syncs across workers when checking for a NaN/inf loss. If any worker sees a NaN/inf loss, the behavior depends on the configuration: if stop_on_nan_loss is set, all workers terminate; if the amp optimization level is O1 or higher, the check is ignored and apex is left to deal with it; otherwise (O0), the step is skipped on all workers.

Signed-off-by: Jocelyn Huang [email protected]
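
For readers skimming the thread, here is a minimal sketch of the decision logic described above. It is not the code from actions.py. The names `Action`, `any_worker_nonfinite`, and `decide` are hypothetical, and the cross-worker sync is simulated by passing every worker's loss in one list. In a real distributed run that helper would presumably be replaced by a collective such as `torch.distributed.all_reduce` over a per-worker flag.

```python
import math
from enum import Enum
from typing import Iterable


class Action(Enum):
    STEP = "step"            # loss is finite on every worker: take the step
    SKIP = "skip"            # O0 and some worker saw NaN/inf: skip on all workers
    TERMINATE = "terminate"  # stop_on_nan_loss is set: stop all workers
    DEFER = "defer"          # O1 or higher: leave the handling to apex


def any_worker_nonfinite(local_losses: Iterable[float]) -> bool:
    # Hypothetical stand-in for the cross-worker sync: here all workers'
    # losses are visible in one process, so a plain any() is enough.
    return any(not math.isfinite(loss) for loss in local_losses)


def decide(local_losses: Iterable[float],
           stop_on_nan_loss: bool,
           amp_opt_level: str) -> Action:
    # Every worker reaches the same decision because the check is global.
    if not any_worker_nonfinite(local_losses):
        return Action.STEP
    if stop_on_nan_loss:
        return Action.TERMINATE
    if amp_opt_level in ("O1", "O2", "O3"):
        return Action.DEFER
    return Action.SKIP  # O0: skip the step everywhere


if __name__ == "__main__":
    # Worker 1 reports NaN while worker 0 is fine; with O0 both skip.
    print(decide([0.42, float("nan")], stop_on_nan_loss=False, amp_opt_level="O0"))
```

The point of the sketch is that the decision is made on the combined result, so no worker takes a step that the others skip.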

@redoctopus redoctopus requested a review from okuchaiev May 15, 2020 17:04
blisc
blisc previously requested changes May 15, 2020
nemo/backends/pytorch/actions.py (outdated, resolved)
@blisc blisc dismissed their stale review May 18, 2020 18:01

outdated

@redoctopus redoctopus merged commit 97679e8 into NVIDIA:master May 18, 2020
@redoctopus redoctopus deleted the nan_inf branch May 18, 2020 18:07
dcurran90 pushed a commit to dcurran90/NeMo that referenced this pull request Oct 15, 2024
3 participants