Horovod: fixed early stopping and added metrics aggregation #3775
Conversation
can we add a test for this case...
    Return:
        reduced value
    """
    if torch.distributed.is_available() and torch.distributed.is_initialized():
So at first it tries PyTorch distributed, and then Horovod?
At the moment, this component does not have access to the Trainer state that tells us which distributed backend we're using. A better design would be to inject this somehow, but it's not immediately clear how we could do that, since the EvalResult is created by the LightningModule.
So for now, we need to check each framework individually.
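A rough sketch of that framework-by-framework check (illustrative only, not the PR's actual implementation; the reduction op and process-group arguments are omitted for brevity, and the hvd.is_initialized check is the reason the minimum Horovod version is bumped to 0.20.0):

```python
# Minimal sketch of the fallback order described above; names and
# signatures are illustrative, not copied from the PR.

def sync_dist_if_available(value):
    """Reduce `value` across workers if a distributed framework is active."""
    # 1. Try native PyTorch distributed (DDP).
    try:
        import torch.distributed as dist
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(value)
            return value
    except ImportError:
        pass

    # 2. Fall back to Horovod (hvd.is_initialized needs Horovod >= 0.20.0).
    try:
        import horovod.torch as hvd
        if hvd.is_initialized():
            return hvd.allreduce(value)
    except ImportError:
        pass

    # 3. No distributed backend active: return the value unchanged.
    return value
```

In a single-process run neither branch fires, so the value passes through untouched, which keeps non-distributed training unaffected.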
This pull request is now in conflict... :(
Force-pushed from a21d0c6 to 97b858c
This pull request is now in conflict... :(
Force-pushed from 97b858c to d3e0e87
Hello @tgaddair! Thanks for updating this PR.
Comment last updated at 2020-11-04 22:08:41 UTC
@tgaddair mind keeping all the horovod code in the accelerator? i.e. we shouldn't call if ddp: at the call site; it should be
Please use the horovod accelerator as described so we can keep the code clean and simple...
thanks!
This pull request is now in conflict... :(
Force-pushed from 0a8fdae to 6450f1a
Great! If I understand correctly, each accelerator backend will now (optionally) implement
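The commit list below mentions "Added sync_tensor to the distributed backend", so the per-accelerator hook might look roughly like this (a sketch under that assumption; the real class names and signatures in the accelerator package may differ):

```python
# Sketch of an optional per-accelerator sync hook; the sync_tensor name
# comes from the PR's commit list, everything else is illustrative.

class Accelerator:
    def sync_tensor(self, tensor, group=None, reduce_op=None):
        # Default behavior: single-process / no-op, return tensor unchanged.
        return tensor


class HorovodAccelerator(Accelerator):
    def sync_tensor(self, tensor, group=None, reduce_op=None):
        # Horovod is assumed importable whenever this backend is selected.
        import horovod.torch as hvd
        return hvd.allreduce(tensor)
```

The trainer can then call accelerator.sync_tensor(...) uniformly, without branching on the backend at the call site, which is the design asked for above.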
@williamFalcon @edenlightning @teddykoker this is ready to go.
@SeanNaren @tchaton mind prioritizing this to land? thanks!
Overall, looks great! Some minor changes and one question about your first test.
Co-authored-by: Jirka Borovec <[email protected]>
LGTM!
pinging @teddykoker to have a look since it touches the metric package (albeit not too intrusive)
Looks good to me! At some point we'll need a way of providing the sync function to metrics automatically
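The commit list mentions "Added option pass callable allgather function to metric base class" and "Added dist_sync_fn", so the injection point might be sketched like this (simplified and hypothetical relative to the real Metric class):

```python
# Sketch of injecting a sync callable into a metric base class;
# dist_sync_fn and dist_sync_on_step are named in the PR's commit list,
# the rest of this class is illustrative.

class Metric:
    def __init__(self, dist_sync_on_step=False, dist_sync_fn=None):
        self.dist_sync_on_step = dist_sync_on_step
        # dist_sync_fn lets the caller supply e.g. hvd.allgather or a
        # torch.distributed-based gather; None means no cross-worker sync.
        self.dist_sync_fn = dist_sync_fn

    def _sync(self, value):
        if self.dist_sync_fn is not None:
            return self.dist_sync_fn(value)
        return value  # single-process behavior


# Usage idea: metric = Metric(dist_sync_fn=hvd.allgather)
```

Providing the callable automatically, as suggested above, would just mean the trainer sets dist_sync_fn from the active accelerator instead of the user.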
* Fixed early stopping for Horovod
* Refactored to sync_dist_if_available
* Bump min Horovod version to support hvd.is_initialized
* Changelog
* Added back change for Horovod
* Removed redundant checks for initialization
* Implement metrics gathering for Horovod
* Added test for EvalResult
* Renamed ddp_sync_on_step -> dist_sync_on_step
* Added metric test for Horovod
* Added option to pass callable allgather function to metric base class
* Added dist_sync_fn
* Fixed calls to private _sync_dist
* Fixed Horovod test
* Added sync_tensor to the distributed backend
* Skip Windows
* Insert test path
* Removed redundant import
* Updated drone
* Unset HOROVOD_GPU_ALLREDUCE
* Unset
* No cache dir
* No uninstall
* Unset variables
* Uninstall Horovod during initialization
* Replaced more references to ddp_sync_on_step
* Fixed imports
* Fixed attribute
* Added back default
* Lint
* Added back docstring
* Made gather_all_tensors default
* Added whitespace
* Update tests/models/test_horovod.py
* Update pytorch_lightning/metrics/metric.py
* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Teddy Koker <[email protected]>
Co-authored-by: Sean Naren <[email protected]>

(cherry picked from commit 51cc7a8)
What does this PR do?
Fixes #3381 by introducing Horovod metrics aggregation alongside the existing support for DDP. Now, whenever a metric needs to be aggregated across workers, we first check whether DDP is initialized, then whether Horovod is available and initialized, and otherwise return the original value.
This means that sync_ddp has been renamed to sync_dist, and a new function sync_dist_if_available has been added to check each framework individually. Because this change relies on functionality introduced in Horovod v0.20.0, the minimum supported Horovod version has been bumped accordingly.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃