Horovod: fixed early stopping and added metrics aggregation #3775
Conversation
can we add a test for this case...
    Return:
        reduced value
    """
    if torch.distributed.is_available() and torch.distributed.is_initialized():
So at first it tries PyTorch distributed, and then Horovod?
At the moment, this component does not have access to the Trainer state that tells us which distributed backend we're using. A better design would be to inject this somehow, but it's not immediately clear how we could do that, since the EvalResult is created by the LightningModule.
So for now, we need to check each framework individually.
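A rough sketch of that framework-by-framework check (illustrative only, not the PR's actual implementation; the reduction op and process-group arguments are omitted for brevity, and the hvd.is_initialized check is the reason the minimum Horovod version is bumped to 0.20.0):

```python
# Minimal sketch of the fallback order described above; names and
# signatures are illustrative, not copied from the PR.

def sync_dist_if_available(value):
    """Reduce `value` across workers if a distributed framework is active."""
    # 1. Try native PyTorch distributed (DDP).
    try:
        import torch.distributed as dist
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(value)
            return value
    except ImportError:
        pass

    # 2. Fall back to Horovod (hvd.is_initialized needs Horovod >= 0.20.0).
    try:
        import horovod.torch as hvd
        if hvd.is_initialized():
            return hvd.allreduce(value)
    except ImportError:
        pass

    # 3. No distributed backend active: return the value unchanged.
    return value
```

In a single-process run neither branch fires, so the value passes through untouched, which keeps non-distributed training unaffected.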
This pull request is now in conflict... :(
Force-pushed from a21d0c6 to 97b858c
This pull request is now in conflict... :(
Force-pushed from 97b858c to d3e0e87
Hello @tgaddair! Thanks for updating this PR.
Comment last updated at 2020-11-04 22:08:41 UTC
@tgaddair mind keeping all the horovod code in the accelerator? i.e. we shouldn't call if ddp: at the call site; it should be
Please use the horovod accelerator as described so we can keep the code clean and simple...
thanks!
This pull request is now in conflict... :(
Force-pushed from 0a8fdae to 6450f1a
Great! If I understand correctly, each accelerator backend will now (optionally) implement
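The commit list below mentions "Added sync_tensor to the distributed backend", so the per-accelerator hook might look roughly like this (a sketch under that assumption; the real class names and signatures in the accelerator package may differ):

```python
# Sketch of an optional per-accelerator sync hook; the sync_tensor name
# comes from the PR's commit list, everything else is illustrative.

class Accelerator:
    def sync_tensor(self, tensor, group=None, reduce_op=None):
        # Default behavior: single-process / no-op, return tensor unchanged.
        return tensor


class HorovodAccelerator(Accelerator):
    def sync_tensor(self, tensor, group=None, reduce_op=None):
        # Horovod is assumed importable whenever this backend is selected.
        import horovod.torch as hvd
        return hvd.allreduce(tensor)
```

The trainer can then call accelerator.sync_tensor(...) uniformly, without branching on the backend at the call site, which is the design asked for above.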
@williamFalcon @edenlightning @teddykoker this is ready to go.
@SeanNaren @tchaton mind prioritizing this to land? thanks!
Overall, looks great! Some minor changes and one question about your first test.
Co-authored-by: Jirka Borovec <[email protected]>
LGTM!
pinging @teddykoker to have a look since it touches the metric package (albeit not too intrusive)
Looks good to me! At some point we'll need a way of providing the sync function to metrics automatically
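The commit list mentions "Added option pass callable allgather function to metric base class" and "Added dist_sync_fn", so the injection point might be sketched like this (simplified and hypothetical relative to the real Metric class):

```python
# Sketch of injecting a sync callable into a metric base class;
# dist_sync_fn and dist_sync_on_step are named in the PR's commit list,
# the rest of this class is illustrative.

class Metric:
    def __init__(self, dist_sync_on_step=False, dist_sync_fn=None):
        self.dist_sync_on_step = dist_sync_on_step
        # dist_sync_fn lets the caller supply e.g. hvd.allgather or a
        # torch.distributed-based gather; None means no cross-worker sync.
        self.dist_sync_fn = dist_sync_fn

    def _sync(self, value):
        if self.dist_sync_fn is not None:
            return self.dist_sync_fn(value)
        return value  # single-process behavior


# Usage idea: metric = Metric(dist_sync_fn=hvd.allgather)
```

Providing the callable automatically, as suggested above, would just mean the trainer sets dist_sync_fn from the active accelerator instead of the user.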
* Fixed early stopping for Horovod
* Refactored to sync_dist_if_available
* Bump min Horovod version to support hvd.is_initialized
* Changelog
* Added back change for Horovod
* Removed redundant checks for initialization
* Implement metrics gathering for Horovod
* Added test for EvalResult
* Renamed ddp_sync_on_step -> dist_sync_on_step
* Added metric test for Horovod
* Added option to pass callable allgather function to metric base class
* Added dist_sync_fn
* Fixed calls to private _sync_dist
* Fixed Horovod test
* Added sync_tensor to the distributed backend
* Skip Windows
* Insert test path
* Removed redundant import
* Updated drone
* Unset HOROVOD_GPU_ALLREDUCE
* Unset
* No cache dir
* No uninstall
* Unset variables
* Uninstall Horovod during initialization
* Replaced more references to ddp_sync_on_step
* Fixed imports
* Fixed attribute
* Added back default
* Lint
* Added back docstring
* Made gather_all_tensors default
* Added whitespace
* Update tests/models/test_horovod.py
* Update pytorch_lightning/metrics/metric.py
* Update CHANGELOG.md

Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Teddy Koker <[email protected]>
Co-authored-by: Sean Naren <[email protected]>

(cherry picked from commit 51cc7a8)
What does this PR do?
Fixes #3381 by introducing Horovod metrics aggregation alongside the existing support for DDP. Now, whenever a metric needs to be aggregated across workers, we first check whether DDP is initialized, then whether Horovod is available and initialized, and otherwise return the original value.
This means that sync_ddp has been renamed to sync_dist, and a new function sync_dist_if_available has been added to check each framework individually. Because this change relies on functionality introduced in Horovod v0.20.0, the minimum supported Horovod version has been bumped accordingly.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃