Resiliency features update #9979
Merged: jbieniusiewi merged 1 commit into main from cherry-pick-main-7cc3fb20f2d4c7672cf23837a75312c50f70ba2e on Jul 31, 2024
jbieniusiewi approved these changes on Jul 31, 2024
* Added safety_factor; removed logger passing
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
* updated straggler test: make training epoch longer; use just 2 workers per data loader
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: jbieniusiewi <[email protected]>
* Fixed PreemptionCallback; Added dir param for FT callback
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
* Doc update
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
* save_dir -> exp_dir
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: jbieniusiewi <[email protected]>
* fixed FT exp dir
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
* Added comment
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
* Apply isort and black reformatting
  Signed-off-by: jbieniusiewi <[email protected]>
* Removed changes in preemption.py - moved to another PR
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
* Removed duplicated straggler det. section from documentation
  Signed-off-by: Jacek Bieniusiewicz <[email protected]>
---------
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
jbieniusiewi force-pushed the cherry-pick-main-7cc3fb20f2d4c7672cf23837a75312c50f70ba2e branch from 59820fa to 2e11705 on July 31, 2024 09:43
jbieniusiewi deleted the cherry-pick-main-7cc3fb20f2d4c7672cf23837a75312c50f70ba2e branch on July 31, 2024 12:49
xuanzic pushed a commit to xuanzic/NeMo that referenced this pull request on Aug 1, 2024
* Added safety_factor; removed logger passing
* updated straggler test: make training epoch longer; use just 2 workers per data loader
* Apply isort and black reformatting
* Fixed PreemptionCallback; Added dir param for FT callback
* Doc update
* save_dir -> exp_dir
* Apply isort and black reformatting
* fixed FT exp dir
* Added comment
* Apply isort and black reformatting
* Removed changes in preemption.py - moved to another PR
* Removed duplicated straggler det. section from documentation
---------
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Vivian Chen <[email protected]>
monica-sekoyan pushed a commit that referenced this pull request on Oct 14, 2024
* Added safety_factor; removed logger passing
* updated straggler test: make training epoch longer; use just 2 workers per data loader
* Apply isort and black reformatting
* Fixed PreemptionCallback; Added dir param for FT callback
* Doc update
* save_dir -> exp_dir
* Apply isort and black reformatting
* fixed FT exp dir
* Added comment
* Apply isort and black reformatting
* Removed changes in preemption.py - moved to another PR
* Removed duplicated straggler det. section from documentation
---------
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request on Nov 5, 2024
* Added safety_factor; removed logger passing
* updated straggler test: make training epoch longer; use just 2 workers per data loader
* Apply isort and black reformatting
* Fixed PreemptionCallback; Added dir param for FT callback
* Doc update
* save_dir -> exp_dir
* Apply isort and black reformatting
* fixed FT exp dir
* Added comment
* Apply isort and black reformatting
* Removed changes in preemption.py - moved to another PR
* Removed duplicated straggler det. section from documentation
---------
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Signed-off-by: Hainan Xu <[email protected]>
XuesongYang pushed a commit to paarthneekhara/NeMo that referenced this pull request on Jan 18, 2025
* Added safety_factor; removed logger passing
* updated straggler test: make training epoch longer; use just 2 workers per data loader
* Apply isort and black reformatting
* Fixed PreemptionCallback; Added dir param for FT callback
* Doc update
* save_dir -> exp_dir
* Apply isort and black reformatting
* fixed FT exp dir
* Added comment
* Apply isort and black reformatting
* Removed changes in preemption.py - moved to another PR
* Removed duplicated straggler det. section from documentation
---------
Signed-off-by: Jacek Bieniusiewicz <[email protected]>
Signed-off-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
Co-authored-by: jbieniusiewi <[email protected]>
What does this PR do?
Changes for the updated version of resiliency features:
Collection: [Note which collection this PR will affect]
Changelog
* Added safety_factor FT param
* Added exp_dir to the FT callback
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information