-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wip] Test removing hpc save/load #7537
Conversation
# call hpc specific hook | ||
model.on_hpc_load(checkpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to deprecate these hooks in favor of on_save/load_checkpoint
Codecov Report
@@ Coverage Diff @@
## master #7537 +/- ##
=======================================
- Coverage 92% 88% -4%
=======================================
Files 196 195 -1
Lines 12835 12757 -78
=======================================
- Hits 11832 11198 -634
- Misses 1003 1559 +556 |
log = logging.getLogger(__name__) | ||
|
||
|
||
class SLURMConnector: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the signal handling is not tested, so no surprise there are no failing tests.
But as you can see in the code below, this handles automatic job resubmission, and if you remove it, Lightning will lose support for this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it expected that lightning needs to natively handle automatic job resubmission? or does it make sense for this to be part of the application using Lightning to handle?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is expected and a pretty important feature of the Lightning framework :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- would it make sense for this to be part of the SLURMEnvironment if users wanted to disable automatic resubmission?
- should resubmission be opt-in instead of opt-out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense for this to be part of the SLURMEnvironment if users wanted to disable automatic resubmission?
yes
#6303
@ananthsub yes... this is a very important feature and expected by lightning. let's keep this in the framework. |
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://pytorch-lightning.readthedocs.io/en/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Slack. Thank you for your contributions. |
This pull request is going to be closed. Please feel free to reopen it create a new from the actual master. |
What does this PR do?
This is redundant with the existing model checkpoint save/load/resume functionality. we shouldn't have 2 paths in the codebase that do the same thing
sending out as a PR to see what breaks
Fixes #<issue_number>
Related issues: #6204, #5373
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃