[bug-fix] Trainer.test points to latest best_model_path #5161
Conversation
@@ -12,6 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # Running special tests
+set -e
what does this do?
I noticed special_tests.sh doesn't return an error when a test fails. I found that set -e fixes this: it makes the script exit with a non-zero status as soon as any command fails.
}

def on_load_checkpoint(self, checkpointed_state: Dict[str, Any]):
    self.best_model_score = checkpointed_state["best_model_score"]
    self.best_model_path = checkpointed_state["best_model_path"]
    self.dirpath = checkpointed_state.get("dirpath", self.dirpath)
What if dirpath is changed when the Trainer is reinitialized with a new checkpoint callback?
also, I don't see it being used anywhere.
Great question! I had a chat with Adrian; it is more of a design choice. Say we want to fine-tune a model from a given checkpoint: it would make sense for the new checkpoints to be saved in the same folder. Happy to brainstorm on this one. See the sketch below.
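To make the scenario concrete, here is a minimal sketch of the situation being debated. The folder names are hypothetical, and the comments describe the behaviour implied by the on_load_checkpoint hook added in this diff:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# A user fine-tunes from an old run but points the new callback elsewhere.
ckpt_cb = ModelCheckpoint(dirpath="finetune_ckpts/")
trainer = Trainer(callbacks=[ckpt_cb], resume_from_checkpoint="pretrain_ckpts/best.ckpt")

# When the callback state stored in the old checkpoint is reloaded during fit,
# on_load_checkpoint sets ckpt_cb.dirpath back to the stored "pretrain_ckpts/",
# so new best checkpoints keep landing next to the original run's files.
```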
I think there are more things we need to restore, like save_top_k, best_k_models, etc., so I'd suggest keeping that for another PR. We also need to handle the case where the new checkpoint callback instance has a different dirpath when fine-tuning; there is an open issue for that.
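For illustration only, a hedged sketch of what that follow-up restore could look like as a subclass. The extra keys are assumptions and are not written by this PR's on_save_checkpoint:

```python
from typing import Any, Dict

from pytorch_lightning.callbacks import ModelCheckpoint


class StatefulModelCheckpoint(ModelCheckpoint):
    """Hypothetical extension restoring more bookkeeping than this PR does."""

    def on_load_checkpoint(self, checkpointed_state: Dict[str, Any]) -> None:
        self.best_model_score = checkpointed_state["best_model_score"]
        self.best_model_path = checkpointed_state["best_model_path"]
        # The keys below are assumed to have been saved by a matching
        # on_save_checkpoint; they are not part of this PR.
        self.best_k_models = checkpointed_state.get("best_k_models", self.best_k_models)
        self.kth_best_model_path = checkpointed_state.get("kth_best_model_path", self.kth_best_model_path)
        self.kth_value = checkpointed_state.get("kth_value", self.kth_value)
```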
def resolve_resume_from_checkpoint(self):
    if not self._trainer_has_checkpoint_callbacks():
        return self.trainer.resume_from_checkpoint
    checkpoint_callbacks = self.trainer.checkpoint_callbacks[0]
    if os.path.exists(checkpoint_callbacks.best_model_path):
        resume_from_checkpoint_options = [
            checkpoint_callbacks.best_model_path,
            self.trainer.resume_from_checkpoint
        ]
        resume_from_checkpoint_options.sort()
        return resume_from_checkpoint_options[-1]
    return self.trainer.resume_from_checkpoint
I don't think this is correct. If I set test(ckpt_path=some_checkpoint_path), it will possibly reload best_model_path again during restore. Why does it even restore from the resume checkpoint while doing testing?
I think to resolve this properly, this part should be fixed/refactored:
https://github.com/PyTorchLightning/pytorch-lightning/blob/176735097ab5be9ee21d3e7a3dedc174f3e0dd3f/pytorch_lightning/accelerators/gpu_accelerator.py#L61-L69
since setup_training is called for both training and testing, and it contains some steps required in both cases and others required only during training (e.g. restoring the checkpoint).
Hey @rohitgr7.
Here is the problem I was trying to resolve:

1. Load a checkpoint from resume_from_checkpoint.
2. Fine-tune the model for several epochs, which might create a new best_model_path checkpoint.
3. When calling trainer.test(), it should use the new best_model_path and not resume_from_checkpoint (see the sketch below).
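A minimal sketch of that workflow; LitModel and the paths are placeholders, not code from this PR:

```python
from pytorch_lightning import Trainer

model = LitModel()  # placeholder LightningModule for this sketch
trainer = Trainer(max_epochs=10, resume_from_checkpoint="pretrained/last.ckpt")

# Fine-tuning may write a new best checkpoint via the ModelCheckpoint callback.
trainer.fit(model)

# Expected behaviour after this fix: test() evaluates the new best checkpoint,
# not the checkpoint that was passed through resume_from_checkpoint.
trainer.test()
```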
If I understand properly, you are saying we should skip this restore from resume_from_checkpoint in .test and load directly from best_model_path?
Best,
T.C
Yes, best checkpoint or any other checkpoint, it will be loaded here: https://github.com/PyTorchLightning/pytorch-lightning/blob/d1e97a4f114a285349e31e330c7bf8937bc1ee04/pytorch_lightning/trainer/trainer.py#L770-L785, so we can just skip the .restore call. Also, a few more hooks are called while doing .test, like on_pretrain_routine_start and on_pretrain_routine_end, because setup_training is called for both train and test, which is incorrect too IMO. So I'd suggest splitting/refactoring setup_training a bit.
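As a rough illustration of that suggested split (the class and method names here are made up for the sketch and do not exist in the codebase):

```python
class AcceleratorSetupSketch:
    """Hypothetical split of setup_training() into shared and train-only steps."""

    def setup(self, model):
        # Steps needed for both fit() and test(): environment, device placement, etc.
        self.model = model

    def setup_fit(self, model):
        # Train-only extras: restore weights/optimizer state from
        # resume_from_checkpoint and run the pretrain-routine hooks.
        self.setup(model)
        self.restore_training_state = True

    def setup_test(self, model):
        # No checkpoint restore and no pretrain-routine hooks here;
        # trainer.test() loads the requested ckpt_path itself.
        self.setup(model)
```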
Codecov Report

@@           Coverage Diff           @@
##           master   #5161    +/-   ##
=======================================
  Coverage      93%      93%
=======================================
  Files         134      134
  Lines        9970     9976     +6
=======================================
+ Hits         9286     9294     +8
+ Misses        684      682     -2
One general question: if I pretrain on one PC, does this mean I cannot fine-tune on another if the filesystem isn't the same?
Hello @tchaton! Thanks for updating this PR.
Comment last updated at 2021-01-05 08:52:09 UTC
missing changelog
Done!
* resolve bug
* update code
* add set -e
* Update pytorch_lightning/callbacks/model_checkpoint.py (Co-authored-by: Adrian Wälchli <[email protected]>)
* update test
* Update tests/checkpointing/test_trainer_checkpoint.py (Co-authored-by: Sean Naren <[email protected]>)
* Update tests/checkpointing/test_trainer_checkpoint.py (Co-authored-by: Carlos Mocholí <[email protected]>)
* update on comments
* resolve test
* convert to set
* update
* add error triggering
* update
* update on comments
* update
* resolve import
* update
* update
* Update pytorch_lightning/plugins/rpc_plugin.py (Co-authored-by: Jirka Borovec <[email protected]>)
* update

Co-authored-by: Adrian Wälchli <[email protected]>
Co-authored-by: Sean Naren <[email protected]>
Co-authored-by: Carlos Mocholí <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Jirka Borovec <[email protected]>

(cherry picked from commit d5b3678)
@tchaton by not restoring when testing, a use case of ours breaks: we skip calling .restore, so any state a callback loads in its on_load_checkpoint hook is never restored. A concrete use case to motivate this: we developed an Exponential Moving Average callback and load the EMA state in that callback's on_load_checkpoint; with the restore skipped, the EMA weights are lost at test time. I think we should be able to restore other states when resuming for testing. WDYT?
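For context, a hedged sketch of what such an EMA callback's state handling could look like. This is not the actual callback mentioned above; the hook signatures follow the Lightning version this PR targets, and the class is illustrative only:

```python
import copy
from typing import Any, Dict

from pytorch_lightning.callbacks import Callback


class EMACallback(Callback):
    """Illustrative EMA callback; not the implementation referenced above."""

    def __init__(self, decay: float = 0.999):
        self.decay = decay
        self.ema_state = None

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        params = pl_module.state_dict()
        if self.ema_state is None:
            self.ema_state = copy.deepcopy(params)
        else:
            for name, param in params.items():
                if param.dtype.is_floating_point:
                    # ema = decay * ema + (1 - decay) * param
                    self.ema_state[name].mul_(self.decay).add_(param, alpha=1 - self.decay)

    def on_save_checkpoint(self, trainer, pl_module) -> Dict[str, Any]:
        # Returned state is persisted with the checkpoint's callback states.
        return {"ema_state": self.ema_state}

    def on_load_checkpoint(self, checkpointed_state: Dict[str, Any]) -> None:
        # If .restore() is skipped when calling trainer.test(), this hook never
        # fires, so the averaged weights are lost - the breakage described above.
        self.ema_state = checkpointed_state["ema_state"]
```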
seems reasonable to load the trainer & callback states at least 👍, or should we have
What does this PR do?
Fixes #5091, #5318, #5288.
The test for PipeRPC was failing; this PR also resolves it.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines.
Did you have fun?
Make sure you had fun coding 🙃