Fix CPU bug, overhaul model runner, and update to lightning >=2.0 #176
Conversation
Codecov Report
@@ Coverage Diff @@
## main #176 +/- ##
==========================================
+ Coverage 79.10% 87.63% +8.53%
==========================================
Files 11 12 +1
Lines 804 833 +29
==========================================
+ Hits 636 730 +94
+ Misses 168 103 -65
... and 2 files with indirect coverage changes
All tests passing on Linux, Windows, and macOS CPU runners! 🎉
Very nice changes, I just have a few minor comments.
Thanks @bittremieux! Your turn now @melihyilmaz 🚀
I stumbled upon one issue when testing locally, which should be fixed with my recent commit, but everything else looks great. @wfondrie Can you take a look at the only failing test? I wasn't sure whether we need to tweak the test or my recent commit. Feel free to merge afterwards!
Thanks!
Ok, so I think I have something that'll work pretty robustly: the model weights are loaded onto the current PyTorch default device, which is normally CPU. However, this lets us test it by changing the default device. Also, I went ahead and included the initialization parameters with the model weights, so that loading weights is independent of the configuration provided: the model will always match the loaded weights, except in the event of major architecture changes (Issue #156)
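For readers following along, a minimal sketch of what this pattern looks like with PyTorch Lightning; the class name, init arguments, and the `torch.empty(0).device` trick for querying the default device are illustrative assumptions, not necessarily the exact code in this PR:

```python
import torch
import pytorch_lightning as pl


class Spec2Pep(pl.LightningModule):  # hypothetical stand-in for the real model
    def __init__(self, dim_model: int = 512, n_layers: int = 9):
        super().__init__()
        # save_hyperparameters() stores the init args in the checkpoint, so a
        # later load_from_checkpoint() rebuilds the matching architecture
        # without needing the user-provided configuration.
        self.save_hyperparameters()


# Map the stored weights onto whatever the current default device is
# (normally CPU); tests can change the default device to exercise this path.
device = torch.empty(0).device  # the current default device
model = Spec2Pep.load_from_checkpoint("model.ckpt", map_location=device)
```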
Good thing the integration tests were working, because they helped me catch a bug! Anyway, I also changed the checkpointing behavior to only keep the top 5 checkpoints based on validation CE loss, but changed the
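A hedged sketch of the "keep the top 5 checkpoints by validation CE loss" behavior using Lightning's `ModelCheckpoint` callback; the monitored metric name and directory are assumptions for illustration, not the exact configuration in this PR:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    monitor="valid_CELoss",  # assumed name for the validation cross-entropy loss
    mode="min",              # lower loss is better
    save_top_k=5,            # keep only the 5 best checkpoints
)

trainer = pl.Trainer(callbacks=[checkpoint_callback])
```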
@melihyilmaz and @bittremieux - any objections to my updates?
@wfondrie, the updates look good to me.
Thanks @wfondrie, I've tested locally and identified only two issues that arose with the recent commits.
No more issues, good to merge!
This is a huge PR that:

- Removes the `no_gpu` parameter, in favor of providing `accelerator` and `devices` parameters to allow users to select custom devices readily without negating `CUDA_VISIBLE_DEVICES` (see the sketch after this list). I think this is the best compromise for Add num_gpus config parameter #173.
- Overhauls `model_runner.py` into a class, `ModelRunner`. I found it annoying to have to change arguments for models and such in multiple spots, so I think this change will make it much more maintainable.
- Updates to lightning >=2.0 and uses `on_predict_batch_end` rather than `on_predict_epoch_end`, because the latter seems to no longer receive the predict results. The newer version of Lightning doesn't allow for dictionary metrics to be logged in the way we were doing before, so please pay attention to the changes in review.

I'm tagging both @bittremieux and @melihyilmaz to review this one since it is so big and I don't want to mess something up 🙈.