Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CPU bug, overhaul model runner, and update to lightning >=2.0 #176

Merged
merged 18 commits into from
May 10, 2023

Conversation

wfondrie
Copy link
Collaborator

@wfondrie wfondrie commented Apr 22, 2023

This is a huge PR that:

  • Hopefully fixes CPU-related bugs (Integration test fails on no-GPU machine #106). It works locally for me, even with MPS disabled.
  • Removes the no_gpu parameter, in favor of providing accelerator and devices parameters to allow users to select custom devices readily without negating CUDA_VISIBLE_DEVICES. I think this is the best comprimise for Add num_gpus config parameter #173.
  • Completely refactors model_runner.py into a class, ModelRunner. I found it annoying to have to change arguments for models and such in multiple spots, so I think this change will make it much more maintainable.
  • Makes a bunch of minor tweaks required for Pytorch-Lighting 2.0+. One of those is changing how metric logging and PSM recording is done. PSM writing is now done using the on_predict_batch_end rather than on_predict_epoch_end because the latter seems to no longer receive the predict results. The newer version of Lightning doesn't allow for dictionary metrics to be logged in the way we were doing before, so please pay attention to the changes in review.
  • The integration test was overhauled to make it useful. It now trains a tiny model, evaluates it on a tiny file, and predicts the sequences for the same file.
  • Limits prediction to using 1 device.

I'm tagging both @bittremieux and @melihyilmaz to review this one since this one is so big and I don't want to mess something up 🙈.

@codecov
Copy link

codecov bot commented Apr 22, 2023

Codecov Report

Merging #176 (8208b93) into main (d023fa9) will increase coverage by 8.53%.
The diff coverage is 86.90%.

@@            Coverage Diff             @@
##             main     #176      +/-   ##
==========================================
+ Coverage   79.10%   87.63%   +8.53%     
==========================================
  Files          11       12       +1     
  Lines         804      833      +29     
==========================================
+ Hits          636      730      +94     
+ Misses        168      103      -65     
Impacted Files Coverage Δ
casanovo/config.py 86.84% <ø> (+1.47%) ⬆️
casanovo/denovo/model.py 95.80% <82.60%> (+16.79%) ⬆️
casanovo/denovo/dataloaders.py 98.18% <85.71%> (+12.99%) ⬆️
casanovo/denovo/model_runner.py 87.59% <85.83%> (+38.51%) ⬆️
casanovo/casanovo.py 63.63% <100.00%> (-27.35%) ⬇️
casanovo/data/ms_io.py 96.66% <100.00%> (+0.11%) ⬆️
casanovo/denovo/__init__.py 100.00% <100.00%> (ø)

... and 2 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@wfondrie
Copy link
Collaborator Author

All tests passing on Linux, Windows, and MacOS CPU runners! 🎉

Copy link
Collaborator

@bittremieux bittremieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice changes, I just have a few minor comments.

casanovo/casanovo.py Outdated Show resolved Hide resolved
casanovo/config.yaml Outdated Show resolved Hide resolved
casanovo/denovo/dataloaders.py Outdated Show resolved Hide resolved
casanovo/denovo/dataloaders.py Show resolved Hide resolved
casanovo/denovo/model.py Show resolved Hide resolved
tests/test_integration.py Show resolved Hide resolved
tests/test_integration.py Show resolved Hide resolved
casanovo/denovo/model_runner.py Outdated Show resolved Hide resolved
casanovo/denovo/model_runner.py Show resolved Hide resolved
casanovo/denovo/model_runner.py Outdated Show resolved Hide resolved
@wfondrie
Copy link
Collaborator Author

Thanks @bittremieux! Your turn now @melihyilmaz 🚀

Copy link
Collaborator

@melihyilmaz melihyilmaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stumbled upon one issue when testing locally, which should be fixed with my recent commit, but everything else looks great. @wfondrie Can you take a look at the only failing test - I wasn't sure if we need to tweak test or my recent commit ? Feel free to merge afterwards!

Thanks!

casanovo/denovo/model_runner.py Outdated Show resolved Hide resolved
@wfondrie
Copy link
Collaborator Author

Ok, so I think I have something that'll work pretty robustly: the model weights are loaded onto the current PyTorch default device, which is normally CPU. However, this let's us test it by changing the default device to the meta device that @bittremieux mentioned, which I've done here.

Also, I went ahead and included the initialization parameters with the model weights, so that loading weights is independent of the configuration provided --- the model will always match the loaded weights, except for the event of major architecture changes (Issue #156)

@wfondrie
Copy link
Collaborator Author

Good thing the integration tests were working, because it helped me catch a bug!

Anyway, I also changed checkpointing behavior to only keep the top 5 checkboints based on validation CE loss, but changed the save_models parameter to save_top_k, allowing us to set it at -1 if we reeeeally want all the checkpoints.

@wfondrie wfondrie linked an issue May 10, 2023 that may be closed by this pull request
@wfondrie
Copy link
Collaborator Author

@melihyilmaz and @bittremieux - any objections to my updates?

Copy link
Collaborator

@bittremieux bittremieux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wfondrie updates look good to me.

Copy link
Collaborator

@melihyilmaz melihyilmaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @wfondrie, I've tested locally and identified only two issues that arose with the recent commits.

casanovo/denovo/model_runner.py Outdated Show resolved Hide resolved
casanovo/denovo/model_runner.py Show resolved Hide resolved
Copy link
Collaborator

@melihyilmaz melihyilmaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more issues, good to merge!

@melihyilmaz melihyilmaz merged commit 6299bd2 into main May 10, 2023
@melihyilmaz melihyilmaz deleted the cpu-bug branch May 10, 2023 22:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants