Add beam search peptide decoding #87
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main      #87      +/-   ##
==========================================
+ Coverage   75.15%   79.08%   +3.92%
==========================================
  Files          10       10
  Lines         644      784     +140
==========================================
+ Hits          484      620     +136
- Misses        160      164       +4
* Download model weights from GitHub release
* Include dependencies
* Update model usage documentation
* Reformat with black
* Download weights to the OS-specific app dir
* Don't download weights if already in cache dir
* Update model file instructions
* Remove release notes from the README. We have this information on the Releases page now.
* Remove explicit model specification from example commands
* Harmonize default parameters and config values. As per discussion on Slack (https://noblelab.slack.com/archives/C01MXN4NWMP/p1659803053573279).
* No need to specify config file by default. This simplifies the examples that most users will want to use.
* Simplify version matching regex
* Remove depthcharge related tests. The transformer tests only deal with depthcharge functionality and just seem copied from its repository.
* Make sure that package data is included, i.e. the config YAML file.
* Remove obsolete (ppx) tests
* Update integration test
* Add MacOS support and support for Apple's MPS chips
* Fail test but print version
* Added n_worker fn and tests
* Create split_version fn and add unit tests
* Fix debugging unit test
* Explicitly set version
* Monkeypatch loaded version
* Add device selector, so that on CPU-only runs the devices > 0
* Add windows patch
* Fix typo
* Revert
* Use main process for data loading on Windows
* Fix typo
* Fix unit test
* Fix devices for when num_workers == 0
* Fix devices for when num_workers == 0
* Minor README updates
* Import reordering
* Minor code and docstring reformatting
* Test model weights retrieval
* Fix getting the number of devices
* Disable excessive Tensorboard deprecation warnings
* Don't use worker threads on MacOS. It crashes the DataLoader: pytorch/pytorch#70344
* Warnings need to be ignored before import
* Additional weights tests: non-matching version; GitHub rate limit exceeded
* Disable tests on MacOS
* Include Python 3.10 as supported version

Co-authored-by: William Fondrie <[email protected]>
Co-authored-by: Wout Bittremieux <[email protected]>
Co-authored-by: William Fondrie <[email protected]>
Nice work. This is non-trivial functionality to implement.
Here is some initial feedback. I'll probably want to do another thorough review after these comments have been addressed.
Some comments are relevant to multiple places in the code, but I haven't repeated the same thing multiple times, so be a bit mindful of that.
It's looking pretty good. I have a few comments / requests for clarification. I especially don't understand the latest NH3 loss-related changes.
Added a small fix to negative mass-aware termination (to ensure we're not terminating a beam if it doesn't exceed the tolerance under any negative-mass token) and unit tests covering that functionality. Only that and one other unresolved conversation above remain.
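To make the termination rule concrete, here is a minimal sketch of a negative-mass-aware check; the function name, arguments, and ppm-based tolerance are illustrative assumptions, not the actual Casanovo implementation.

```python
def beam_exceeds_precursor(
    current_mass: float,
    precursor_mass: float,
    tol_ppm: float,
    neg_mass_tokens: dict[str, float],
) -> bool:
    """Hypothetical check: flag a beam as over the precursor tolerance only
    if it remains over even after appending any single negative-mass token
    (e.g. an NH3-loss modification)."""

    def delta_ppm(mass: float) -> float:
        # Signed mass error relative to the precursor, in ppm.
        return (mass - precursor_mass) / precursor_mass * 1e6

    # Consider the beam as-is plus each hypothetical negative-mass extension.
    candidate_masses = [current_mass] + [
        current_mass + delta for delta in neg_mass_tokens.values()
    ]
    # Terminate for overshooting only if every candidate is still above the
    # upper tolerance bound.
    return all(delta_ppm(mass) > tol_ppm for mass in candidate_masses)
```

Under this sketch, a beam that could still be brought within tolerance by some negative-mass token keeps decoding instead of being terminated prematurely.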
Add beam search decoding to replace greedy decoding. beam_search_decode() is organized as a series of sub-functions so that each can be unit tested. From a larger pool of candidate peptide predictions, this beam search implementation caches the k highest-scoring peptides for each spectrum, prioritizing peptides that fit the observed precursor mass. As output, the highest-scoring peptide within the precursor m/z tolerance is returned, i.e. a single PSM per spectrum is recorded in the output mzTab file. If a spectrum has no cached peptides within the precursor m/z tolerance, its highest-scoring peptide is returned instead.
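As an illustration of the selection rule described above, the following sketch shows how a single PSM could be picked from a spectrum's top-k cache; the CachedPrediction layout and select_output_psm name are hypothetical and not the actual data structures used in beam_search_decode().

```python
from typing import NamedTuple, Optional


class CachedPrediction(NamedTuple):
    # Hypothetical cache entry for one candidate peptide of a spectrum.
    score: float
    peptide: str
    fits_precursor: bool  # Within the precursor m/z tolerance?


def select_output_psm(cache: list[CachedPrediction]) -> Optional[CachedPrediction]:
    """Pick the single PSM reported for a spectrum from its top-k cache."""
    if not cache:
        return None
    # Prefer the highest-scoring peptide that fits the observed precursor mass.
    fitting = [p for p in cache if p.fits_precursor]
    if fitting:
        return max(fitting, key=lambda p: p.score)
    # Otherwise fall back to the best-scoring cached peptide overall.
    return max(cache, key=lambda p: p.score)
```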
Also fixed the amino acid-level score calculation in model.on_predict_epoch_end(), which previously retrieved a list of amino acid scores shifted by one position, i.e. the first AA prediction was assigned the second score, the second AA the third score, etc., with the last AA assigned the score for the stop token.
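To make the off-by-one issue concrete, here is a small illustrative example; the token list and scores are made up, and the exact score layout (one score per decoding step, ending with the stop token) is an assumption.

```python
import torch

aa_tokens = ["P", "E", "P", "T", "K"]
# One score per decoding step: five amino acids followed by the stop token.
step_scores = torch.tensor([0.91, 0.88, 0.95, 0.80, 0.77, 0.99])

# Previous (buggy) alignment: drops the first score and pairs the last amino
# acid with the stop-token score.
buggy = list(zip(aa_tokens, step_scores[1:].tolist()))

# Fixed alignment: the i-th amino acid gets the i-th score, and the
# stop-token score is excluded from the per-AA scores.
fixed = list(zip(aa_tokens, step_scores[: len(aa_tokens)].tolist()))
```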