Add beam search peptide decoding (#87)

* Add beam search
* Delete print statements
* Automatically download model weights (#68) (#88)
* Download model weights from GitHub release
* Include dependencies
* Update model usage documentation
* Reformat with black
* Download weights to the OS-specific app dir
* Don't download weights if already in cache dir
* Update model file instructions
* Remove release notes from the README
  We have this information on the Releases page now.
* Remove explicit model specification from example commands
* Harmonize default parameters and config values
  As per discussion on Slack (https://noblelab.slack.com/archives/C01MXN4NWMP/p1659803053573279).
* No need to specify config file by default
  This simplifies the examples that most users will want to use.
* Simplify version matching regex
* Remove depthcharge related tests
  The transformer tests only deal with depthcharge functionality and just seem copied from its repository.
* Make sure that package data is included
  I.e. the config YAML file.
* Remove obsolete (ppx) tests
* Update integration test
* Add MacOS support and support for Apple's MPS chips
* Fail test but print version
* Added n_worker fn and tests
* Create split_version fn and add unit tests
* Fix debugging unit test
* Explicitly set version
* Monkeypatch loaded version
* Add device selector, so that on CPU-only runs the devices > 0
* Add Windows patch
* Fix typo
* Revert
* Use main process for data loading on Windows
* Fix typo
* Fix unit test
* Fix devices for when num_workers == 0
* Fix devices for when num_workers == 0
* Minor README updates
* Import reordering
* Minor code and docstring reformatting
* Test model weights retrieval
* Fix getting the number of devices
* Disable excessive Tensorboard deprecation warnings
* Don't use worker threads on MacOS
  It crashes the DataLoader: pytorch/pytorch#70344
* Warnings need to be ignored before import
* Additional weights tests
  - Non-matching version
  - GitHub rate limit exceeded
* Disable tests on MacOS
* Include Python 3.10 as supported version

Co-authored-by: William Fondrie
Co-authored-by: Wout Bittremieux

* Automatically download model weights (#68) (#89)
* Break beam search into testable subfunctions
* Fix precursor m/z termination and filtering
* Add unit testing for beam search
* Add beam search comments and fix formatting
* Address requested changes and minor fixes
* Add more unit tests for beam search
* Check NH3 loss for early stopping
* Consistent parameter order
* Update docstrings
* Remove unused precursors parameter
* Update beam matching mask in a level higher
* Minor refactoring to avoid code duplication
* Update imports
* Simplification refactoring
* Fix unit tests
* Simplify predicted peptide caching
* Simplify predicted peptide caching
* Simplify predicted peptide caching
* Unify predicted peptide caching
* Restrict tensor reshape to subfunction and minor fixes
* Finish beams when all isotopes exceed the precursor m/z tolerance
* Generalize look-ahead for tokens with negative mass
* Remove greedy decoding functionality
* Handle case with unfinished beams and add test
* Upgrade required depthcharge version
* Use detokenize function
* Add test for negative mass-aware termination
* Fix negative mass-aware beam termination
* Minor refactoring
* Add test for dummy output at max length
* Fixed and refactored peptide and score mzTab outputs
* Add tests for peptide and score output formatting
* Small fixes
* Update changelog
* Fix changelog update

Co-authored-by: Wout Bittremieux
Co-authored-by: William Fondrie
Noble-Lab · Nov 18, 2022 · dbfabb5
1 parent a810175
commit dbfabb5
Showing 8 changed files with 1,077 additions and 66 deletions.
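
For readers unfamiliar with the technique this commit introduces, the following is a minimal, generic sketch of beam search decoding. It is not Casanovo's implementation: the function `beam_search`, the toy scorer `toy_model`, and the stop token `"$"` are all hypothetical names chosen for illustration, and the sketch omits the precursor m/z based termination that the commit message mentions. It only shows why keeping `n_beams` candidates per step can find a higher-scoring sequence than greedy decoding.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[Tuple[str, ...], float]  # (token sequence, cumulative log-probability)


def beam_search(
    step_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    n_beams: int,
    max_length: int,
    stop_token: str = "$",
) -> List[Beam]:
    """Generic beam search; returns all kept sequences, best first."""
    beams: List[Beam] = [((), 0.0)]
    finished: List[Beam] = []
    for _ in range(max_length):
        # Expand every active beam by every possible next token.
        candidates: List[Beam] = []
        for prefix, score in beams:
            for token, log_p in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_p))
        # Keep only the n_beams highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:n_beams]:
            if seq[-1] == stop_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    # Beams still unfinished at max_length are returned as they are.
    finished.extend(beams)
    return sorted(finished, key=lambda c: c[1], reverse=True)


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical scorer in which the greedy first choice is suboptimal."""
    table = {
        (): {"A": 0.6, "B": 0.4},
        ("A",): {"C": 0.55, "D": 0.45},
        ("B",): {"C": 0.9, "D": 0.1},
    }
    probs = table.get(prefix, {"$": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


greedy_best = beam_search(toy_model, n_beams=1, max_length=5)[0]
beam_best = beam_search(toy_model, n_beams=2, max_length=5)[0]
```

With a single beam the search commits to `A` at the first step and cannot recover; with two beams it keeps `B` alive long enough to find the higher-probability sequence.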
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 ### Changed
 
 - Update PyTorch Lightning global seed setting.
+- Use beam search decoding rather than greedy decoding to predict the peptides.
 
 ### Fixed
 

diff --git a/casanovo/casanovo.py b/casanovo/casanovo.py
@@ -156,6 +156,7 @@ def main(
         weight_decay=float,
         train_batch_size=int,
         predict_batch_size=int,
+        n_beams=int,
         max_epochs=int,
         num_sanity_val_steps=int,
         train_from_scratch=bool,

diff --git a/casanovo/config.yaml b/casanovo/config.yaml
@@ -65,6 +65,7 @@ weight_decay: 1e-5
 # Training/inference options.
 train_batch_size: 32
 predict_batch_size: 1024
+n_beams: 5
 
 logger:
 max_epochs: 30
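
The `evaluate.py` change in this commit lets `aa_match_batch` accept peptides either as strings or as already tokenized sequences, splitting only the strings. The sketch below isolates that normalization step so the regex behaviour is easier to see. The helper name `to_tokens` is hypothetical and does not exist in the repository; the regular expression is the one used in the diff. It assumes Python 3.7 or later, where `re.split` splits on empty matches.

```python
import re
from typing import Iterable, List, Union


def to_tokens(peptide: Union[str, Iterable[str]]) -> List[str]:
    """Return a list of amino acid tokens for a string or token sequence."""
    if isinstance(peptide, str):
        # Split before every uppercase letter that is not at the start, so a
        # mass modification stays attached to the preceding residue.
        return re.split(r"(?<=.)(?=[A-Z])", peptide)
    # Already tokenized: just materialize it as a list.
    return list(peptide)


plain = to_tokens("PEPTIDE")
modified = to_tokens("PEM+15.995K")
pretokenized = to_tokens(("P", "E", "M+15.995", "K"))
```

Both the string form and the tokenized form of the modified peptide end up as the same four tokens, which is what allows the two input types to be compared directly.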

diff --git a/casanovo/denovo/evaluate.py b/casanovo/denovo/evaluate.py
@@ -1,6 +1,6 @@
 """Methods to evaluate peptide-spectrum predictions."""
 import re
-from typing import Dict, List, Tuple
+from typing import Dict, Iterable, List, Tuple
 
 import numpy as np
 from spectrum_utils.utils import mass_diff
@@ -182,8 +182,8 @@ def aa_match(
 
 
 def aa_match_batch(
-    peptides1: List[str],
-    peptides2: List[str],
+    peptides1: Iterable,
+    peptides2: Iterable,
     aa_dict: Dict[str, float],
     cum_mass_threshold: float = 0.5,
     ind_mass_threshold: float = 0.1,
@@ -194,10 +194,10 @@ def aa_match_batch(
 
     Parameters
     ----------
-    peptides1 : List[str]
-        The first list of (untokenized) peptide sequences to be compared.
-    peptides2 : List[str]
-        The second list of (untokenized) peptide sequences to be compared.
+    peptides1 : Iterable
+        The first list of peptide sequences to be compared.
+    peptides2 : Iterable
+        The second list of peptide sequences to be compared.
     aa_dict : Dict[str, float]
         Mapping of amino acid tokens to their mass values.
     cum_mass_threshold : float
@@ -221,13 +221,16 @@ def aa_match_batch(
     """
     aa_matches_batch, n_aa1, n_aa2 = [], 0, 0
     for peptide1, peptide2 in zip(peptides1, peptides2):
-        tokens1 = re.split(r"(?<=.)(?=[A-Z])", peptide1)
-        tokens2 = re.split(r"(?<=.)(?=[A-Z])", peptide2)
-        n_aa1, n_aa2 = n_aa1 + len(tokens1), n_aa2 + len(tokens2)
+        # Split peptides into individual AAs if necessary.
+        if isinstance(peptide1, str):
+            peptide1 = re.split(r"(?<=.)(?=[A-Z])", peptide1)
+        if isinstance(peptide2, str):
+            peptide2 = re.split(r"(?<=.)(?=[A-Z])", peptide2)
+        n_aa1, n_aa2 = n_aa1 + len(peptide1), n_aa2 + len(peptide2)
         aa_matches_batch.append(
             aa_match(
-                tokens1,
-                tokens2,
+                peptide1,
+                peptide2,
                 aa_dict,
                 cum_mass_threshold,
                 ind_mass_threshold,