Add beam search peptide decoding (#87)
* Add beam search

* Delete print statements

* Automatically download model weights (#68) (#88)

* Download model weights from GitHub release

* Include dependencies

* Update model usage documentation

* Reformat with black

* Download weights to the OS-specific app dir

* Don't download weights if already in cache dir

* Update model file instructions

* Remove release notes from the README

We have this information on the Releases page now.

* Remove explicit model specification from example commands

* Harmonize default parameters and config values

As per discussion on Slack (https://noblelab.slack.com/archives/C01MXN4NWMP/p1659803053573279).

* No need to specify config file by default

This simplifies the examples that most users will want to use.

* Simplify version matching regex

* Remove depthcharge related tests

The transformer tests only deal with depthcharge functionality and just seem copied from its repository.

* Make sure that package data is included

I.e. the config YAML file.

* Remove obsolete (ppx) tests

* Update integration test

* Add MacOS support and support for Apple's MPS chips

* Fail test but print version

* Added n_worker fn and tests

* Create split_version fn and add unit tests

* Fix debugging unit test

* Explicitly set version

* Monkeypatch loaded version

* Add device selector, so that on CPU-only runs the devices > 0

* Add windows patch

* Fix typo

* Revert

* Use main process for data loading on Windows

* Fix typo

* Fix unit test

* Fix devices for when num_workers == 0

* Fix devices for when num_workers == 0

* Minor README updates

* Import reordering

* Minor code and docstring reformatting

* Test model weights retrieval

* Fix getting the number of devices

* Disable excessive Tensorboard deprecation warnings

* Don't use worker threads on MacOS

It crashes the DataLoader: pytorch/pytorch#70344

* Warnings need to be ignored before import

* Additional weights tests

- Non-matching version
- GitHub rate limit exceeded

* Disable tests on MacOS

* Include Python 3.10 as supported version

Co-authored-by: William Fondrie <[email protected]>

Co-authored-by: Wout Bittremieux <[email protected]>
Co-authored-by: William Fondrie <[email protected]>

* Automatically download model weights (#68) (#89)

* Download model weights from GitHub release

* Include dependencies

* Update model usage documentation

* Reformat with black

* Download weights to the OS-specific app dir

* Don't download weights if already in cache dir

* Update model file instructions

* Remove release notes from the README

We have this information on the Releases page now.

* Remove explicit model specification from example commands

* Harmonize default parameters and config values

As per discussion on Slack (https://noblelab.slack.com/archives/C01MXN4NWMP/p1659803053573279).

* No need to specify config file by default

This simplifies the examples that most users will want to use.

* Simplify version matching regex

* Remove depthcharge related tests

The transformer tests only deal with depthcharge functionality and just seem copied from its repository.

* Make sure that package data is included

I.e. the config YAML file.

* Remove obsolete (ppx) tests

* Update integration test

* Add MacOS support and support for Apple's MPS chips

* Fail test but print version

* Added n_worker fn and tests

* Create split_version fn and add unit tests

* Fix debugging unit test

* Explicitly set version

* Monkeypatch loaded version

* Add device selector, so that on CPU-only runs the devices > 0

* Add windows patch

* Fix typo

* Revert

* Use main process for data loading on Windows

* Fix typo

* Fix unit test

* Fix devices for when num_workers == 0

* Fix devices for when num_workers == 0

* Minor README updates

* Import reordering

* Minor code and docstring reformatting

* Test model weights retrieval

* Fix getting the number of devices

* Disable excessive Tensorboard deprecation warnings

* Don't use worker threads on MacOS

It crashes the DataLoader: pytorch/pytorch#70344

* Warnings need to be ignored before import

* Additional weights tests

- Non-matching version
- GitHub rate limit exceeded

* Disable tests on MacOS

* Include Python 3.10 as supported version

Co-authored-by: William Fondrie <[email protected]>

Co-authored-by: Wout Bittremieux <[email protected]>
Co-authored-by: William Fondrie <[email protected]>

* Break beam search to testable subfunctions

* Fix precursor m/z termination and filtering

* Add unit testing for beam search

* Add beamsearch comments and fix formatting

* Address requested changes and minor fixes

* Add more unit tests for beam search

* Check NH3 loss for early stopping

* Consistent parameter order

* Update docstrings

* Remove unused precursors parameter

* Update beam matching mask in a level higher

* Minor refactoring to avoid code duplication

* Update imports

* Simplification refactoring

* Fix unit tests

* Simplify predicted peptide caching

* Simplify predicted peptide caching

* Simplify predicted peptide caching

* Unify predicted peptide caching

* Restrict tensor reshape to subfunction and minor fixes

* Finish beams when all isotopes exceed the precursor m/z tolerance

* Generalize look-ahead for tokens with negative mass

* Remove greedy decoding functionality

* Handle case with unfinished beams and add test

* Upgrade required depthcharge version

* Use detokenize function

* Add test for negative mass-aware termination

* Fix negative mass-aware beam termination

* Minor refactoring

* Add test for dummy output at max length

* Fixed and refactored peptide and score mzTab outputs

* Add tests for peptide and score output formatting

* Small fixes

* Update changelog

* Fix changelog update

Co-authored-by: Wout Bittremieux <[email protected]>
Co-authored-by: William Fondrie <[email protected]>
Co-authored-by: Wout Bittremieux <[email protected]>
4 people authored Nov 18, 2022
1 parent a810175 commit dbfabb5
Showing 8 changed files with 1,077 additions and 66 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
### Changed

- Update PyTorch Lightning global seed setting.
+- Use beam search decoding rather than greedy decoding to predict the peptides.

### Fixed

1 change: 1 addition & 0 deletions casanovo/casanovo.py
@@ -156,6 +156,7 @@ def main(
weight_decay=float,
train_batch_size=int,
predict_batch_size=int,
+n_beams=int,
max_epochs=int,
num_sanity_val_steps=int,
train_from_scratch=bool,
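The mapping extended in this hunk pairs each configuration key with a Python type, and the new `n_beams=int` entry follows that pattern. As a rough illustration of what such a mapping can be used for, the sketch below casts raw configuration values (for example, strings read from a file or the command line) to the listed types. `CONFIG_TYPES` and `cast_config` are hypothetical names, only a subset of the keys is shown, and this is not Casanovo's actual code.

```python
from typing import Any, Callable, Dict

# Hypothetical subset of the key -> type mapping shown in the hunk above.
CONFIG_TYPES: Dict[str, Callable[[Any], Any]] = dict(
    weight_decay=float,
    train_batch_size=int,
    predict_batch_size=int,
    n_beams=int,
    max_epochs=int,
)


def cast_config(
    raw: Dict[str, Any],
    types: Dict[str, Callable[[Any], Any]] = CONFIG_TYPES,
) -> Dict[str, Any]:
    """Return a copy of `raw` with known keys cast to their declared types.

    Keys without a declared type, and values that are None, are left as-is.
    """
    cast = dict(raw)
    for key, to_type in types.items():
        if cast.get(key) is not None:
            cast[key] = to_type(cast[key])
    return cast


# A value such as "1e-5" may arrive as a string; the cast makes it a float.
config = cast_config({"weight_decay": "1e-5", "n_beams": "5", "other": "x"})
```

Boolean options are deliberately left out of this sketch: `bool("False")` is `True` in Python, so they would need explicit parsing rather than a plain cast.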
1 change: 1 addition & 0 deletions casanovo/config.yaml
@@ -65,6 +65,7 @@ weight_decay: 1e-5
# Training/inference options.
train_batch_size: 32
predict_batch_size: 1024
+n_beams: 5

logger:
max_epochs: 30
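The `n_beams` option added here sets how many candidate sequences are kept at each decoding step. The toy sketch below shows the general idea of beam search and why it can beat greedy decoding (which corresponds to `n_beams=1`). It is a generic illustration, not Casanovo's implementation: the `beam_search` function, the `TOY_MODEL` table, and the `$` stop token are all assumptions made for the example, and the precursor-mass checks mentioned in the commit message are omitted.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(
    step_log_probs: Callable[[Sequence], Dict[str, float]],
    n_beams: int,
    max_length: int,
    stop_token: str = "$",
) -> List[Tuple[Sequence, float]]:
    """Return decoded sequences with their log-probabilities, best first."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_length):
        # Extend every active beam with every possible next token.
        candidates = [
            (prefix + (token,), score + log_p)
            for prefix, score in beams
            for token, log_p in step_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep only the n_beams best candidates; set finished ones aside.
        beams = []
        for seq, score in candidates[:n_beams]:
            (finished if seq[-1] == stop_token else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # Beams that hit max_length without stopping.
    return sorted(finished, key=lambda c: c[1], reverse=True)


# Toy next-token probabilities, keyed by the prefix decoded so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"$": 0.55, "C": 0.45},
    ("A", "C"): {"$": 1.0},
    ("B",): {"$": 1.0},
}


def toy_step(prefix: Sequence) -> Dict[str, float]:
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix].items()}


greedy_best = beam_search(toy_step, n_beams=1, max_length=5)[0]
beam_best = beam_search(toy_step, n_beams=2, max_length=5)[0]
```

In this toy model, greedy decoding commits to `A` (probability 0.6) and ends with `A $` at total probability 0.33, while two beams also keep `B` alive and find `B $` at total probability 0.4.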
27 changes: 15 additions & 12 deletions casanovo/denovo/evaluate.py
@@ -1,6 +1,6 @@
 """Methods to evaluate peptide-spectrum predictions."""
 import re
-from typing import Dict, List, Tuple
+from typing import Dict, Iterable, List, Tuple

 import numpy as np
 from spectrum_utils.utils import mass_diff
@@ -182,8 +182,8 @@ def aa_match(


 def aa_match_batch(
-    peptides1: List[str],
-    peptides2: List[str],
+    peptides1: Iterable,
+    peptides2: Iterable,
     aa_dict: Dict[str, float],
     cum_mass_threshold: float = 0.5,
     ind_mass_threshold: float = 0.1,
@@ -194,10 +194,10 @@ def aa_match_batch(
     Parameters
     ----------
-    peptides1 : List[str]
-        The first list of (untokenized) peptide sequences to be compared.
-    peptides2 : List[str]
-        The second list of (untokenized) peptide sequences to be compared.
+    peptides1 : Iterable
+        The first list of peptide sequences to be compared.
+    peptides2 : Iterable
+        The second list of peptide sequences to be compared.
     aa_dict : Dict[str, float]
         Mapping of amino acid tokens to their mass values.
     cum_mass_threshold : float
@@ -221,13 +221,16 @@ def aa_match_batch(
     """
     aa_matches_batch, n_aa1, n_aa2 = [], 0, 0
     for peptide1, peptide2 in zip(peptides1, peptides2):
-        tokens1 = re.split(r"(?<=.)(?=[A-Z])", peptide1)
-        tokens2 = re.split(r"(?<=.)(?=[A-Z])", peptide2)
-        n_aa1, n_aa2 = n_aa1 + len(tokens1), n_aa2 + len(tokens2)
+        # Split peptides into individual AAs if necessary.
+        if isinstance(peptide1, str):
+            peptide1 = re.split(r"(?<=.)(?=[A-Z])", peptide1)
+        if isinstance(peptide2, str):
+            peptide2 = re.split(r"(?<=.)(?=[A-Z])", peptide2)
+        n_aa1, n_aa2 = n_aa1 + len(peptide1), n_aa2 + len(peptide2)
         aa_matches_batch.append(
             aa_match(
-                tokens1,
-                tokens2,
+                peptide1,
+                peptide2,
                 aa_dict,
                 cum_mass_threshold,
                 ind_mass_threshold,
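The change above lets `aa_match_batch` accept peptides either as strings or as already tokenized sequences, splitting only when a string is passed. The standalone sketch below isolates that normalization step using the same regular expression as the diff; the helper name `to_tokens` and the example peptide are illustrative assumptions and do not appear in the commit.

```python
import re
from typing import Iterable, List, Union


def to_tokens(peptide: Union[str, Iterable[str]]) -> List[str]:
    """Return the peptide as a list of amino acid tokens.

    Strings are split before every uppercase letter (except at position 0),
    so a mass modification stays attached to the residue it follows.
    Anything else is assumed to be tokenized already.
    """
    if isinstance(peptide, str):
        return re.split(r"(?<=.)(?=[A-Z])", peptide)
    return list(peptide)


from_string = to_tokens("PEPM+15.995K")
from_tokens = to_tokens(["P", "E", "P", "M+15.995", "K"])
```

Both calls yield `["P", "E", "P", "M+15.995", "K"]`, so downstream code can count and compare amino acids without caring which form the caller supplied. Splitting on zero-width matches with `re.split` requires Python 3.7 or later.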