Automatically download model weights #68

Merged
merged 47 commits into from
Nov 3, 2022
47 commits
21e7487
Download model weights from GitHub release
bittremieux Aug 23, 2022
a6467ee
Include dependencies
bittremieux Aug 23, 2022
2949861
Update model usage documentation
bittremieux Aug 23, 2022
7408805
Reformat with black
bittremieux Aug 23, 2022
1c7b1bd
Download weights to the OS-specific app dir
bittremieux Aug 24, 2022
325d050
Don't download weights if already in cache dir
bittremieux Aug 24, 2022
2880967
Update model file instructions
bittremieux Aug 24, 2022
d5e0244
Remove release notes from the README
bittremieux Aug 24, 2022
6d2aa38
Remove explicit model specification from example commands
bittremieux Aug 24, 2022
1692936
Harmonize default parameters and config values
bittremieux Aug 24, 2022
a14f785
No need to specify config file by default
bittremieux Aug 24, 2022
84ea01a
Merge pull request #69 from Noble-Lab/config
bittremieux Aug 27, 2022
1688d68
Simplify version matching regex
bittremieux Sep 9, 2022
904c7fd
Remove depthcharge related tests
bittremieux Sep 9, 2022
96e8c24
Make sure that package data is included
bittremieux Sep 9, 2022
681986f
Merge remote-tracking branch 'origin/main' into weights
bittremieux Sep 9, 2022
965a04a
Remove obsolote (ppx) tests
bittremieux Sep 9, 2022
29a0c36
Update integration test
bittremieux Sep 9, 2022
745b0ce
Resolve merge conflicts
wfondrie Oct 11, 2022
aa7f47f
Add MacOS support and support for Apple's MPS chips
wfondrie Oct 11, 2022
5292da9
Fail test but print version
wfondrie Oct 11, 2022
cd59657
Added n_worker fn and tests
wfondrie Oct 11, 2022
93b53d8
Create split_version fn and add unit tests
wfondrie Oct 11, 2022
f0eba58
Fix debugging unit test
wfondrie Oct 11, 2022
31f5b17
Explicitly set version
wfondrie Oct 11, 2022
27838b8
Monkeypatch loaded version
wfondrie Oct 11, 2022
ea02de0
Add device selector, so that on CPU-only runs the devices > 0
wfondrie Oct 11, 2022
9e03cc9
Add windows patch
wfondrie Oct 25, 2022
b055e6d
Fix typo
wfondrie Oct 26, 2022
a3645fd
Revert
wfondrie Oct 26, 2022
2bb3a55
Use main process for data loading on Windows
wfondrie Oct 26, 2022
683ebbb
Fix typo
wfondrie Oct 26, 2022
c275127
Fix unit test
wfondrie Oct 26, 2022
7057600
Fix devices for when num_workers == 0
wfondrie Oct 26, 2022
58e4ce1
Fix devices for when num_workers == 0
wfondrie Oct 26, 2022
7115c2d
Minor README updates
bittremieux Nov 2, 2022
22ce3bf
Import reordering
bittremieux Nov 2, 2022
8f00696
Minor code and docstring reformatting
bittremieux Nov 2, 2022
af407fa
Test model weights retrieval
bittremieux Nov 2, 2022
98f242d
Merge remote-tracking branch 'origin/main' into weights
bittremieux Nov 2, 2022
767acd4
Fix getting the number of devices
bittremieux Nov 2, 2022
1e8c655
Disable excessive Tensorboard deprecation warnings
bittremieux Nov 2, 2022
e922b76
Don't use worker threads on MacOS
bittremieux Nov 2, 2022
b7188d6
Warnings need to be ignored before import
bittremieux Nov 2, 2022
e7f7df6
Additional weights tests
bittremieux Nov 2, 2022
d6fc99b
Disable tests on MacOS
bittremieux Nov 3, 2022
aec590a
Include Python 3.10 as supported version
bittremieux Nov 3, 2022
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -14,7 +14,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, windows-latest, macos-latest]
os: [ubuntu-latest, windows-latest]

steps:
- uses: actions/checkout@v2
58 changes: 42 additions & 16 deletions README.md
@@ -8,8 +8,6 @@ If you use Casanovo in your work, please cite the following publication:

- Yilmaz, M., Fondrie, W. E., Bittremieux, W., Oh, S. & Noble, W. S. *De novo* mass spectrometry peptide sequencing with a transformer model. in *Proceedings of the 39th International Conference on Machine Learning - ICML '22* vol. 162 25514–25522 (PMLR, 2022). [https://proceedings.mlr.press/v162/yilmaz22a.html](https://proceedings.mlr.press/v162/yilmaz22a.html)

Data and pre-trained model weights are available [on Zenodo](https://zenodo.org/record/6791263).

## Documentation

#### https://casanovo.readthedocs.io/en/latest/
@@ -18,7 +16,7 @@ Data and pre-trained model weights are available [on Zenodo](https://zenodo.org/

We recommend running Casanovo in a dedicated **Anaconda** environment.
This helps keep your environment for Casanovo and its dependencies separate from your other Python environments.
**This is especially helpful because Casanovo works within a specific range of Python versions (3.8 ≥ Python version > 3.10).**
**This is especially helpful because Casanovo works within a specific range of Python versions (3.8 ≤ Python version ≤ 3.10).**

- Check out the [Windows](https://docs.anaconda.com/anaconda/install/windows/#), [MacOS](https://docs.anaconda.com/anaconda/install/mac-os/), and [Linux](https://docs.anaconda.com/anaconda/install/linux/) installation instructions.

@@ -49,9 +47,9 @@ The base environment most likely will not work.

### Installation

Install Casanovo as a Python package from this repository (requires 3.8 ≥ [Python version] > 3.10 , dependencies will be installed automatically as needed):
Install Casanovo as a Python package from this repository (requires 3.8 ≤ [Python version] ≤ 3.10; dependencies will be installed automatically as needed):

```
``` sh
pip install casanovo
```

@@ -60,12 +58,22 @@ Once installed, Casanovo can be used with a simple command line interface.
All auxiliary data, model, and training-related parameters can be specified in a user-created `.yaml` configuration file.
See [`casanovo/config.yaml`](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) for the default configuration that was used to obtain the reported results.


### Model weights

When running Casanovo in `denovo` or `eval` mode, Casanovo needs compatible pretrained model weights to make predictions.
Model weights can be found on the [Releases page](https://github.com/Noble-Lab/casanovo/releases) under the "Assets" for each release (file extension: .ckpt).
The model file can then be specified using the `--model` command-line parameter when executing Casanovo.
To assist users, if no model file is specified Casanovo will try to download and use a compatible model file automatically.

Not every release includes a model file on the [Releases page](https://github.com/Noble-Lab/casanovo/releases). In that case, model weights from another release with the same major version number can be used.

### Example commands

- To run _de novo_ sequencing:

```
casanovo --mode=denovo --model=path/to/pretrained.ckpt --peak_path=path/to/predict/spectra.mgf --config=path/to/config.yaml --output=path/to/output
casanovo --mode=denovo --peak_path=path/to/predict/spectra.mgf --output=path/to/output
```

Casanovo can predict peptide sequences for MS/MS data in mzML, mzXML, and MGF files.
@@ -74,15 +82,15 @@ This will write peptide predictions for the given MS/MS spectra to the specified
- To evaluate _de novo_ sequencing performance based on known spectrum annotations:

```
casanovo --mode=eval --model=path/to/pretrained.ckpt --peak_path=path/to/test/annotated_spectra.mgf --config=path/to/config.yaml
casanovo --mode=eval --peak_path=path/to/test/annotated_spectra.mgf
```

To evaluate the peptide predictions, ground truth peptide labels need to be provided as an annotated MGF file.

- To train a model from scratch:

```
casanovo --mode=train --peak_path=path/to/train/annotated_spectra.mgf --peak_path_val=path/to/validation/annotated_spectra.mgf --config=path/to/config.yaml
casanovo --mode=train --peak_path=path/to/train/annotated_spectra.mgf --peak_path_val=path/to/validation/annotated_spectra.mgf
```

Training and validation MS/MS data need to be provided as annotated MGF files.
@@ -95,16 +103,13 @@ We will demonstrate how to use Casanovo using a small walkthrough example on a s
The example MGF file is available at [`sample_data/sample_preprocessed_spectra.mgf`](https://github.com/Noble-Lab/casanovo/blob/main/sample_data/sample_preprocessed_spectra.mgf).

1. Install Casanovo (see above for details).
2. Download the `casanovo_pretrained_model_weights.zip` from [Zenodo](https://zenodo.org/record/6791263). Place these models in a location that you can easily access and know the path of.
- We will be `using pretrained_excl_mouse.ckpt` for this job.
3. Copy the example `config.yaml` file into a location you can easily access.
4. Ensure you are in the proper anaconda environment by typing `conda activate casanovo_env`. (If you named your environment differently, type in that name instead.)
5. Run this command:
2. Ensure you are in the proper anaconda environment by typing `conda activate casanovo_env`. (If you named your environment differently, type in that name instead.)
3. Run this command:
```
casanovo --mode=denovo --model=[PATH_TO]/pretrained_excl_mouse.ckpt --peak_path=[PATH_TO]/sample_preprocessed_spectra.mgf --config=[PATH_TO]/config.yaml
casanovo --mode=denovo --peak_path=[PATH_TO]/sample_preprocessed_spectra.mgf
```
Make sure you use the proper filepath to the `pretrained_excl_mouse.ckpt` file.
- Note: If you want to get the output mzTab file in different location than the working directory, specify an alternative output location using the `--output` parameter.

Note: If you want to store the output mzTab file in a different location than the current working directory, specify an alternative output location using the `--output` parameter.

This job will take very little time to run (< 1 minute).

@@ -127,8 +132,29 @@ Run the following command in your command prompt to see all possible command-lin
casanovo --help
```

Additionally, you can use a configuration file to fully customize Casanovo.
You can find the `config.yaml` configuration file that is used by default [here](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml).

**I get a "CUDA out of memory" error when trying to run Casanovo. Help!**

This means that there was not enough (free) memory available on your GPU to run Casanovo, which is especially likely to happen when you are using a smaller, consumer-grade GPU.
We recommend trying to decrease the `train_batch_size` or `predict_batch_size` options in the [config file](https://github.com/Noble-Lab/casanovo/blob/main/casanovo/config.yaml) (depending on whether the error occurred during `train` or `denovo` mode) to reduce the number of spectra that are processed simultaneously.
Additionally, we recommend shutting down any other processes that may be running on the GPU, so that Casanovo can exclusively use the GPU.

**How do I solve a "PermissionError: GitHub API rate limit exceeded" error when trying to run Casanovo?**

When running Casanovo in `denovo` or `eval` mode, Casanovo needs compatible pretrained model weights to make predictions.
If no model weights file is specified using the `--model` command-line parameter, Casanovo will automatically try to download the latest compatible model file from GitHub and save it to its cache for subsequent use.
However, the GitHub API is limited to a maximum of 60 requests per hour per IP address.
Consequently, if Casanovo has been executed multiple times already, it might temporarily not be able to communicate with GitHub.
You can avoid this error by explicitly specifying the model file using the `--model` parameter.

**I see "NotImplementedError: The operator 'aten::index.Tensor'..." when using a Mac with an Apple Silicon chip.**

Casanovo can leverage Apple's Metal Performance Shaders (MPS) on newer Mac computers, which requires that the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable is set to `1`:

``` sh
export PYTORCH_ENABLE_MPS_FALLBACK=1
```

This will need to be set with each new shell session, or you can add it to your `.bashrc` / `.zshrc` to set this environment variable by default.
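
As an alternative to exporting the variable in the shell, it can also be set from Python. This is a minimal sketch, assuming the variable has to be in place before `torch` is first imported; the helper name `enable_mps_fallback` is hypothetical and not part of Casanovo:

```python
import os


def enable_mps_fallback() -> str:
    """Set PYTORCH_ENABLE_MPS_FALLBACK=1 unless a value is already present."""
    # setdefault leaves an explicit user setting (for example "0") untouched.
    return os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


# Call this before the first `import torch` in the process.
value = enable_mps_fallback()
print(f"PYTORCH_ENABLE_MPS_FALLBACK={value}")
```

Using `setdefault` rather than a plain assignment keeps any value the user exported deliberately.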
139 changes: 132 additions & 7 deletions casanovo/casanovo.py
@@ -1,16 +1,27 @@
"""The command line entry point for Casanovo."""
import datetime
import functools
import logging
import os
import re
import shutil
import sys
import warnings
from typing import Optional, Tuple

warnings.filterwarnings("ignore", category=DeprecationWarning)

import appdirs
import click
import psutil
import github
import pytorch_lightning as pl
import requests
import torch
import tqdm
import yaml

from . import __version__
from . import utils
from .data import ms_io
from .denovo import model_runner

@@ -61,11 +72,11 @@
)
def main(
mode: str,
model: str,
model: Optional[str],
peak_path: str,
peak_path_val: str,
config: str,
output: str,
peak_path_val: Optional[str],
config: Optional[str],
output: Optional[str],
):
"""
\b
@@ -105,10 +116,12 @@ def main(
root.addHandler(file_handler)
# Disable dependency non-critical log messages.
logging.getLogger("depthcharge").setLevel(logging.INFO)
logging.getLogger("github").setLevel(logging.WARNING)
logging.getLogger("h5py").setLevel(logging.WARNING)
logging.getLogger("numba").setLevel(logging.WARNING)
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)
logging.getLogger("torch").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)

# Read parameters from the config file.
if config is None:
@@ -163,13 +176,30 @@ def main(
}
# Add extra configuration options and scale by the number of GPUs.
n_gpus = torch.cuda.device_count()
config["n_workers"] = len(psutil.Process().cpu_affinity())
config["n_workers"] = utils.n_workers()
if n_gpus > 1:
config["n_workers"] = config["n_workers"] // n_gpus
config["train_batch_size"] = config["train_batch_size"] // n_gpus

pl.utilities.seed.seed_everything(seed=config["random_seed"], workers=True)

    # Download model weights if these were not specified (except when training).
    if model is None and mode != "train":
        try:
            model = _get_model_weights()
        except github.RateLimitExceededException:
            logger.error(
                "GitHub API rate limit exceeded while trying to download the "
                "model weights. Please download compatible model weights "
                "manually from the official Casanovo code website "
                "(https://github.com/Noble-Lab/casanovo) and specify these "
                "explicitly using the `--model` parameter when running "
                "Casanovo."
            )
            raise PermissionError(
                "GitHub API rate limit exceeded while trying to download the "
                "model weights"
            ) from None

# Log the active configuration.
logger.info("Casanovo version %s", str(__version__))
logger.debug("mode = %s", mode)
@@ -198,5 +228,100 @@ def main(
model_runner.train(peak_path, peak_path_val, model, config)
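
The per-GPU scaling earlier in `main` divides the worker count and the training batch size over the available GPUs. Below is a minimal sketch of that arithmetic on a plain dictionary. The function name `scale_for_gpus` is hypothetical; only the `n_workers` and `train_batch_size` keys are taken from the diff.

```python
from typing import Dict


def scale_for_gpus(config: Dict[str, int], n_gpus: int) -> Dict[str, int]:
    """Return a copy of config with per-GPU worker and batch-size values."""
    scaled = dict(config)
    # Mirror the `if n_gpus > 1` guard: CPU-only and single-GPU runs
    # keep their values unchanged.
    if n_gpus > 1:
        scaled["n_workers"] = config["n_workers"] // n_gpus
        scaled["train_batch_size"] = config["train_batch_size"] // n_gpus
    return scaled


print(scale_for_gpus({"n_workers": 8, "train_batch_size": 32}, n_gpus=4))
# → {'n_workers': 2, 'train_batch_size': 8}
```

Because floor division is used, any remainder is dropped: 8 workers spread over 3 GPUs gives 2 workers per GPU.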


def _get_model_weights() -> str:
    """
    Use cached model weights or download them from GitHub.

    If no weights file (extension: .ckpt) is available in the cache directory,
    it will be downloaded from a release asset on GitHub.
    Model weights are retrieved by matching release version. If no model weights
    for an identical release (major, minor, patch) are available, alternative
    releases with matching (i) major and minor, or (ii) major versions will be
    used.
    If no matching release can be found, no model weights will be downloaded.

    Note that the GitHub API is limited to 60 requests from the same IP per
    hour.

    Returns
    -------
    str
        The name of the model weights file.
    """
    cache_dir = appdirs.user_cache_dir("casanovo", False, opinion=False)
    os.makedirs(cache_dir, exist_ok=True)
    version = utils.split_version(__version__)
    version_match: Tuple[Optional[str], Optional[str], int] = None, None, 0
    # Try to find suitable model weights in the local cache.
    for filename in os.listdir(cache_dir):
        root, ext = os.path.splitext(filename)
        if ext == ".ckpt":
            file_version = tuple(
                g for g in re.match(r".*_v(\d+)_(\d+)_(\d+)", root).groups()
            )
            match = sum([i == j for i, j in zip(version, file_version)])
            if match > version_match[2]:
                version_match = os.path.join(cache_dir, filename), None, match
    # Provide the cached model weights if found.
    if version_match[2] > 0:
        logger.info(
            "Model weights file %s retrieved from local cache",
            version_match[0],
        )
        return version_match[0]
    # Otherwise try to find compatible model weights on GitHub.
    else:
        repo = github.Github().get_repo("Noble-Lab/casanovo")
        # Find the best matching release with model weights provided as asset.
        for release in repo.get_releases():
            rel_version = tuple(
                g
                for g in re.match(
                    r"v(\d+)\.(\d+)\.(\d+)", release.tag_name
                ).groups()
            )
            match = sum([i == j for i, j in zip(version, rel_version)])
            if match > version_match[2]:
                for release_asset in release.get_assets():
                    fn, ext = os.path.splitext(release_asset.name)
                    if ext == ".ckpt":
                        version_match = (
                            os.path.join(
                                cache_dir,
                                f"{fn}_v{'_'.join(map(str, rel_version))}{ext}",
                            ),
                            release_asset.browser_download_url,
                            match,
                        )
                        break
        # Download the model weights if a matching release was found.
        if version_match[2] > 0:
            filename, url, _ = version_match
            logger.info(
                "Downloading model weights file %s from %s", filename, url
            )
            r = requests.get(url, stream=True, allow_redirects=True)
            r.raise_for_status()
            file_size = int(r.headers.get("Content-Length", 0))
            desc = "(Unknown total file size)" if file_size == 0 else ""
            r.raw.read = functools.partial(r.raw.read, decode_content=True)
            with tqdm.tqdm.wrapattr(
                r.raw, "read", total=file_size, desc=desc
            ) as r_raw, open(filename, "wb") as f:
                shutil.copyfileobj(r_raw, f)
            return filename
        else:
            logger.error(
                "No matching model weights for release v%s found, please "
                "specify your model weights explicitly using the `--model` "
                "parameter",
                __version__,
            )
            raise ValueError(
                f"No matching model weights for release v{__version__} found, "
                f"please specify your model weights explicitly using the "
                f"`--model` parameter"
            )
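
The match score in `_get_model_weights` counts how many of the (major, minor, patch) components agree between the running version and a candidate. The standalone sketch below reproduces that scoring next to a stricter prefix-based variant that stops at the first mismatch. Here `split_version` is a simplified stand-in for `utils.split_version`, whose implementation is not shown in this diff, and `match_score` and `prefix_score` are hypothetical names.

```python
import re
from typing import Tuple


def split_version(version: str) -> Tuple[str, str, str]:
    """Simplified stand-in: extract (major, minor, patch) as strings."""
    m = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"Cannot parse version {version!r}")
    return m.group(1), m.group(2), m.group(3)


def match_score(a: Tuple[str, ...], b: Tuple[str, ...]) -> int:
    """Number of positions at which two version tuples agree (as in the diff)."""
    return sum(i == j for i, j in zip(a, b))


def prefix_score(a: Tuple[str, ...], b: Tuple[str, ...]) -> int:
    """Number of leading components that agree; stops at the first mismatch."""
    score = 0
    for i, j in zip(a, b):
        if i != j:
            break
        score += 1
    return score


current = split_version("3.1.0")
for tag in ("v3.1.0", "v3.0.0", "v2.1.0"):
    candidate = split_version(tag)
    print(tag, match_score(current, candidate), prefix_score(current, candidate))
```

With the positional count, `v2.1.0` scores 2 against `3.1.0` even though the major version differs, whereas the prefix variant scores it 0. Whether that difference matters in practice depends on which releases actually carry `.ckpt` assets.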


if __name__ == "__main__":
main()
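
To sidestep the download, and with it the GitHub rate limit, an already cached checkpoint can be passed explicitly through `--model`. The following is a minimal sketch that looks for `.ckpt` files named with the `_v<major>_<minor>_<patch>` suffix used in `_get_model_weights`. The directory is taken as an argument rather than resolved through `appdirs`, `find_cached_weights` is a hypothetical helper that is not part of Casanovo, and the file names in the demo are made up.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def find_cached_weights(cache_dir: str) -> List[Path]:
    """List .ckpt files whose stem ends in a _v<major>_<minor>_<patch> suffix."""
    pattern = re.compile(r".*_v(\d+)_(\d+)_(\d+)$")
    directory = Path(cache_dir)
    if not directory.is_dir():
        return []
    return sorted(
        path
        for path in directory.iterdir()
        # Skip files that do not follow the naming scheme instead of failing.
        if path.suffix == ".ckpt" and pattern.match(path.stem)
    )


# Demo on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example_weights_v3_0_0.ckpt").touch()
    (Path(tmp) / "notes.txt").touch()
    for path in find_cached_weights(tmp):
        print(f"casanovo --mode=denovo --model={path} --peak_path=path/to/predict/spectra.mgf")
```

Unlike the cache loop in `_get_model_weights`, which calls `.groups()` on the result of `re.match` directly, this sketch skips `.ckpt` files that lack the version suffix.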
4 changes: 2 additions & 2 deletions casanovo/config.yaml
@@ -8,7 +8,7 @@ random_seed: 454

# Spectrum processing options.
n_peaks: 150
min_mz: 50.52564895 # 1.0005079 * 50.5
min_mz: 50.0
max_mz: 2500.0
min_intensity: 0.01
remove_precursor_tol: 2.0 # Da
@@ -21,7 +21,7 @@ dim_model: 512
n_head: 8
dim_feedforward: 1024
n_layers: 9
dropout: 0
dropout: 0.0
dim_intensity:
custom_encoder:
max_length: 100
2 changes: 1 addition & 1 deletion casanovo/denovo/dataloaders.py
@@ -54,7 +54,7 @@ def __init__(
test_index: Optional[AnnotatedSpectrumIndex] = None,
batch_size: int = 128,
n_peaks: Optional[int] = 150,
min_mz: float = 140.0,
min_mz: float = 50.0,
max_mz: float = 2500.0,
min_intensity: float = 0.01,
remove_precursor_tol: float = 2.0,
4 changes: 2 additions & 2 deletions casanovo/denovo/model.py
@@ -79,10 +79,10 @@ class Spec2Pep(pl.LightningModule, ModelMixin):

def __init__(
self,
dim_model: int = 128,
dim_model: int = 512,
n_head: int = 8,
dim_feedforward: int = 1024,
n_layers: int = 1,
n_layers: int = 9,
dropout: float = 0.0,
dim_intensity: Optional[int] = None,
custom_encoder: Optional[SpectrumEncoder] = None,