NetTCR-2.1 - Sequence-based prediction of peptide-TCR interactions using CDR1, CDR2 and CDR3 loops

NetTCR-2.1 is a deep learning model used to predict TCR specificity. NetTCR-2.1 uses convolutional neural networks (CNN) to predict whether a given TCR binds a specific peptide. The NetTCR-2.1 publication is available at https://www.frontiersin.org/articles/10.3389/fimmu.2022.1055151/full.

The scripts in this repo allow training and testing of models. It is possible to train/test using CDR3 only (with train_nettcr_cdr3.py and test_nettcr_cdr3.py) or all the CDRs (with train_nettcr_cdr123.py and test_nettcr_cdr123.py).

Data

The input datasets should contain the CDR and peptide sequences. For CDR3 training/testing, at least the columns A3 and B3 must be present (with headers). For CDR123, the columns must be A1, A2, A3, B1, B2 and B3. All input files should be comma-separated.

See data/GILGFVFTL/train.csv as an example of a CDR123 dataset.

NB! Since the NetTCR models are peptide-specific, the peptide sequence is not needed in the input file. Make sure that all the TCRs in the input file refer to the same peptide.
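
A quick header check before training or testing can catch a missing column early. The following is a minimal sketch, not part of the repository: the column names come from the description above, while the helper name missing_columns and the in-memory example row are hypothetical placeholders.

```python
import csv
import io

# Column names as listed in this README.
CDR3_COLUMNS = ["A3", "B3"]
CDR123_COLUMNS = ["A1", "A2", "A3", "B1", "B2", "B3"]


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header.

    Hypothetical helper for illustration; csv_file is any open text file.
    """
    reader = csv.reader(csv_file)
    header = [name.strip() for name in next(reader, [])]
    return [name for name in required if name not in header]


# In-memory stand-in for a real dataset; the sequences are placeholders.
example = io.StringIO("A3,B3\nCAVSGGSYIPTF,CASSIRSSYEQYF\n")
print(missing_columns(example, CDR123_COLUMNS))
```

For a file on disk, the same helper can be called as missing_columns(open("data/GILGFVFTL/train.csv", newline=""), CDR123_COLUMNS); an empty list means every required column is present.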

The folder data/ contains the data used to train/validate/test NetTCR-2.1. Each data file contains information about the 6 CDR loops, the V/J genes, the target peptide and the HLA. The positive data was retrieved from the IEDB, VDJdb, 10X Genomics and McPAS datasets; the negative data either comes from 10X (denoted as true_neg) or is generated by mismatching positive TCRs and peptides (denoted as swapped_neg).
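
To illustrate what mismatching positive TCRs and peptides means, here is a minimal sketch. It is an assumption-laden illustration rather than the procedure used to build the datasets: the function name swapped_negatives, the one-negative-per-TCR ratio and the random sampling are all hypothetical choices.

```python
import random


def swapped_negatives(positives, seed=0):
    """Pair each positive TCR with a peptide it is not known to bind.

    positives: list of (tcr, peptide) tuples.
    Returns one (tcr, peptide) tuple per TCR for which a swap exists.
    The 1:1 ratio and uniform sampling are assumptions of this sketch.
    """
    rng = random.Random(seed)
    known = set(positives)
    peptides = sorted({pep for _, pep in positives})
    negatives = []
    for tcr, _ in positives:
        # Candidates are peptides never observed as positive for this TCR.
        candidates = [p for p in peptides if (tcr, p) not in known]
        if candidates:
            negatives.append((tcr, rng.choice(candidates)))
    return negatives


# Placeholder TCR identifiers; the peptides are the two named in this README.
positives = [("TCR_1", "GILGFVFTL"), ("TCR_2", "RAKFKQLL")]
print(swapped_negatives(positives))
```

Excluding pairs already present in the positive set avoids labelling a known binder as a negative.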

The redundancy in the dataset was reduced using the Hobohm1 algorithm [1], with the kernel similarity measure [2] and a similarity threshold of 0.95. Thus the training, validation and test datasets do not share similar TCR sequences (up to the 0.95 similarity threshold).
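
The greedy selection performed by Hobohm1 can be sketched as follows. This is a simplified illustration under stated assumptions: the similarity function below is a stand-in based on difflib and is NOT the kernel similarity of [2], and treating a score at or above the threshold as redundant is an assumption about the boundary case.

```python
from difflib import SequenceMatcher


def stand_in_similarity(a, b):
    """Placeholder similarity in [0, 1]; not the kernel similarity of [2]."""
    return SequenceMatcher(None, a, b).ratio()


def hobohm1(sequences, similarity, threshold=0.95):
    """Greedy Hobohm1-style selection.

    Sequences are visited in the given order; a sequence is kept only if its
    similarity to every sequence kept so far is below the threshold.
    """
    kept = []
    for seq in sequences:
        if all(similarity(seq, k) < threshold for k in kept):
            kept.append(seq)
    return kept


# Placeholder sequences: the second is an exact duplicate of the first.
seqs = ["CASSIRSSYEQYF", "CASSIRSSYEQYF", "CAVSGGSYIPTF"]
print(hobohm1(seqs, stand_in_similarity))
```

Because the selection is greedy, the input order decides which member of a group of similar sequences is retained.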

Environment setup

First, create the conda environment by running conda env create -f environment.yml. This will create a conda environment called nettcr_env with the necessary dependencies.

Network training

The input files for the training scripts are the training dataset and the validation dataset; the latter is used for early stopping.

Example:

python src/train_nettcr_cdr3.py --train_data data/RAKFKQLL/train.csv --val_data data/RAKFKQLL/validation.csv --outdir out/<model_name>/

This will generate and save a .pt file with the trained model. The output directory has to be specified with the option --outdir.

The other input arguments to the script are --epochs, --learning_rate and --verbose. If a GPU is available, the script will detect it and use it.
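
For readers unfamiliar with early stopping on a validation set, the idea can be sketched in plain Python. This is a generic illustration, not the logic of the training scripts: the patience parameter and the function name are hypothetical, and the actual stopping criterion used by the scripts is not described in this README.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training is considered stopped once `patience` epochs have passed
    without a new minimum of the validation loss. Epochs are 0-indexed;
    (-1, -1) is returned for an empty sequence.
    """
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            # New best model: this is the checkpoint one would keep.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stop_epoch


# Made-up validation losses for illustration only.
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]))
```

In this example the loss last improves at epoch 2, so training stops at epoch 5 and the later value 0.5 is never reached; the model from epoch 2 is the one that would be saved.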

Network testing

The test scripts can be used to make predictions for test TCRs, using a pre-trained model.

Example:

python src/test_nettcr_cdr3.py --test_data data/RAKFKQLL/test.csv --trained_model out/<model_name>/trained_model_cdr3_ab.pt --outdir out/<model_name>/

This will generate and save a .csv file with the predictions. The file will be saved in the specified output directory.

Pre-trained models

The folder pretrained_models contains the models from [3]. The pretrained models cover both the NetTCR-2.1 CDR3 and CDR123 architectures, with paired alpha and beta chains. For each network configuration, the peptide-specific models are provided. For each peptide, the network was trained using 5-fold nested cross-validation; this results in 20 models per peptide (5 outer test partitions times 4 inner validation partitions). The final prediction score is given by the average of the 20 predictions. The following example shows how to test the pretrained models.

python src/test_pretrained_cdr3.py --test_data data/RAKFKQLL/test.csv --trained_models_dir pretrained_models/cdr3_pep/RAKFKQLL/ --outdir <path/for/prediction/file>

NB! NetTCR-2.1 is a peptide-specific model. Make sure that the pretrained model and the test data refer to the same peptide.
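
The averaging of the per-model predictions described above can be sketched as follows. This is a minimal illustration and not the code in test_pretrained_cdr3.py: the function name ensemble_scores and the list-of-lists input layout are assumptions.

```python
def ensemble_scores(per_model_scores):
    """Average prediction scores across models, TCR by TCR.

    per_model_scores: list of lists; each inner list holds one model's
    scores for the same TCRs in the same order (hypothetical layout).
    """
    if not per_model_scores:
        raise ValueError("need predictions from at least one model")
    n_tcrs = len(per_model_scores[0])
    if any(len(scores) != n_tcrs for scores in per_model_scores):
        raise ValueError("all models must score the same number of TCRs")
    n_models = len(per_model_scores)
    # zip(*...) regroups the scores so each tuple belongs to one TCR.
    return [sum(column) / n_models for column in zip(*per_model_scores)]


# Two models and two TCRs with made-up scores; 20 models work the same way.
print(ensemble_scores([[0.25, 0.75], [0.75, 0.25]]))
```

With 20 models per peptide, per_model_scores would simply contain 20 inner lists.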

References

[1] Hobohm, Uwe, et al. "Selection of representative protein data sets." Protein Science 1.3 (1992): 409-417.

[2] Shen, Wen-Jun, et al. "Towards a mathematical foundation of immunology and amino acid chains." arXiv preprint arXiv:1205.6031 (2012).

[3] Montemurro, Alessandro, et al. "NetTCR-2.1: Lessons and guidance on how to develop models for TCR specificity predictions." Frontiers in Immunology 13 (2022): 1055151.
