NetTCR-2.1 is a deep learning model used to predict TCR specificity. NetTCR-2.1 uses convolutional neural networks (CNN) to predict whether a given TCR binds a specific peptide. The NetTCR-2.1 publication is available at https://www.frontiersin.org/articles/10.3389/fimmu.2022.1055151/full.
The scripts in this repo allow training and testing of models. It is possible to train/test using CDR3 only (with train_nettcr_cdr3.py and test_nettcr_cdr3.py) or all the CDRs (with train_nettcr_cdr123.py and test_nettcr_cdr123.py).
The input datasets should contain the CDR and peptide sequences. For CDR3 training/testing, at least the columns A3 and B3 should be present (with headers). For CDR123, the columns should be A1, A2, A3, B1, B2, B3. All input files should be comma-separated.
See data/GILGFVFTL/train.csv as an example of a CDR123 dataset.
NB! Since the NetTCR models are peptide-specific, the peptide sequence is not needed in the input file. Make sure that all the TCRs in the input file refer to the same peptide.
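Before training or testing, it can help to check that an input file has the expected header. The snippet below is a minimal sketch of such a check, not part of the repo: the required column names come from the description above, while the helper names and the `peptide` column name are assumptions made for illustration. The sequences in the example are made up.

```python
import csv
import io

# Column names taken from the description above; the rest is illustrative.
REQUIRED_COLUMNS = {
    "cdr3": ["A3", "B3"],
    "cdr123": ["A1", "A2", "A3", "B1", "B2", "B3"],
}


def missing_columns(csv_file, mode):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS[mode] if name not in header]


def peptides_in_file(csv_file, peptide_column="peptide"):
    """Return the distinct peptides in the file (hypothetical column name).

    An empty set means the column is absent; more than one entry means the
    file mixes TCRs for different peptides.
    """
    reader = csv.DictReader(csv_file)
    return {row[peptide_column] for row in reader if row.get(peptide_column)}


# Tiny in-memory example with made-up sequences.
text = "A3,B3,peptide\nCAVSDQGGKLIF,CASSIRSSYEQYF,GILGFVFTL\n"
print(missing_columns(io.StringIO(text), "cdr3"))
print(missing_columns(io.StringIO(text), "cdr123"))
print(peptides_in_file(io.StringIO(text)))
```

An empty list from `missing_columns` means the file can be used with the corresponding pair of scripts.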
The folder data/ contains the data used to train/validate/test NetTCR-2.1. Each data file contains information about the 6 CDR loops, the V/J genes, the target peptide and HLA. The positive data was retrieved from the IEDB, VDJdb, 10X Genomics and McPAS datasets; the negative data comes from 10X (denoted as true_neg) or is generated by mismatching positive TCRs and peptides (denoted as swapped_neg).
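The idea behind swapped negatives can be sketched as follows. This is only an illustration of mismatching positive TCRs and peptides; the actual sampling procedure and the positive-to-negative ratio used for NetTCR-2.1 are not described here, and the function name and TCR identifiers are hypothetical.

```python
import random


def swapped_negatives(positives, seed=0):
    """Pair each positive TCR with a peptide it is not known to bind.

    `positives` is a list of (tcr, peptide) tuples. Illustrative only.
    """
    rng = random.Random(seed)
    peptides = sorted({pep for _, pep in positives})
    known = set(positives)
    negatives = []
    for tcr, pep in positives:
        # Candidate peptides: any other peptide not already paired with this TCR.
        candidates = [p for p in peptides if p != pep and (tcr, p) not in known]
        if candidates:
            negatives.append((tcr, rng.choice(candidates)))
    return negatives


# Made-up TCR identifiers paired with two peptides from the data/ folder.
positives = [("tcr_1", "GILGFVFTL"), ("tcr_2", "RAKFKQLL")]
print(swapped_negatives(positives))
```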
The redundancy in the dataset was reduced using the Hobohm1 algorithm [1], with the kernel similarity measure [2] and a similarity threshold of 0.95. As a result, the training, validation and test datasets do not share TCR sequences with a similarity of 0.95 or higher.
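For readers unfamiliar with Hobohm1, the greedy selection it performs can be sketched as below. Note the assumptions: the similarity function here is a simple fraction of matching positions, used as a stand-in, and is not the kernel similarity of [2]; the function names and example sequences are made up for illustration.

```python
def identity_similarity(a, b):
    """Stand-in similarity: fraction of matching positions (NOT the kernel of [2])."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def hobohm1(sequences, similarity, threshold=0.95):
    """Greedy selection: keep a sequence only if its similarity to every
    sequence already kept is below `threshold`."""
    kept = []
    for seq in sequences:
        if all(similarity(seq, other) < threshold for other in kept):
            kept.append(seq)
    return kept


# Made-up CDR3-like sequences; the second one duplicates the first.
seqs = ["CASSIRSSYEQYF", "CASSIRSSYEQYF", "CASSLAPGATNEKLFF"]
print(hobohm1(seqs, identity_similarity))
```

The result depends on the order of the input, since earlier sequences are kept in preference to later ones.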
First, install the conda environment by running conda env create -f environment.yml. This will create a conda environment called nettcr_env with the necessary dependencies.
The input files for the training scripts are the training dataset and the validation dataset; the latter is used for early stopping.
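As a reminder of what early stopping does with the validation data, here is a generic sketch. The monitored metric, the `patience` value and the function name are assumptions for illustration; the exact criterion used by the training scripts is not stated here.

```python
def best_epoch_with_early_stopping(val_losses, patience=3):
    """Return the 0-based epoch with the lowest validation loss seen before
    `patience` consecutive epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; later epochs are never run
    return best_epoch


# Made-up validation losses: training stops after epoch 5.
losses = [0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.40]
print(best_epoch_with_early_stopping(losses, patience=3))
```

In this example the model from epoch 2 would be kept, and the lower loss at epoch 6 is never reached because training has already stopped.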
Example:
python src/train_nettcr_cdr3.py --train_data data/RAKFKQLL/train.csv --val_data data/RAKFKQLL/validation.csv --outdir out/<model_name>/
This will generate and save a .pt file with the trained model. The output directory has to be specified with the option --outdir.
The other input arguments to the script are --epochs, --learning_rate and --verbose. If a GPU is available, the script will detect it and use it.
The test scripts can be used to make predictions for test TCRs, using a pre-trained model.
Example:
python src/test_nettcr_cdr3.py --test_data data/RAKFKQLL/test.csv --trained_model out/<model_name>/trained_model_cdr3_ab.pt --outdir out/<model_name>/
This will generate and save a .csv file with the predictions. The file will be saved in the specified output directory.
The folder pretrained_models contains the models from [3]. The pretrained models cover both the NetTCR-2.1 CDR3 and CDR123 architectures, with paired alpha and beta chains. For each network configuration, the peptide-specific models are provided. For each peptide, the network was trained using 5-fold nested cross-validation; this results in 20 models per peptide (5 test partitions times 4 validation partitions). The final prediction score is the average of the 20 predictions.
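The averaging step can be sketched as follows. This is a plain-Python illustration of averaging scores across models, not the code in test_pretrained_cdr3.py; the function name and the scores are made up, and only three models are shown instead of 20 to keep the example short.

```python
from statistics import mean


def ensemble_scores(per_model_scores):
    """Average prediction scores across models, TCR by TCR.

    `per_model_scores` has one entry per model; each entry is a list of
    scores for the same TCRs in the same order.
    """
    n_tcrs = len(per_model_scores[0])
    if any(len(scores) != n_tcrs for scores in per_model_scores):
        raise ValueError("all models must score the same TCRs")
    return [mean(tcr_scores) for tcr_scores in zip(*per_model_scores)]


# Made-up scores from three models for two TCRs.
scores = [
    [0.5, 0.25],
    [1.0, 0.75],
    [0.0, 0.5],
]
print(ensemble_scores(scores))
```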
The following example shows how to test the pretrained models.
python src/test_pretrained_cdr3.py --test_data data/RAKFKQLL/test.csv --trained_models_dir pretrained_models/cdr3_pep/RAKFKQLL/ --outdir <path/for/prediction/file>
NB! NetTCR-2.1 is a peptide-specific model. Make sure that the pretrained model and the test data refer to the same peptide.
[1] Hobohm, Uwe, et al. "Selection of representative protein data sets." Protein Science 1.3 (1992): 409-417.
[2] Shen, Wen-Jun, et al. "Towards a mathematical foundation of immunology and amino acid chains." arXiv preprint arXiv:1205.6031 (2012).
[3] Montemurro, Alessandro, et al. "NetTCR-2.1: Lessons and guidance on how to develop models for TCR specificity predictions." Frontiers in Immunology 13 (2022).