LIGYSIS pipeline for user job submission

This repository contains a customised version of our original ligand site analysis LIGYSIS pipeline, which was employed to analyse all biologically relevant protien-ligand interactions on the PDBe [1], which results are served in the LIGYSIS web server. The code for the web server can be found here.

This customised version of LIGYSIS does not rely on the PDBe-KB [2] and PDBe APIs [3], but instead processes from scratch any set of structures in .pdb or .cif formats, and is employed to process user jobs in the LIGYSIS web server.

Pipeline methodology

The pipeline can be summarised in the following steps:

Entry, protein and gene names extraction from UniProt using the user-provided UniProt [4] accession.

Note: Empty strings are returned if the user provided unknown as the uniprot_id argument (only to be used if the structures are from protein not present in UniProt).
Structure cleaning using clean_pdb.py script.
Structural superimposition using STAMP [5].
Simplification of superposed files. This consists in keeping protein atoms only for one of the superposed structures, and heteroatoms for the rest. This is done to generate a lower-weight superposition (all ligands to a single protein scaffold).
Mapping of PDB residues to UniProt by means of a pairwise alignment: protein chain sequence to UniProt sequence associated to user-provided UniProt accession.

Note: If uniprot_id is unknown, this will step will generate a pseudo-mapping where PDB residue numbers are mapped to themselves. For the programme to work correctly, if the protein is not in UniProt, structures should present the same numbering scheme.
Relative solvent accessibility (RSA) [6] and secondary structure elements calculation with DSSP [7] via ProIntVar [8].
Protein-ligand interactions calculation with pdbe-rpeggio [9].
Ligand clustering into binding sites using SciPy [10].
Generation of ChimeraX [11] visualisation scripts.
Multiple sequence alignment with jackhmmer [12].
Shenkin amino acid divergence score calculation [13, 14].
Missense enrichment score calculation with VarAlign [15, 16].
RSA-based clustering label and functional score calculation [17].

Note: this is done on a separate script: predict_rsa_labels.py, which requires a different environment due to dependencies crashing. This programme must be executed on the DEEP_LEAERNING environment.

The final output of the pipeline consists of multiple tables collating the results from the different steps of the analysis for each residue, and for the defined ligand binding sites. These data include relative solvent accessibility (RSA), secondary structure, PDB/UniProt residue number, alignment column, divergence score, missense enrichment score, p-value, etc.

Dependencies

Third party dependencies for these notebooks include:

Other standard python libraries:

BioPython (BSD 3-Clause License) [18]
Keras (Apache v2.0 License) [19]
NumPy (BSD 3-Clause License) [20]
Pandas (BSD 3-Clause License) [21]
SciPy (BSD 3-Clause License)
Scikit-learn (BSD 3-Clause License) [22]
TensorFlow (Apache v2.0 License) [23]

For more information on the dependencies, refer to the .yml files in the ENVS directory. To install all the dependencies, refer to the installation manual.

Environments

The ENVS folder contains three .yml files describing the necessary packages and dependencies for the different parts of the pipeline and analysis.

LIGYSIS is needed to run LIGYSIS.
DEEP_LEARNING contains the packages necessary to do predict the RSA Cluster labels and functional scores with predict_rsa_labels.py.

Note that there are no .yml files for the ARPEGGIO, CLEAN_PDB, DSSP, HMMER environments, as these are created from the command line without the need of a .yml file.

Installation

For complete installation instructions refer here.

Downloading SwissProt

This is the database used for our analysis, but can be changed according to the user purposes, e.g. TrEMBL. What is important is to add the correct path in the configuration file. To download SwissProt, follow the next steps.

# download SwissProt in fasta format (88MB)
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

# decrompress the file
gzip -d uniprot_sprot.fasta.gz

Downloading gnomAD v2.1

This is the database used for our analysis, but can be changed according to the user purposes, e.g. v > 2.1. What is important is to add the correct path in the configuration file. To download gnomAD v2.1 [24], follow the next steps.

# download gnomAD Exomves vcf (large file 58GB)
wget https://storage.googleapis.com/gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz

For more information, refer to gnomAD.

After downloading gnomAD, it is required to run VEP [25] on it, as VarAlign uses its annotations. Information on how to do so here.

Execution

LIGYSIS can be run like this:

python ligysis_custom.py IN/Q9UGL1_cif Q9UGL1 pdb

This needs to be done within the LIGYSIS environment.

The programme uses relative paths, so it is recommended to run it in the repository directory, where it can directly read from ./IN and write output to ./OUT (you need to create this second directory).

The programme has three mandatory arguments:

input_dir is the input directory containing the set of structures to be analysed in either PDB (.pdb, .ent) or mmCIF (.cif) formats. In this example:IN/Q9UGL1_cif.
uniprot_id is the corresponding UniProt accession identifier of the structures within input_dir. In this example: Q9UGL1, which corresponds to human Lysine-specific demethylase 5B.
Finally, struc_fmt, indicating whether the structures are in pdb or mmcif format. Only these two formats are selected, and the programme will not run properly unless a structure format is required.

To carry out the last step and add the RSA-derived Cluster labels and functional scores, the predict_rsa_labels.py needs to be executed on the DEEP_LEARNING environment. This is how to run it:

python predict_rsa_labels.py Q9UGL1_cif

This script only requires a single mandatory argument, which is input_dir the name of the input directory. From this input_dir, the programme will find the relevant files in the ./OUT directory. In this example, it is Q9UGL1_cif. This script will use a multilayer perceptron model [6] to predict RSA-based Cluster labels and functional scores for each defined binding site.

Help and manual

To get help or information about the LIGYSIS pipeline, run:

python ligysis_custom.py -h

which will print the manual of the programme:

usage: ligysis_custom.py [-h] [--override] [--override_variants] [--variants]
                         [--clust_method CLUST_METHOD]
                         [--clust_dist CLUST_DIST] [--hmm_iters HMM_ITERS]
                         [--cons_thresh_high CONS_THRESH_HIGH]
                         [--cons_thresh_low CONS_THRESH_LOW]
                         [--mes_thresh MES_THRESH]
                         input_dir uniprot_id {pdb,mmcif}

LIGYSIS: a ligand binding site analysis pipeline that clusters ligands,
defines, and characterises binding sites.

positional arguments:
  input_dir             Path to directory containing input structures
  uniprot_id            UniProt ID of the protein
  {pdb,mmcif}           Format of the input structures (must be 'pdb' or
                        'mmcif')

optional arguments:
  -h, --help            show this help message and exit
  --override            Override any previously generated files.
  --override_variants   Override any previously generated files (ONLY VARIANTS
                        SECTION).
  --variants            Retrieves Human variants from MSA and generates
                        tables.
  --clust_method CLUST_METHOD
                        Ligand clustering method (default: average)
  --clust_dist CLUST_DIST
                        Ligand clustering distance threshold (default: 0.50)
  --hmm_iters HMM_ITERS
                        Number of iterations for JACKHMMER (default: 3)
  --cons_thresh_high CONS_THRESH_HIGH
                        Conservation high threshold (default: 75)
  --cons_thresh_low CONS_THRESH_LOW
                        Conservation low threshold (default: 25)
  --mes_thresh MES_THRESH
                        MES threshold (default: 1.0)

To get help or information about the RSA-label and scores prediction script, run:

python predict_rsa_labels.py -h

which will print the manual of the programme:

usage: predict_rsa_labels.py [-h] input_id

This script predicts RSA cluster labels and calculates RSA-based functional
score (FS)

positional arguments:
  input_dir    This is the Input ID, Job ID or input directory, i.e..
              name of the directory where the binding site table resides.

options:
  -h, --help  show this help message and exit

Optional command line arguments

--clust_method is the clustering algorithm employed to cluster the ligands into binding sites. Average linkage clustering, average, is used by default. For other methods, check scipy.cluster.hierarchy documentation here.
--clust_dist is the distance thredhold at which the clustering tree or dendrogram is cut to obtain ligand clusters or ligand binding sites. The threshold value is 0.5 by default, i.e., on average the ligands within a cluster share half of their binding residues.
--hmm_iters is the number of iterations for the MSA build using jackHMMER, default is 3. More iterations will retrieve more remote homologues, sequences with an evolutionary link further away in time (less sequence similarity).
--cons_thresh_low and --cons_thresh_high are the thresholds employed for the Normalised Shenkin divergence score [2, 6], ranging 0-100, to determine the classes of conserved and unconserved. Default values are 25 and 75, so only columns presenting ≤ 25% of the maximum divergence will be classified as conserved and columns with a divergence ≥75% as unconserved or divergent.
--mesh_thresh is the threshold employed to classify columns into enriched or depleted in human missense variation, relative to the average of the other columns. The missense enrichment score, MES, is an odds ratio [2, 3, 5], so values > 1 indicate enrichment and < 1 depletion. A values of 1 represents average variation or neutrality within a column. Default threshold is 1.0.

Citation

If you use LIGYSIS pipeline, please cite:

Utgés JS, MacGowan SA, Ives CM, Barton GJ. Classification of likely functional class for ligand binding sites identified from fragment screening. Commun Biol. 2024 Mar 13;7(1):320. doi: 10.1038/s42003-024-05970-8. PMID: 38480979; PMCID: PMC10937669.

Utgés JS, Barton GJ. Comparative evaluation of methods for the prediction of protein-ligand binding sites. J Cheminform. 2024 Nov 11;16(1):126. doi: 10.1186/s13321-024-00923-z. PMID: 39529176; PMCID: PMC11552181.

References

Armstrong DR, Berrisford JM, Conroy MJ, Gutmanas A, Anyango S, Choudhary P, Clark AR, Dana JM, Deshpande M, Dunlop R, Gane P, Gáborová R, Gupta D, Haslam P, Koča J, Mak L, Mir S, Mukhopadhyay A, Nadzirin N, Nair S, Paysan-Lafosse T, Pravda L, Sehnal D, Salih O, Smart O, Tolchard J, Varadi M, Svobodova-Vařeková R, Zaki H, Kleywegt GJ, Velankar S. PDBe: improved findability of macromolecular structure data in the PDB. Nucleic Acids Res. 2020 Jan 8;48(D1):D335-D343. doi: 10.1093/nar/gkz990. PMID: 31691821; PMCID: PMC7145656.
PDBe-KB consortium. PDBe-KB: a community-driven resource for structural and functional annotations. Nucleic Acids Res. 2020 Jan 8;48(D1):D344-D353. doi: 10.1093/nar/gkz853. PMID: 31584092; PMCID: PMC6943075.
Nair S, Váradi M, Nadzirin N, Pravda L, Anyango S, Mir S, Berrisford J, Armstrong D, Gutmanas A, Velankar S. PDBe aggregated API: programmatic access to an integrative knowledge graph of molecular structure data. Bioinformatics. 2021 Nov 5;37(21):3950-3952. doi: 10.1093/bioinformatics/btab424. PMID: 34081107; PMCID: PMC8570819.
UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021 Jan 8;49(D1):D480-D489. doi: 10.1093/nar/gkaa1100. PMID: 33237286; PMCID: PMC7778908.
Russell RB, Barton GJ. Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins. 1992 Oct;14(2):309-23. doi: 10.1002/prot.340140216. PMID: 1409577.
Tien MZ, Meyer AG, Sydykova DK, Spielman SJ, Wilke CO. Maximum allowed solvent accessibilites of residues in proteins. PLoS One. 2013 Nov 21;8(11):e80635. doi: 10.1371/journal.pone.0080635. PMID: 24278298; PMCID: PMC3836772.
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983 Dec;22(12):2577-637. doi: 10.1002/bip.360221211. PMID: 6667333.
MacGowan SA, Madeira F, Britto-Borges T, Warowny M, Drozdetskiy A, Procter JB, Barton GJ. The Dundee Resource for Sequence Analysis and Structure Prediction. Protein Sci. 2020 Jan;29(1):277-297. doi: 10.1002/pro.3783. Epub 2019 Nov 28. PMID: 31710725; PMCID: PMC6933851.
Jubb HC, Higueruelo AP, Ochoa-Montaño B, Pitt WR, Ascher DB, Blundell TL. Arpeggio: A Web Server for Calculating and Visualising Interatomic Interactions in Protein Structures. J Mol Biol. 2017 Feb 3;429(3):365-371. doi: 10.1016/j.jmb.2016.12.004. Epub 2016 Dec 10. PMID: 27964945; PMCID: PMC5282402.
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat İ, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P; SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020 Mar;17(3):261-272. doi: 10.1038/s41592-019-0686-2. Epub 2020 Feb 3. Erratum in: Nat Methods. 2020 Mar;17(3):352. doi: 10.1038/s41592-020-0772-5. PMID: 32015543; PMCID: PMC7056644.
Pettersen EF, Goddard TD, Huang CC, Meng EC, Couch GS, Croll TI, Morris JH, Ferrin TE. UCSF ChimeraX: Structure visualization for researchers, educators, and developers. Protein Sci. 2021 Jan;30(1):70-82. doi: 10.1002/pro.3943. Epub 2020 Oct 22. PMID: 32881101; PMCID: PMC7737788.
Eddy SR. Profile hidden Markov models. Bioinformatics. 1998;14(9):755-63. doi: 10.1093/bioinformatics/14.9.755. PMID: 9918945.
Shenkin PS, Erman B, Mastrandrea LD. Information-theoretical entropy as a measure of sequence variability. Proteins. 1991; 11(4):297–313. Epub 1991/01/01. https://doi.org/10.1002/prot.340110408 PMID: 1758884.
Utgés JS, Tsenkov MI, Dietrich NJM, MacGowan SA, Barton GJ. Ankyrin repeats in context with human population variation. PLoS Comput Biol. 2021 Aug 24;17(8):e1009335. doi: 10.1371/journal.pcbi.1009335. PMID: 34428215; PMCID: PMC8415598.
MacGowan, SA, Madeira, F, Britto-Borges, T, Schmittner, MS, Cole, C, & Barton, GJ (2017). Human missense variation is constrained by domain structure and highlights functional and pathogenic residues. bioRxiv, 127050. https://doi.org/10.1101/127050.
MacGowan SA, Madeira F, Britto-Borges T, Barton GJ. A unified analysis of evolutionary and population constraint in protein domains highlights structural features and pathogenic sites. Commun Biol. 2024 Apr 11;7(1):447. doi: 10.1038/s42003-024-06117-5. PMID: 38605212; PMCID: PMC11009406.
Utgés JS, MacGowan SA, Ives CM, Barton GJ. Classification of likely functional class for ligand binding sites identified from fragment screening. Commun Biol. 2024 Mar 13;7(1):320. doi: 10.1038/s42003-024-05970-8. PMID: 38480979; PMCID: PMC10937669.
Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009 Jun 1;25(11):1422-3. doi: 10.1093/bioinformatics/btp163. Epub 2009 Mar 20. PMID: 19304878; PMCID: PMC2682512.
Chollet, F., & others. (2015). Keras. Retrieved from https://keras.io.
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, Del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE. Array programming with NumPy. Nature. 2020 Sep;585(7825):357-362. doi: 10.1038/s41586-020-2649-2. Epub 2020 Sep 16. PMID: 32939066; PMCID: PMC7759461.
McKinney, W. (2010). Data Structures for Statistical Computing in Python. In S. van der Walt & J. Millman (Eds.), Proceedings of the 9th Python in Science Conference (pp. 56–61). doi: 10.25080/Majora-92bf1922-00a
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85), 2825–2830. https://www.jmlr.org/papers/v12/pedregosa11a.html.
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Rafal Jozefowicz, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Mike Schuster, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Paper here Software available from tensorflow.org.
Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD, Brand H, Solomonson M, Watts NA, Rhodes D, Singer-Berk M, England EM, Seaby EG, Kosmicki JA, Walters RK, Tashman K, Farjoun Y, Banks E, Poterba T, Wang A, Seed C, Whiffin N, Chong JX, Samocha KE, Pierce-Hoffman E, Zappala Z, O'Donnell-Luria AH, Minikel EV, Weisburd B, Lek M, Ware JS, Vittal C, Armean IM, Bergelson L, Cibulskis K, Connolly KM, Covarrubias M, Donnelly S, Ferriera S, Gabriel S, Gentry J, Gupta N, Jeandet T, Kaplan D, Llanwarne C, Munshi R, Novod S, Petrillo N, Roazen D, Ruano-Rubio V, Saltzman A, Schleicher M, Soto J, Tibbetts K, Tolonen C, Wade G, Talkowski ME; Genome Aggregation Database Consortium; Neale BM, Daly MJ, MacArthur DG. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020 May;581(7809):434-443. doi: 10.1038/s41586-020-2308-7. Epub 2020 May 27. Erratum in: Nature. 2021 Feb;590(7846):E53. doi: 10.1038/s41586-020-03174-8. Erratum in: Nature. 2021 Sep;597(7874):E3-E4. doi: 10.1038/s41586-021-03758-y. PMID: 32461654; PMCID: PMC7334197.
McLaren W, Gil L, Hunt SE, Riat HS, Ritchie GR, Thormann A, Flicek P, Cunningham F. The Ensembl Variant Effect Predictor. Genome Biol. 2016 Jun 6;17(1):122. doi: 10.1186/s13059-016-0974-4. PMID: 27268795; PMCID: PMC4893825.

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
ENVS		ENVS
IN		IN
OTHER		OTHER
OUT		OUT
INSTALL.md		INSTALL.md
LICENSE		LICENSE
README.md		README.md
clean_pdb.py		clean_pdb.py
ligysis_config.txt		ligysis_config.txt
ligysis_custom.py		ligysis_custom.py
predict_rsa_labels.py		predict_rsa_labels.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LIGYSIS pipeline for user job submission

Pipeline methodology

Dependencies

Environments

Installation

Downloading SwissProt

Downloading gnomAD v2.1

Execution

Help and manual

Optional command line arguments

Citation

References

About

Releases 1

Packages

Languages

License

bartongroup/LIGYSIS-custom

Folders and files

Latest commit

History

Repository files navigation

LIGYSIS pipeline for user job submission

Pipeline methodology

Dependencies

Environments

Installation

Downloading SwissProt

Downloading gnomAD v2.1

Execution

Help and manual

Optional command line arguments

Citation

References

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages