Kyle I S Harrington1,*, James Chin 2
1 Brandeis, Tufts, Harvard Med.
2 Brandeis, Dartmouth.
* Corresponding author: [email protected]
To date, most efforts to predict B-cell epitopes have been based upon linear sequence data, which precludes many structures with intertwined surfaces. Existing work on the prediction of discontinuous B-cell epitopes has revealed a number of computable features that help predict epitope binding affinity. We extend this research by discovering complex high-level features based upon physical characteristics.
B-cell epitopes are regions of molecules recognized by antibodies of the immune system. Characterization of B-cell epitopes is a crucial step for understanding the immunological basis of bio-recognition. Knowledge of the locations of B-cell epitopes can aid the development of peptide vaccines or be used to induce the production of antibodies for diagnostic and therapeutic applications.
Existing methods to predict the location of continuous epitopes and patterns in proteins are based on the reported correlation between physicochemical properties of amino acids and the locations of linear B-cell epitopes within protein sequences. Physicochemical properties such as hydrophilicity, flexibility, turns, and solvent accessibility were used in BEPITOPE (Odorico and Pellequer, 2003). According to a recent exhaustive assessment of 484 amino acid propensity scales, combined with a range of profile parameters, by Blythe and Flower (2005), B-cell epitope prediction based on amino acid sequence information performed only marginally better than random (EL-Manzalawy et al, 2008).
A list of experimentally determined protein antigen-antibody structures and corresponding epitope binding information was obtained from IEDB (Vita et al, 2010). The list was filtered to contain only structures determined to a resolution of less than 3 Angstroms, with protein antigens of more than 25 amino acids. Coordinate files corresponding to the filtered list were downloaded from the Protein Data Bank (PDB, http://www.rcsb.org/pdb). We estimate that the final data set will contain around 100 antibody-antigen pairs. Epitope residues in the data set were defined as antigen amino acids having atoms within 4 Angstroms of antibody atoms.
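The epitope definition above can be illustrated with a minimal sketch. It is not the pipeline used for the data set: the function name `epitope_residues`, the input layout (residue id plus coordinate tuples), the inclusive comparison at exactly 4 Angstroms, and the toy coordinates are all assumptions made for illustration.

```python
import math

def epitope_residues(antigen_atoms, antibody_atoms, cutoff=4.0):
    """Return sorted antigen residue ids having any atom within `cutoff`
    Angstroms of any antibody atom.

    antigen_atoms:  list of (residue_id, (x, y, z))  -- hypothetical layout
    antibody_atoms: list of (x, y, z)
    """
    hits = set()
    for res_id, coord in antigen_atoms:
        if res_id in hits:
            continue  # residue already flagged by another of its atoms
        for ab_coord in antibody_atoms:
            if math.dist(coord, ab_coord) <= cutoff:
                hits.add(res_id)
                break
    return sorted(hits)

# Toy coordinates (Angstroms), invented for illustration only.
antigen = [
    (10, (0.0, 0.0, 0.0)),
    (10, (1.5, 0.0, 0.0)),
    (11, (8.0, 0.0, 0.0)),
    (12, (0.0, 3.0, 0.0)),
]
antibody = [(0.0, 0.0, 3.5), (20.0, 20.0, 20.0)]

print(epitope_residues(antigen, antibody))  # [10]
```

The brute-force loop is quadratic in the number of atoms; for full crystal structures a spatial index (grid or k-d tree) would likely be preferable.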
We compute a set of features based upon the crystal structures obtained from the RCSB PDB database: number of neighbors, central residue and patch side chain pKa, central residue and patch hydrophilicity, ratio of amino acid composition, and half sphere exposure. Features are computed with BioJava (Prlic et al, 2012) and the corresponding code is available as open source (see GitHub).
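As a rough illustration of two of these basis features, the sketch below counts neighbors and computes a half sphere exposure from C-alpha and C-beta coordinates. It is a plain-Python approximation, not the BioJava code referred to above; the 13 Angstrom radius, the use of the C-alpha to C-beta vector to split the sphere, and the function name are assumptions.

```python
import math

def half_sphere_exposure(ca, cb, radius=13.0):
    """For each residue, count other C-alpha atoms within `radius` that lie
    in the hemisphere pointing along CA->CB ("up") or away from it ("down").

    ca, cb: lists of (x, y, z), one entry per residue (hypothetical layout).
    Returns a list of (up, down); up + down is the neighbor count.
    """
    result = []
    for i, (a, b) in enumerate(zip(ca, cb)):
        side = tuple(bj - aj for aj, bj in zip(a, b))  # CA -> CB direction
        up = down = 0
        for j, other in enumerate(ca):
            if j == i or math.dist(a, other) > radius:
                continue
            to_other = tuple(oj - aj for aj, oj in zip(a, other))
            # Sign of the dot product decides which half sphere it falls in.
            if sum(s * t for s, t in zip(side, to_other)) >= 0:
                up += 1
            else:
                down += 1
        result.append((up, down))
    return result

# Toy coordinates (Angstroms), invented for illustration only.
ca = [(0.0, 0.0, 0.0), (0.0, 0.0, 5.0), (0.0, 0.0, -5.0), (30.0, 0.0, 0.0)]
cb = [(0.0, 0.0, 1.5), (0.0, 0.0, 6.5), (0.0, 0.0, -6.5), (31.5, 0.0, 0.0)]

hse = half_sphere_exposure(ca, cb)
neighbors = [up + down for up, down in hse]
print(hse)        # [(1, 1), (0, 2), (0, 2), (0, 0)]
print(neighbors)  # [2, 2, 2, 0]
```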
We extend the intuition behind support vector machines (SVMs), which map data to a higher-dimensional space where linear separability may be more easily achieved. We introduce complex features: functions expressed using a set of basis features, mathematical operators (i.e. logical, arithmetic, and trigonometric operations), and conditional expressions. Complex features are encoded in the Push programming language (Spector et al, 2005). Push is an expressive stack-based programming language designed for program synthesis. Complex features are constructed by randomly generating programs with the proposed bases.
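The idea of a randomly generated complex feature can be sketched with a toy stack machine. This is a deliberately simplified stand-in and not Push itself: it has a single float stack, a handful of operators, and one conditional, and every token name (including the basis feature names) is hypothetical. Like Push, it skips an instruction when the stack does not hold enough arguments.

```python
import math
import random

# Hypothetical basis feature names, for illustration only.
BASIS = ["neighbors", "pka", "hydrophilicity", "hse_up"]

def protected_div(a, b):
    """Division that returns 1.0 instead of failing on a near-zero divisor."""
    return a / b if abs(b) > 1e-9 else 1.0

BINARY = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": protected_div,
}
UNARY = {"sin": math.sin, "cos": math.cos, "neg": lambda x: -x}
CONDITIONAL = "if>0"

def random_program(rng, length=8):
    """Draw a flat token sequence uniformly from bases and operators."""
    tokens = BASIS + list(BINARY) + list(UNARY) + [CONDITIONAL]
    return [rng.choice(tokens) for _ in range(length)]

def evaluate(program, features):
    """Run a program on one residue's basis features; return the stack top."""
    stack = []
    for tok in program:
        if tok in features:
            stack.append(float(features[tok]))
        elif tok in UNARY and len(stack) >= 1:
            stack.append(UNARY[tok](stack.pop()))
        elif tok in BINARY and len(stack) >= 2:
            b = stack.pop()
            a = stack.pop()
            stack.append(BINARY[tok](a, b))
        elif tok == CONDITIONAL and len(stack) >= 3:
            cond = stack.pop()
            else_value = stack.pop()
            then_value = stack.pop()
            stack.append(then_value if cond > 0 else else_value)
        # Any other case: not enough arguments, so the token is a no-op.
    return stack[-1] if stack else 0.0

residue = {"neighbors": 3, "pka": 4.5, "hydrophilicity": -1.0, "hse_up": 2}
print(evaluate(["neighbors", "pka", "+"], residue))  # 7.5

program = random_program(random.Random(0))
value = evaluate(program, residue)  # one candidate complex feature value
```

Each random program defines one candidate feature column: evaluating it on every residue yields values that can be passed to feature selection alongside the basis features.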
Wrapper methods for feature selection (both forward selection and backward elimination) are applied to datasets constructed from basis features and complex features. Stratified cross-validation is used to construct 10-fold training/testing datasets because of the small ratio of epitope binding sites to non-binding sites. The improvement is assessed using the ROC AUC value and a p-value computed with a two-tailed t-test.
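The evaluation steps can be sketched in a few lines of standard-library Python. This is an illustrative outline under stated assumptions, not the procedure's actual code: the function names are hypothetical, the fold assignment is a simple round-robin per class, the AUC is computed by direct pairwise comparison, and the wrapper takes an arbitrary `score(subset)` callable where a real run would train and test a classifier across the folds. Backward elimination and the t-test are omitted.

```python
import random

def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each fold keeps roughly the
    same class ratio as the full data set."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal indices out round-robin
    return folds

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def forward_selection(candidates, score):
    """Greedy wrapper: repeatedly add the feature that most improves
    score(subset); stop when no candidate improves it."""
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        trial_score, trial_feature = max(
            (score(selected + [f]), f) for f in remaining)
        if trial_score <= best:
            break
        selected.append(trial_feature)
        remaining.remove(trial_feature)
        best = trial_score
    return selected, best

# Toy data: 10 binding sites among 100 residues, invented for illustration.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels, k=10)

auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)  # 0.75

# Toy scorer: rewards features "a" and "b", penalizes subset size.
toy_score = lambda subset: len(set(subset) & {"a", "b"}) - 0.1 * len(subset)
selected, best = forward_selection(["a", "b", "c"], toy_score)
```

With per-fold AUC values in hand for the basis-only and basis-plus-complex feature sets, the two lists could then be compared with a two-tailed t-test.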
- Odorico, M. and Pellequer, J.L., 2003. BEPITOPE: predicting the location of continuous epitopes and patterns in proteins. Journal of Molecular Recognition, 16(1), pp.20-22.
- EL-Manzalawy, Y., Dobbs, D. and Honavar, V., 2008. Predicting linear B-cell epitopes using string kernels. Journal of Molecular Recognition, 21(4), pp.243-255.
- Vita, R., Zarebski, L., Greenbaum, J.A., Emami, H., Hoof, I., Salimi, N., Damle, R., Sette, A. and Peters, B., 2010. The immune epitope database 2.0. Nucleic Acids Research, 38(suppl 1), pp.D854-D862.
- Prlić, A., Yates, A., Bliven, S.E., Rose, P.W., Jacobsen, J., Troshin, P.V., Chapman, M., Gao, J., Koh, C.H., Foisy, S. and Holland, R., 2012. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28(20), pp.2693-2695.
- Spector, L., Klein, J. and Keijzer, M., 2005, June. The Push3 execution stack and the evolution of control. In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation (pp. 1689-1696). ACM.