Toolkit for bioinformatic calculations with peptides on Apache Spark

Peptide toolkit

This is a toolkit for bioinformatic calculations with peptides (short proteins/amino acid sequences) on Apache Spark.


  • Estimation of binding affinities (IC50 values) between peptides and human major histocompatibility complex (MHC) genes
  • Generation of all possible peptides of a given length
  • Inter-peptide distance calculation using PAM/BLOSUM substitution matrices
  • Clustering of peptides using the k-medoids algorithm
  • Clustering of a (generated) peptide set around a given set of cluster centers (real-world peptides)
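
To illustrate what "all possible peptides of a given length" means in practice, here is a minimal sketch in plain Java (no Spark). It treats a peptide as a number written in base 20 over the 20 standard amino acids, so any index range can be decoded independently. The class and method names (`PeptideGenerator`, `peptideAt`, `generateRange`) are hypothetical and are not taken from the toolkit; the toolkit's own generator may work differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the toolkit.
class PeptideGenerator {
    // One-letter codes of the 20 standard amino acids.
    static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY";

    // Number of distinct peptides of the given length: 20^length.
    static long count(int length) {
        long n = 1;
        for (int i = 0; i < length; i++) {
            n *= ALPHABET.length();
        }
        return n;
    }

    // Decodes an index in [0, 20^length) into a peptide by reading it as a base-20 number.
    static String peptideAt(long index, int length) {
        char[] out = new char[length];
        for (int pos = length - 1; pos >= 0; pos--) {
            out[pos] = ALPHABET.charAt((int) (index % ALPHABET.length()));
            index /= ALPHABET.length();
        }
        return new String(out);
    }

    // Generates the peptides with indices in [from, to).
    static List<String> generateRange(long from, long to, int length) {
        List<String> result = new ArrayList<>();
        for (long i = from; i < to; i++) {
            result.add(peptideAt(i, length));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(count(9));               // 512000000000
        System.out.println(generateRange(0, 3, 2)); // [AA, AC, AD]
    }
}
```

Because every index decodes on its own, disjoint index ranges could be handed to different workers; whether the toolkit partitions the work this way is an assumption.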


3rd-party code included or used

  • PSSMHCpan-1.0 : a toolkit for estimation of peptide binding affinities.
    It includes a number of pre-calculated weight matrices (one for each pair of HLA allele and peptide length) and a Perl script for binding affinity estimation.
    The original Perl code from this toolkit was rewritten in Java for Spark.
    One sample weight matrix (for the HLA-A0201 allele and 9-mer peptides) is included in this package; the rest should be copied into the working tree from the original PSSMHCpan package as needed.
    Available on
  • NW-align : Java implementation of Needleman-Wunsch global alignment, included (slightly modified) for benchmarking and comparison purposes only.
    Available on
  • BLOSUM and PAM amino acid substitution matrices : reference matrices from NCBI are hardcoded.
    Available on


In clustering applications, the similarity Sab between peptides A and B, using substitution matrix M, is calculated as follows:

Sab = 2*SCab/(SCaa + SCbb), where  
SCxy = sum over i of M(x[i], y[i]),  
M(k,n) is the substitution matrix value for amino acids k and n, and  
x[i] is the amino acid at position i of peptide x.  
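
The formula can be sketched in a few lines of Java. This is an illustration only: the class `PeptideSimilarity` is hypothetical, the two peptides are assumed to have the same length, and the tiny three-letter matrix below uses made-up scores rather than real PAM/BLOSUM values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the toolkit.
class PeptideSimilarity {
    // Toy symmetric substitution matrix for A, R and N only.
    // The scores are illustrative and NOT taken from a real PAM/BLOSUM matrix.
    static final Map<String, Integer> TOY_MATRIX = new HashMap<>();
    static {
        put('A', 'A', 4); put('R', 'R', 5); put('N', 'N', 6);
        put('A', 'R', -1); put('A', 'N', -2); put('R', 'N', 0);
    }

    // Stores the score for both orderings of the pair, keeping the matrix symmetric.
    static void put(char a, char b, int score) {
        TOY_MATRIX.put("" + a + b, score);
        TOY_MATRIX.put("" + b + a, score);
    }

    // SCxy = sum over i of M(x[i], y[i]); requires equal-length peptides.
    static int score(String x, String y) {
        if (x.length() != y.length()) {
            throw new IllegalArgumentException("Peptides must have equal length");
        }
        int sum = 0;
        for (int i = 0; i < x.length(); i++) {
            sum += TOY_MATRIX.get("" + x.charAt(i) + y.charAt(i));
        }
        return sum;
    }

    // Sab = 2*SCab / (SCaa + SCbb)
    static double similarity(String a, String b) {
        return 2.0 * score(a, b) / (score(a, a) + score(b, b));
    }

    public static void main(String[] args) {
        // SCab = 4 + 5 - 2 = 7, SCaa = 15, SCbb = 13, so Sab = 14 / 28
        System.out.println(similarity("ARN", "ARA")); // 0.5
    }
}
```

With this normalization a peptide compared with itself always scores 1, and the value can become negative when mismatches dominate.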


  • Netbeans 8.2 IDE
  • Java 1.8.0
  • Run under Windows 7 x64 and several Linux distributions (Ubuntu 14, CentOS 6 and AltLinux 7.0.5)


  • Start the Apache Spark master and worker process(es)
  • Submit the Spark task:

spark-submit --class <org.package.MainClassName> --master spark://master_hostname:7077 <package_filename.jar> [Config.xml]

  • For peptide generation, the class name is org.PSSMHC.PSSMHCSpark and the package filename is PSSMHCpan-1.0.jar
  • For clustering around given centers, the class name is org.PeptideClustering.AssignBindersToClusters and the package filename is peptide-clustering-1.0.jar
  • For k-medoids clustering, the class name is org.PeptideClustering.PeptideClusteringMain and the package filename is peptide-clustering-1.0.jar

The default config file is PeptideCfg.xml in the current directory.
The configuration options are explained in the XML comments of the sample config file \PSSMHCpan\src\main\resources\PeptideCfg.xml

An example command line for peptide generation with the default config file:

spark-submit --class org.PSSMHC.PSSMHCSpark --master spark://master_hostname:7077 PSSMHCpan-1.0.jar

Output :

Running any of the Spark applications produces

  • some logging on stdout

  • depending on the XML configuration, some datasets (Spark RDDs saved as text) in output-* folders in the Spark working directory.
    In Spark standalone mode, stdout is written to the command-line window where you ran spark-submit, and the Spark working directory is your current directory.

    • Peptide generation produces the folder output-pssmhc, containing a gzipped set of binder peptides formatted as:
      <peptide>, <binding affinity>
    • Clustering around given centers produces the folders
      • output-centers, containing the set of binder cluster centers (in the format described above)
      • output-clusters, containing the cluster elements, formatted as:
        (<center peptide>,[<element peptide 1>,<binding affinity 1>, ..., <element peptide N>,<binding affinity N>])
    • K-medoids clustering only logs the solution to stdout, in the following format:
Medoid : <center peptide> totalSim <sum of similarities between center and elements> avgSim <average similarity between center and elements>  
       Element : <element peptide 1> <binding affinity 1>  
       Element : <element peptide N> <binding affinity N>  
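
For downstream processing, the `<peptide>, <binding affinity>` lines written to output-pssmhc and output-centers can be read back with a small parser. The sketch below is an assumption-laden illustration: the class `BinderLine` is hypothetical, and it assumes the binding affinity is printed as a plain decimal number. The sample line in `main` is invented for the example, not real toolkit output.

```java
// Hypothetical helper, not part of the toolkit.
class BinderLine {
    final String peptide;
    final double affinity;

    BinderLine(String peptide, double affinity) {
        this.peptide = peptide;
        this.affinity = affinity;
    }

    // Parses one line of the form "<peptide>, <binding affinity>".
    static BinderLine parse(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException(
                "Expected '<peptide>, <binding affinity>' but got: " + line);
        }
        String peptide = line.substring(0, comma).trim();
        // Assumption: the affinity is a decimal number that Double.parseDouble accepts.
        double affinity = Double.parseDouble(line.substring(comma + 1).trim());
        return new BinderLine(peptide, affinity);
    }

    public static void main(String[] args) {
        // Invented sample line, for illustration only.
        BinderLine b = parse("ACDEFGHIK, 42.5");
        System.out.println(b.peptide + " -> " + b.affinity);
    }
}
```

Since the output files are gzipped, a reader would typically wrap the file stream in `java.util.zip.GZIPInputStream` before splitting it into lines.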

