This package generates a candidate protein library in two phases:
-
Parsing a FASTG file to create graph traversals of longer stretches of DNA
- FASTG is parsed into a directed graph. A depth-first search is made on all connecting edges. The DFS traversal is then used to concatenate all DNA sequences in the path.
- DNA sequences are translated to mRNA and split into candidate proteins at the stop codon. Each DFS traversal can, and will, produce a set of candidate protein sequences.
- Protein sequences are filtered on length and amino acid redundancy.
- Protein sequences are cleaved into peptide sequences.
- DFS traversals, proteins and peptides are stored in a SQLite database. The linking relationship between all three is maintained in the DB.
- A FASTA file of peptides is produced for the user. This FASTA file is to be used in a search against MSMS data.
-
Using verified peptides as a filter to produce a final candidate peptide library
- The user will invoke the code with
- DB
- list of peptide sequences or peptide FASTA
- It is expected that the submitted peptides have been verified against MSMS and they represent found and identified peptide sequences
- The verified peptides are used to filter proteins from the database, these proteins become the final library.
- The verified peptides are used to score the proteins for
- coverage
- percent of verified v. total peptide association
- Final user output
- SQLite database
- Protein score text file, comma delimited
- Filtered protein FASTA file
- The user will invoke the code with