KEGG-Decoder

Description

Designed to parse KEGG-KOALA outputs (including blastKOALA, ghostKOALA, KOFAMSCAN) to determine the completeness of various metabolic pathways.

  • This module was constructed using manually curated "canonical" pathways described as part of KEGG Pathway Maps. For information regarding which KOs are used to predict a metabolic pathway, see KOALA_definitions.txt.

  • If you are interested in a certain pathway and its genes are listed in KEGG, it is possible to add it to the file (with some Python scripting).

KEGG-Decoder Demonstration and Hands-on tutorial

YouTube video on how KEGG-Decoder interfaces with KEGG and how the heatmap is organized.

Hands-on tutorial Binder

Developed as part of the BVCN

Please Cite

If you find that using KEGG Decoder to process your data has been useful, please cite this manuscript. If you are using KEGG Decoder to make figures then definitely cite this manuscript!

Dependencies

Installation

We recommend installing KEGG-Decoder in its own virtual environment with Python 3.6 (e.g., created with conda or Python).

conda create -n keggdecoder python=3.6
conda activate keggdecoder
python3 -m pip install KEGGDecoder

The current pip install sets the various dependencies (matplotlib, seaborn, pandas, etc.) to versions known to work with this version of the script. This is partly to avoid a bug in matplotlib=3.0.4 that cuts off the top and bottom lines of the static image output.

Upgrade

conda activate keggdecoder
pip install --upgrade KEGGDecoder

Procedure

  • Start with a protein FASTA file (INPUT_PROTEIN.fasta). This file can contain multiple genomes combined. Be sure the headers in your submitted FASTA file group genomes together: KEGG-decoder.py groups sequences by the name in the FASTA header before the first underscore (_).
For example:
>NORP9_1
>NORP9_2
>NORP9_3
>NORP10_1
>NORP10_2
>NORP10_3
This produces two rows in the output list and heat map, one for genome NORP9 and one for genome NORP10.
  • Process protein sequences through KEGG-KOALA (GhostKoala, BlastKoala, or KOFAMSCAN) and download the tab-delimited KO assignment text file (KOALA_OUTPUT.txt)
  • The KOALA output text file should look like this:
NORP9_1	K00370
NORP9_2	K00371
  • Run KEGG-decoder
KEGG-decoder --input (-i) <KOALA_OUTPUT.txt> --output (-o) <FUNCTION_OUT.list> --vizoption (-v) <static/interactive/tanglegram>
  • FUNCTION_OUT.list is a TSV version of the heat map. The first row contains pathway/process names; subsequent rows contain the submitted groups/genomes and the fractional completeness of each pathway/process.

  • 'static' figure output is an SVG file function_heatmap.svg. Each distinct identifier before the underscore in the FASTA file will have a row

  • 'interactive' figure output is an HTML file function_heatmap.html. Each distinct identifier before the underscore in the FASTA file will have a row. The file can be loaded into a browser, where hovering over a cell with the mouse displays its value. Draw a box to zoom in on specific regions. Designed to allow easier parsing of larger sets of genomes.

  • 'tanglegram' -- For slightly more advanced analysis, KEGGDecoder can generate a tanglegram to compare the order of two trees: one generated by clustering the KEGG metabolic outputs, and a Newick format (presumably phylogenetic) tree provided by the user. At least 3 input genomes are required, but more are recommended. Genome names in the tree must match those in the KOALA output.
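
The grouping and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration and not KEGG-Decoder's actual code: the function names and the two-KO `toy_pathway` are hypothetical (the real pathway definitions are described in KOALA_definitions.txt), and the sketch assumes that genes without an annotation appear as lines with no KO column.

```python
from collections import defaultdict


def ko_sets_by_genome(lines):
    """Collect the KOs seen for each genome.

    The genome name is the text before the first underscore of the
    gene identifier, as described in the procedure above.
    """
    genomes = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        genome = fields[0].split("_")[0]
        kos = genomes[genome]  # registers the genome even if no KO follows
        if len(fields) > 1 and fields[1]:
            kos.add(fields[1])
    return dict(genomes)


def fraction_present(kos, required):
    """Fraction of the required KOs that were found in a genome."""
    return len(kos & required) / len(required)


# Toy KOALA output in the two-column, tab-delimited layout shown above.
koala_output = [
    "NORP9_1\tK00370",
    "NORP9_2\tK00371",
    "NORP10_1\tK00370",
    "NORP10_2",  # assumed form of a gene with no KO assigned
]

# Hypothetical pathway definition, for illustration only.
toy_pathway = {"K00370", "K00371"}

genomes = ko_sets_by_genome(koala_output)
for name in sorted(genomes):
    print(name, fraction_present(genomes[name], toy_pathway))
```

Each genome yields one value per pathway, which corresponds to one cell in a row of the heat map. Real pathways often allow alternative KOs for a step, which a simple set intersection like this does not capture.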

KEGG-Expander

UNDER CONSTRUCTION

While KEGG-decoder is now a module, KEGG-expander and Decoder_and_Expand will still require running the Python scripts. Using the FUNCTION_OUT.list file will allow you to still make the intended final figure.

Description

Designed to expand on the output from KEGG-Decoder. Within KEGG there is a lack of information regarding several processes of interest. To overcome these shortcomings, a small targeted HMM database was created (and will be updated) to fill in gaps of information.

HMM models are predominantly from the Pfam database, but when necessary are pulled from TIGRfam and SFam.

Dependencies

Additional Information

  • Details as to which HMM models and genes are in each described pathway or process can be found in the supporting document, Pfam_definitions.txt
  • In version 0.7, KEGG-Expander targets several transporter subunits to link with metal transporter columns in KEGG-Decoder. Removed the peptidase entries due to ineffective interpretation.
  • In version 0.6, KEGG-Expander targets: phototrophy via proteorhodopsin, (some) peptidases, alternative nitrogenases, ammonia transport, DMSP lyase, DMSP synthase, and ferrioxamine biosynthesis
  • Unfortunately, accuracy depends on the model used. A bit score cutoff of 75 (approximately an E-value <10E-20) does not always capture the best matches. For example, the rhodopsin model does not distinguish between proteorhodopsin and other light-driven rhodopsins (we use a tree to determine the proteorhodopsins). Several of the DMSP lyases match metalloproteases at low bit scores; in this instance the script has been modified to require a more stringent bit score (>500). And the TIGRfam models for the Fe-only and vanadium nitrogenases generally match the same protein.
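
The idea of a default cutoff with a stricter per-model override can be sketched as follows. This is a minimal illustration and not the code in KEGG-expander.py: the model names (`DMSP_lyase`, `rhodopsin`) and the function name are hypothetical, and treating a score exactly at the cutoff as passing is an assumption. It relies only on the standard hmmsearch `--tblout` layout, where comment lines start with `#`, the target name is the first column, the query (model) name is the third, and the full-sequence bit score is the sixth.

```python
DEFAULT_CUTOFF = 75.0

# Hypothetical model name; the override value comes from the text above.
STRICT_CUTOFFS = {"DMSP_lyase": 500.0}


def passing_hits(tblout_lines):
    """Yield (target, model, score) for hits that meet their model's cutoff."""
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and hmmsearch comment lines
        fields = line.split()
        target, model, score = fields[0], fields[2], float(fields[5])
        # Assumption: a score equal to the cutoff counts as passing.
        if score >= STRICT_CUTOFFS.get(model, DEFAULT_CUTOFF):
            yield target, model, score


# Toy rows in tblout column order (values are made up for illustration).
tblout = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "NORP9_12 - DMSP_lyase - 1e-30 110.2 0.1 1e-30 110.2 0.1 1.0 1 0 0 1 1 1 1 -",
    "NORP9_40 - DMSP_lyase - 1e-180 612.0 0.2 1e-180 612.0 0.2 1.0 1 0 0 1 1 1 1 -",
    "NORP10_3 - rhodopsin - 1e-25 88.5 0.0 1e-25 88.5 0.0 1.0 1 0 0 1 1 1 1 -",
]

hits = list(passing_hits(tblout))
for hit in hits:
    print(hit)
```

In this toy input the low-scoring DMSP lyase hit is dropped while the rhodopsin hit is kept at the default cutoff. As noted above, a passing rhodopsin hit still needs a tree to decide whether it is a proteorhodopsin.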

Procedure

  • Using a protein FASTA file with the same gene name set-up as described above - GENOMEID_Number - run a search against the custom HMM database
hmmsearch --tblout <NAME>_expanderv0.7.tbl -T 75 /path/to/BioData/KEGGDecoder/HMM_Models/expander_dbv0.7.hmm <INPUT_PROTEIN.fasta>
  • The HMM results table is used to construct the heatmap by running KEGG-expander.py
python KEGG-expander.py <NAME>_expanderv0.7.tbl <HMM_OUT.list>
  • The output list (HMM_OUT.list) is a text version of the heat map. The first row contains pathway/process names; subsequent rows contain the submitted groups/genomes and the fractional completeness of each pathway/process.

  • Figure is output as hmm_heatmap.svg. Each distinct identifier before the underscore in the FASTA file will have a row

Decoder and Expand

Description

Combines the KEGG and HMM heatmaps into a final heat map.

Procedure

  • Run the script Decode_and_Expand.py
python Decode_and_Expand.py <FUNCTION_OUT.list> <HMM_OUT.list>
  • Figure is output as decode-expand_heatmap.svg. Each distinct identifier before the underscore in the FASTA file will have a row
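
Conceptually, combining the two lists means placing the HMM columns next to the KEGG columns for each genome. The sketch below shows one way to do that with the standard library. It is not the implementation in Decode_and_Expand.py: the function names are hypothetical, the column names and values are toy data, and it assumes both files are tab-delimited with pathway names in the first row and the genome name in the first column. Keeping only genomes present in both files is also an assumption.

```python
import csv
import io


def read_matrix(handle):
    """Read a heat map list: header row of names, then one row per genome."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    columns = rows[0][1:]  # first header cell labels the genome column
    values = {row[0]: [float(v) for v in row[1:]] for row in rows[1:]}
    return columns, values


def combine(decoder, expander):
    """Join two matrices column-wise, keeping genomes found in both."""
    d_cols, d_rows = decoder
    e_cols, e_rows = expander
    shared = sorted(set(d_rows) & set(e_rows))
    return d_cols + e_cols, {g: d_rows[g] + e_rows[g] for g in shared}


# Toy stand-ins for FUNCTION_OUT.list and HMM_OUT.list.
function_out = io.StringIO(
    "Function\tglycolysis\tTCA Cycle\n"
    "NORP9\t1.0\t0.5\n"
    "NORP10\t0.9\t0.25\n"
)
hmm_out = io.StringIO(
    "Function\tproteorhodopsin\n"
    "NORP9\t1.0\n"
    "NORP10\t0.0\n"
)

columns, rows = combine(read_matrix(function_out), read_matrix(hmm_out))
print(columns)
for genome, values in rows.items():
    print(genome, values)
```

With real files, the `io.StringIO` objects would be replaced by `open("FUNCTION_OUT.list")` and `open("HMM_OUT.list")`.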

Change Log

V1.3

Added several pathways associated with carotenoid biosynthesis, including end-products: astaxanthin, nostoxanthin, zeaxanthin diglucoside, & myxoxanthophylls. Also added staphyloxanthin biosynthesis and the two pathways for terpenoid building blocks, the mevalonate pathway and the MEP/DOXP pathway.

The pathways were provided by Dr. Tania Kurbessoian

V1.2.1

Fixed typo in determining reverse TCA cycle as identified by KEGG-Decoder user Cheng. Issue #52

Added all-trans-8'-apo-beta-carotenal 15,15'-oxygenase which will cleave apo-carotenals to generate retinal. Suggested by Eric Webb. Upstream pathway unknown

V1.2

Added several new pathways including:

  • PET degradation
  • carbon storage, related to starch/glycogen & polyhydroxybutyrate
  • phosphate storage, related to the reversible polyphosphate reaction.

Part of summer research with Sheyla Aviles.

V1.1

Corrected typos identified by Chris Neely. Added more complete pathway components for amino acid biosynthesis identified by Dr. Eric Webb

  • phenylalanine added K01713 pheC; cyclohexadienyl dehydratase OR K05359 ADT; arogenate/prephenate dehydratase OR K04518 pheA2; prephenate dehydratase
  • tyrosine added K00220 tyrC; cyclohexadieny/prephenate dehydrogenase OR K24018; cyclohexadieny/prephenate dehydrogenase OR K15226 tyrAa; arogenate dehydrogenase

V1.0.10

Added the 20 amino acids. In most instances, only the last step in converting precursor to amino acid is assessed (except for valine, isoleucine, leucine, and tryptophan). The following amino acids share detection pathways:

  • serine & glycine
  • threonine & glycine
  • valine & isoleucine
  • phenylalanine & tyrosine
  • aspartate & glutamate

V1.0.6-1.0.8

  • Updates made as part of the Speeding Up Science Part 2 hackathon. Updates were made by Chris Neely, Jason Fell, and Marisa Lim.
  • Changes include reduction of white space in the static output, removal of a minimum requirement for the interactive output, and increased functioning of tanglegram output. Specifically, tanglegram now uses complete-linkage Euclidean distance to determine the clusters on the KEGG-Decoder tree. This provides the best resolution for visualizing possible groups with similar functional capacity.
  • In V1.0.8.2, a correction to determining the completeness of ubiquinol-cytochrome c reductase. Previously, the script only checked for the presence of K00411 and K00410. K00410 is a fusion of K00412 and K00413 only present in a subset of Proteobacteria. Identified by Grayson Chadwick.
  • In V1.0.8.1, a mismatch in the terms used to identify bifunctional chitinase/lysozyme would result in a 0 no matter whether K13381 was present. This has been corrected. Identified by Chris Neely.

V1.0.5

Various upgrades to the tanglegram visualization and enhanced naming efficiency.

V1.0.2

Fixed an issue with tanglegram support; this should also resolve the issue with the pandas dependency.

Adds Na+-transporting NADH:ubiquinone oxidoreductase and several metal transporters. KEGG-Decoder added metal transporters for cobalt (CbiMQ), cobalt (CbtA), cobalt (CorA), nickel ABC-type transporter substrate-binding subunit (NirA), copper (copA), ferrous iron (FeoB), ferric iron ABC-type transporter substrate-binding subunit (AfuA), and Fe/Mn transporter (MntH). Additional metal transporter components were added through KEGG-expander: cobalt transporter (CbtB), copper-binding HMA (heavy-metal-associated) protein, and Fe, Zn, Mn permease (ZupT).

Removed 'peptidases' from KEGG-expander due to inability to discern intracellular from extracellular activity. Recommend using MetaSanity to identify extracellular peptidases. Updated KEGG-expander HMM set to V0.7.

V1.0

KEGGDecoder can now be installed via pip install. KEGGDecoder now offers 2 visualization outputs: the classic 'static' version and the new 'interactive' version, which opens a heatmap that you can zoom and interact with.

Contributions to V1.0 occurred as part of the Moore Foundation funded 'Speeding Up Science' hackathon, with contributions provided by: Taylor Reiter (UCDavis), Roth Conrad (GeorgiaTech), Jay Osvatic (UniVienna), Luiz Irber (UCDavis)

V0.8

Add elements regarding arsenic reduction

V0.7

Clarifies elements of methane oxidation and adds additional methanol/alcohol dehydrogenase to KEGG function search. Adds the serine pathway for formaldehyde assimilation

V0.6

Adds Bacterial Secretion Systems as described by KEGG, covering Type I, II, III, IV, Vabc, VI, Sec-SRP and Twin Arginine Targeting systems

V0.5

Adds parameters to force labels to be printed on heatmap. Includes functions for sulfolipid biosynthesis (key gene sqdB) and C-P lyase

V0.4

Adds sections that more accurately represent anoxygenic photosynthesis (type-II and type-I reaction centers), adds NiFe hydrogenase Hyd-1 hyaABC, and corrects a typo that led to a missed assignment to hydrogen:quinone oxidoreductase

V0.3

Latest version adds checks for: retinal biosynthesis, sulfite dehydrogenase (quinone), hydrazine dehydrogenase, hydrazine synthase, DMSP/DMS/DMSO cycling, cobalamin biosynthesis, competence-related DNA transport, anaplerotic reactions