
Genome Quality Commands

Rookie_Freeman edited this page Jun 9, 2022 · 52 revisions

tree

Place bins in the reference genome tree.

> checkm tree <bin folder> <output folder>

Required parameters:

  • bin_folder: folder containing bins (FASTA format)
  • out_folder: folder to write output files

Optional parameters:

  • -r, --reduced_tree: use reduced tree for determining lineage of each bin (suitable for 16 GB machines)
  • --ali: generate HMMER alignment file for each bin in ./<output folder>/bins/<bin id>/hmmer.tree.ali.txt
  • --nt: generate nucleotide gene sequences for each bin in ./<output folder>/bins/<bin id>/genes.fna
  • -g, --genes: bins contain genes as amino acids instead of nucleotide contigs
  • -x, --extension: extension of bins (other files in folder are ignored)
  • -t, --threads: number of threads
  • --pplacer_threads: number of threads used by pplacer (memory usage increases linearly with additional threads)
  • -q, --quiet: suppress console output

Note: CheckM uses the following heuristic to choose the translation table passed to Prodigal: table 11 is used unless the coding density under table 4 is 5% higher than under table 11 and the coding density under table 4 is >70%, in which case table 4 is used. Distinguishing between tables 4 and 25 is challenging, so CheckM does not attempt it. If you know the correct translation table for your genomes, call genes outside of CheckM and provide CheckM with the protein sequences (see --genes).
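The translation table heuristic in the note above can be sketched in a few lines of Python. This is an illustration only, not CheckM's own code: the function name is hypothetical, coding densities are assumed to be fractions between 0 and 1, and "5% higher" is assumed to mean an absolute difference of 0.05 (5 percentage points).

```python
def choose_translation_table(density_t4, density_t11):
    """Pick translation table 4 or 11 from two coding densities.

    Hypothetical helper. ASSUMPTION: densities are fractions in [0, 1]
    and "5% higher" is an absolute gap of 0.05, not a relative one.
    """
    gap_is_large = (density_t4 - density_t11) > 0.05
    t4_is_dense = density_t4 > 0.70
    if gap_is_large and t4_is_dense:
        return 4
    # Default: table 11 (tables 4 and 25 are not distinguished).
    return 11


# Table 4 wins: gap exceeds 0.05 and table 4 density exceeds 0.70.
print(choose_translation_table(0.92, 0.80))
# Table 11: the gap between the two densities is too small.
print(choose_translation_table(0.88, 0.86))
# Table 11: the gap is large, but table 4 density is not above 0.70.
print(choose_translation_table(0.65, 0.50))
```

Both conditions must hold for table 4 to be selected, so a genome that is poorly coded under both tables stays on table 11.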

tree_qa

Assess phylogenetic markers found in each bin.

> checkm tree_qa <tree folder>

Required parameters:

  • tree_folder: output folder specified during tree command

Optional parameters:

  • -o, --out_format: specifies desired output (1-5)
  1. brief summary of genome tree placement indicating the number of unique phylogenetically informative markers found, the number of markers found multiple times, and a taxon string indicating the placement of each bin within the genome tree
  2. detailed summary of genome tree placement giving a more detailed indication of where each bin is within the genome tree, general characteristics about each bin (e.g., GC, genome size, coding density), and general characteristics about all reference genomes descendant from the parental node of each bin (e.g., mean and standard deviation of GC)
  3. genome tree in Newick format decorated with IMG genome ids which can be used to examine the phylogenetic neighbours of each bin
  4. genome tree in Newick format decorated with taxonomy strings which can be used to examine the phylogenetic neighbours of each bin
  5. multiple sequence alignment of reference genomes and bins which can be used to infer a de novo genome tree
  • -f, --file: print results to file instead of the console
  • --tab_table: for tabular outputs, print a tab-separated values table instead of a table formatted for console output
  • -q, --quiet: suppress console output

lineage_set

Infer lineage-specific marker sets for each bin.

> checkm lineage_set <tree folder> <marker file>

Required parameters:

  • tree_folder: folder specified during tree command
  • marker_file: output file describing marker set for each bin

Optional parameters:

  • -u, --unique: minimum number of unique phylogenetic markers required to use lineage-specific marker set, otherwise a domain-level marker set is used
  • -m, --multi: maximum number of multi-copy phylogenetic markers before defaulting to domain-level marker set
  • --force_domain: use domain-level marker sets for all bins
  • --no_refinement: do not perform lineage-specific marker set refinement
  • -q, --quiet: suppress console output
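How the -u, -m, and --force_domain options interact can be sketched as a small decision function. This is a reading of the option descriptions above, not CheckM's implementation: the function and argument names are hypothetical, no default thresholds are assumed, and the handling of the exact boundary values (a bin with exactly the minimum or maximum count keeps the lineage-specific set) is an assumption.

```python
def select_marker_set(unique_markers, multi_copy_markers,
                      min_unique, max_multi, force_domain=False):
    """Return "lineage" or "domain" for one bin.

    Hypothetical helper mirroring -u/--unique, -m/--multi and
    --force_domain. ASSUMPTION: both thresholds are inclusive.
    """
    if force_domain:
        return "domain"
    if unique_markers < min_unique:
        # Too few unique phylogenetic markers for a reliable placement.
        return "domain"
    if multi_copy_markers > max_multi:
        # Too many multi-copy phylogenetic markers.
        return "domain"
    return "lineage"


# Illustrative thresholds only; they are passed explicitly on purpose.
print(select_marker_set(25, 2, min_unique=10, max_multi=10))
print(select_marker_set(4, 0, min_unique=10, max_multi=10))
print(select_marker_set(25, 14, min_unique=10, max_multi=10))
```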

taxon_list

List available taxonomic-specific marker sets.

> checkm taxon_list 

Optional parameters:

  • --rank: restrict list to specified taxonomic rank (e.g., life, domain, phylum, ...)

taxon_set

Generate taxonomic-specific marker set.

> checkm taxon_set <rank> <taxon> <marker file>

Required parameters:

  • rank: taxonomic rank of desired taxonomic-specific marker set (e.g., domain)
  • taxon: taxon of interest (e.g., Bacteria)
  • marker_file: output file describing taxonomic-specific marker set

Optional parameters:

  • -q, --quiet: suppress console output

analyze

Identify marker genes in bins.

> checkm analyze <marker file> <bin folder> <output folder>

Required parameters:

  • marker_file: markers for assessing bins (marker set or HMM file)
  • bin_folder: folder containing bins (FASTA format)
  • out_folder: folder to write output files

Optional parameters:

  • --ali: generate HMMER alignment file for each bin in ./<output folder>/bins/<bin id>/hmmer.analyze.ali.txt
  • --nt: generate nucleotide gene sequences for each bin in ./<output folder>/bins/<bin id>/genes.fna
  • -g, --genes: bins contain genes as amino acids instead of nucleotide contigs
  • -x, --extension: extension of bins (other files in folder are ignored)
  • -t, --threads: number of threads
  • -q, --quiet: suppress console output

Outputs:

  • called gene files are placed in ./<output folder>/bins/<bin id>/genes.*
  • HMMER result files are placed in ./<output folder>/bins/<bin id>/hmmer.*
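The positional arguments of tree, lineage_set, analyze, and qa fit together: the folder written by tree is read by lineage_set, and the marker file written by lineage_set is read by analyze and by qa (documented in the next section). The sketch below only builds the command lines from the synopses on this page and does not run anything. The function name is hypothetical, and reusing one output folder for both tree and analyze is an assumption rather than a requirement stated here.

```python
import shlex


def lineage_commands(bin_folder, out_folder, marker_file, threads=None):
    """Build argument lists for tree -> lineage_set -> analyze -> qa.

    Hypothetical helper. ASSUMPTION: tree and analyze share out_folder.
    Only flags listed on this page are used (-t where documented).
    """
    t = ["-t", str(threads)] if threads is not None else []
    return [
        ["checkm", "tree", *t, bin_folder, out_folder],
        # lineage_set does not list -t among its options, so it is omitted.
        ["checkm", "lineage_set", out_folder, marker_file],
        ["checkm", "analyze", *t, marker_file, bin_folder, out_folder],
        ["checkm", "qa", *t, marker_file, out_folder],
    ]


# Paths are placeholders for illustration.
for cmd in lineage_commands("bins", "checkm_out", "lineage.ms", threads=4):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order if the commands should actually be executed.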

qa

Assess bins for contamination and completeness.

> checkm qa <marker file> <analyze_folder>

Required parameters:

  • marker_file: marker file specified during analyze command
  • analyze_folder: folder specified during analyze command

Optional parameters:

  • -o, --out_format: specifies desired output (1-9)
  1. summary of bin completeness, contamination, and strain heterogeneity
    • Bin Id: bin identifier derived from input FASTA file
    • Marker lineage: indicates lineage used for inferring marker set (a precise indication of where a bin was placed in CheckM's reference tree can be obtained with the tree_qa command)
    • No. genomes: number of reference genomes used to infer marker set
    • No. markers: number of inferred marker genes
    • No. marker sets: number of inferred co-located marker sets
    • 0-5+: number of times each marker gene is identified
    • Completeness: estimated completeness
    • Contamination: estimated contamination
    • Strain heterogeneity: estimated strain heterogeneity
  2. extended summary of bin quality (includes GC, genome size, coding density, ...)
  3. summary of bin quality for increasingly basal lineage-specific marker sets
    • Node Id: unique id of internal node in genome tree from which lineage-specific markers were inferred
  4. list of marker genes for each bin along with the number of times each marker was identified
    • Node Id: unique id of internal node in genome tree from which lineage-specific markers were inferred
    • Marker lineage: indicates lineage used for inferring marker set
    • Useful for identifying lineage-specific gene loss or duplication
  5. list of bin id, marker gene id, and called gene id for each identified marker gene
  6. list of marker genes present multiple times in a bin
  7. list of marker genes present multiple times on the same scaffold
    • Useful for identifying true gene duplication events, gene calling errors, or assembly errors. See note below.
  8. list indicating the position of each marker gene within a bin
  9. FASTA file of marker genes identified in each bin
  • --exclude_markers: file specifying markers to exclude from marker sets. Each marker to exclude should be listed on a separate line of the file.
  • --individual_markers: treat markers as independent (i.e., ignore co-located set structure)
  • --skip_orf_correction: skip identification of ORF calling errors affecting marker genes
  • --aai_strain: amino acid identity (AAI) threshold used to identify strain heterogeneity
  • -a, --alignment_file: produce file showing alignment of multi-copy genes and their AAI, which can be used to further assess strain heterogeneity
  • --ignore_thresholds: ignore model-specific score thresholds
  • -e, --e_value: e-value cut-off (not used if model-specific thresholds are specified)
  • -l, --length: percent overlap between target and query (not used if model-specific thresholds are specified)
  • -c, --coverage_file: file containing coverage of each sequence; coverage information is appended to table type 2 when this file is provided (see coverage command)
  • -f, --file: print results to file instead of the console
  • --tab_table: for tabular outputs, print a tab-separated values table instead of a table formatted for console output
  • -t, --threads: number of threads
  • -q, --quiet: suppress console output
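A table written with --tab_table and -f can be post-processed with any TSV reader. The sketch below filters bins by completeness and contamination using Python's csv module. It assumes the header row uses the column names listed for output format 1 (Bin Id, Completeness, Contamination); check the header of your own file first. The function name, the thresholds, and the sample rows are made up for illustration and are not CheckM output.

```python
import csv
import io


def filter_bins(tsv_text, min_completeness, max_contamination):
    """Return ids of bins meeting both thresholds.

    Hypothetical helper. ASSUMPTION: the header contains the columns
    "Bin Id", "Completeness" and "Contamination".
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        complete = float(row["Completeness"])
        contaminated = float(row["Contamination"])
        if complete >= min_completeness and contaminated <= max_contamination:
            kept.append(row["Bin Id"])
    return kept


# Made-up rows, reduced to the three columns used above.
SAMPLE = (
    "Bin Id\tCompleteness\tContamination\n"
    "bin_1\t97.5\t1.2\n"
    "bin_2\t62.0\t0.5\n"
    "bin_3\t91.0\t12.4\n"
)

print(filter_bins(SAMPLE, min_completeness=90.0, max_contamination=5.0))
```

To read a real file, pass its contents instead of the sample, for example `filter_bins(open(path).read(), 90.0, 5.0)`.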

Note: Adjacent called genes matching the same marker gene may indicate a true duplication event, a gene calling error, or an assembly error. If adjacent genes hit distinct regions of the same marker gene HMM, CheckM assumes a gene calling error has occurred and concatenates the two genes. When this occurs, CheckM joins the gene ids of the two genes with a pair of ampersands (&&).
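When parsing qa output downstream, gene ids joined in this way can be split back into their parts. A minimal sketch, assuming the separator is exactly `&&` with no surrounding whitespace; the function name and the gene ids in the example are hypothetical.

```python
def split_gene_id(gene_id):
    """Split a possibly concatenated gene id on the '&&' separator.

    Hypothetical helper. An id without '&&' comes back as a
    one-element list.
    """
    return gene_id.split("&&")


# Made-up ids for illustration.
print(split_gene_id("contig_12_5&&contig_12_6"))
print(split_gene_id("contig_3_41"))
```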
