Genome Quality Commands
Place bins in the reference genome tree.
> checkm tree <bin folder> <output folder>
Required parameters:
- bin_folder: folder containing bins (fasta format)
- out_folder: folder to write output files
Optional parameters:
- --ali: generate HMMER alignment file for each bin in ./<output folder>/bins/<bin id>/hmmer.tree.ali.txt
- --nt: generate nucleotide gene sequences for each bin in ./<output folder>/bins/<bin id>/prodigal.fna
- -x, --extension: extension of bins (other files in folder are ignored)
- -t, --threads: number of threads
- -q, --quiet: suppress console output
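As a rough illustration of how the parameters above fit together, the following Python sketch assembles a checkm tree command line from the documented flags. The helper name build_tree_command, the folder names, and the extension value are hypothetical, and placing the options before the positional arguments is an assumption about argument order.

```python
import shlex


def build_tree_command(bin_folder, out_folder, extension=None, threads=None,
                       ali=False, nt=False, quiet=False):
    """Return an argument list for `checkm tree` using only the flags listed above."""
    cmd = ["checkm", "tree"]
    if ali:
        cmd.append("--ali")           # write HMMER alignment files
    if nt:
        cmd.append("--nt")            # write nucleotide gene sequences
    if extension is not None:
        cmd += ["-x", extension]      # only files with this extension are treated as bins
    if threads is not None:
        cmd += ["-t", str(threads)]
    if quiet:
        cmd.append("-q")
    cmd += [bin_folder, out_folder]   # required positional parameters
    return cmd


# Hypothetical folder names and extension, for illustration only.
tree_cmd = build_tree_command("./bins", "./tree_out", extension="fa", threads=8)
print(shlex.join(tree_cmd))
```

The resulting list can be handed to subprocess.run(tree_cmd, check=True) on a machine where CheckM is installed.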
Assess phylogenetic markers found in each bin.
> checkm tree_qa <tree folder>
Required parameters:
- tree_folder: output folder specified during tree command
Optional parameters:
- -o, --out_format: specifies desired output (1-5)
  1. brief summary of genome tree placement indicating the number of unique phylogenetically informative markers found, the number of markers found multiple times, and a taxon string indicating the placement of each bin within the genome tree
  2. detailed summary of genome tree placement giving a more detailed indication of where each bin is within the genome tree, general characteristics about each bin (e.g., GC, genome size, coding density), and general characteristics about reference genomes descendant from each bin (e.g., mean and standard deviation of GC)
  3. genome tree in Newick format decorated with IMG genome ids which can be used to examine the phylogenetic neighbours of each bin
  4. genome tree in Newick format decorated with taxonomy strings which can be used to examine the phylogenetic neighbours of each bin
  5. multiple sequence alignment of reference genomes and bins which can be used to infer a de novo genome tree
- -f, --file: print results to file instead of the console
- --tab_table: for tabular outputs, print a tab-separated values table instead of a table formatted for console output
- -q, --quiet: suppress console output
Infer lineage-specific marker sets for each bin.
> checkm lineage_set <tree folder> <marker file>
Required parameters:
- tree_folder: folder specified during tree command
- marker_file: output file describing marker set for each bin
Optional parameters:
- -u, --unique: minimum number of unique phylogenetic markers required to use lineage-specific marker set, otherwise a domain-level marker set is used
- -m, --multi: maximum number of multi-copy phylogenetic markers before defaulting to domain-level marker set
- --force_domain: use domain-level marker sets for all bins
- --refinement: [Experimental] perform lineage-specific marker set refinement (not currently recommended)
- -r, --num_genomes_refine: [Experimental] minimum reference genomes required to refine marker set
- -q, --quiet: suppress console output
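The interaction between --unique, --multi, and --force_domain can be read as a simple decision rule. The sketch below is one interpretation of the option descriptions above, not CheckM's actual implementation: the function name choose_marker_set is hypothetical, the threshold values in the examples are made up, and the handling of values exactly equal to a threshold is an assumption.

```python
def choose_marker_set(num_unique, num_multi, min_unique, max_multi, force_domain=False):
    """Return "lineage" or "domain" following the option descriptions above.

    num_unique: unique phylogenetic markers found in the bin
    num_multi:  phylogenetic markers found multiple times in the bin
    min_unique: value given to -u/--unique
    max_multi:  value given to -m/--multi
    """
    if force_domain:
        # --force_domain: domain-level marker sets for all bins
        return "domain"
    if num_unique < min_unique:
        # too few unique markers to trust a lineage-specific set
        return "domain"
    if num_multi > max_multi:
        # too many multi-copy markers (boundary handling is an assumption)
        return "domain"
    return "lineage"


# Hypothetical thresholds and marker counts, for illustration only.
print(choose_marker_set(num_unique=30, num_multi=1, min_unique=10, max_multi=10))
print(choose_marker_set(num_unique=4, num_multi=1, min_unique=10, max_multi=10))
```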
List available taxonomic-specific marker sets.
> checkm taxon_list
Optional parameters:
- --rank: restrict list to specified taxonomic rank (e.g., life, domain, phylum, ...)
Generate taxonomic-specific marker set.
> checkm taxon_set <rank> <taxon> <marker file>
Required parameters:
- rank: taxonomic rank of desired taxonomic-specific marker set (e.g., domain)
- taxon: taxon of interest (e.g., Bacteria)
- marker_file: output file describing taxonomic-specific marker set
Optional parameters:
- -q, --quiet: suppress console output
Identify marker genes in bins.
> checkm analyze <marker file> <bin folder> <output folder>
Required parameters:
- marker_file: markers for assessing bins (marker set or HMM file)
- bin_folder: folder containing bins (fasta format)
- out_folder: folder to write output files
Optional parameters:
- --ali: generate HMMER alignment file for each bin in ./<output folder>/bins/<bin id>/hmmer.analyze.ali.txt
- --nt: generate nucleotide gene sequences for each bin in ./<output folder>/bins/<bin id>/prodigal.fna
- -x, --extension: extension of bins (other files in folder are ignored)
- -t, --threads: number of threads
- -q, --quiet: suppress console output
Outputs:
- called gene files are placed in ./<output folder>/bins/<bin id>/prodigal.*
- HMMER result files are placed in ./<output folder>/bins/<bin id>/hmmer.*
Assess bins for contamination and completeness.
> checkm qa <marker file> <output folder>
Required parameters:
- marker_file: marker file specified during analyze command
- analyze_folder: folder specified during analyze command
Optional parameters:
- -o, --out_format: specifies desired output (1-9)
  1. summary of bin completeness, contamination, and strain heterogeneity
     - Bin Id: bin identifier derived from input fasta file
     - Marker lineage: indicates lineage used for inferring marker set
     - 0-5+: number of marker genes identified 0, 1, 2, 3, 4, or 5 or more times
     - Completeness: estimated completeness
     - Contamination: estimated contamination
     - Strain heterogeneity: estimated strain heterogeneity
  2. extended summary of bin quality (includes GC, genome size, coding density, ...)
  3. summary of bin quality for increasingly basal lineage-specific marker sets
     - Node Id: unique id of internal node in genome tree from which lineage-specific markers were inferred
  4. list of marker genes for each bin along with the number of times each marker was identified
     - Node Id: unique id of internal node in genome tree from which lineage-specific markers were inferred
     - Marker lineage: indicates lineage used for inferring marker set
     - Useful for identifying lineage-specific gene loss or duplication
  5. list of bin id, marker gene id, and called gene id for each identified marker gene
  6. list of marker genes present multiple times in a bin
  7. list of marker genes present multiple times on the same scaffold
     - Useful for identifying true gene duplication events, gene calling errors, or assembly errors. See note below.
  8. list indicating position of each marker gene within a bin
  9. list of scaffold statistics: scaffold id, bin id, length, GC, ..., identified marker gene(s)
- --individual_markers: treat markers as independent (i.e., ignore co-located set structure)
- --skip_orf_correction: skip identification of ORF calling errors affecting marker genes
- --aai_strain: amino acid identity (AAI) threshold used to identify strain heterogeneity
- -a, --alignment_file: produce file showing alignment of multi-copy genes and their AAI identity which can be used to further assess strain heterogeneity
- --ignore_thresholds: ignore model-specific score thresholds
- -e, --e_value: e-value cut-off (not used if model-specific thresholds are specified)
- -l, --length: percent overlap between target and query (not used if model-specific thresholds are specified)
- -c, --coverage_file: file containing coverage of each sequence; coverage information is appended to table type 2 when this file is provided (see coverage command)
- -f, --file: print results to file instead of the console
- --tab_table: for tabular outputs, print a tab-separated values table instead of a table formatted for console output
- -t, --threads: number of threads
- -q, --quiet: suppress console output
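The commands above feed into one another: tree produces the folder consumed by lineage_set, lineage_set produces the marker file consumed by analyze, and qa reads the marker file together with the analyze output folder. A minimal Python sketch of that chain follows. The function name lineage_workflow and all file and folder names are hypothetical, and whether the tree and analyze steps should share one output folder is left to the caller.

```python
import shlex


def lineage_workflow(bin_folder, tree_folder, marker_file, analyze_folder, threads=1):
    """Return the four commands, in order, as argument lists."""
    t = str(threads)
    return [
        # 1. place bins in the reference genome tree
        ["checkm", "tree", "-t", t, bin_folder, tree_folder],
        # 2. infer a lineage-specific marker set for each bin
        ["checkm", "lineage_set", tree_folder, marker_file],
        # 3. identify marker genes in the bins
        ["checkm", "analyze", "-t", t, marker_file, bin_folder, analyze_folder],
        # 4. assess completeness and contamination (output format 1)
        ["checkm", "qa", "-o", "1", "-t", t, marker_file, analyze_folder],
    ]


# Hypothetical paths, for illustration only.
workflow = lineage_workflow("./bins", "./tree_out", "lineage.ms", "./analyze_out", threads=4)
for step in workflow:
    print(shlex.join(step))
```

Each list can be run in turn with subprocess.run(step, check=True), so a failing step stops the chain before later commands read incomplete output.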
Note: Adjacent called genes matching the same marker gene may indicate a true duplication event, a gene calling error, or an assembly error. If adjacent genes hit distinct regions of the same marker gene HMM, CheckM assumes a gene calling error has occurred and concatenates the two genes. When this occurs, CheckM joins the gene ids of the two genes with a pair of ampersands (&&).
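When post-processing qa output, gene ids joined in this way may need to be separated again. The Python sketch below assumes only the && separator described in the note. The function name split_gene_ids and the example gene ids are hypothetical.

```python
def split_gene_ids(gene_id):
    """Split a gene id into its original ids if CheckM joined them with '&&'."""
    return gene_id.split("&&")


# Hypothetical gene ids, for illustration only.
merged = split_gene_ids("scaffold_12_5&&scaffold_12_6")
single = split_gene_ids("scaffold_3_41")
print(merged)
print(single)
```

An id that yields more than one part marks a pair of adjacent genes that CheckM treated as a single marker gene hit.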