Installation
CheckM is designed to run on Linux. The limiting requirement for CheckM is memory. Inference of lineage-specific marker sets using the full reference genome tree requires approximately 40 GB of memory. However, a reduced genome tree (--reduced_tree) can also be used to infer lineage-specific marker sets, which is suitable for machines with as little as 16 GB of memory. We recommend using the full tree if possible, though our results suggest that the same lineage-specific marker set will be selected for the vast majority of genomes regardless of the underlying reference tree. System requirements are far more modest if you plan to use taxonomic-specific marker sets or your own custom marker genes, as this bypasses the need to place genomes in the reference genome tree.
If you plan to process a large number of genomes, you may wish to break these into smaller batches. On a 64 GB machine, running 1000 genomes at a time with 40 threads works well. Exceeding the available memory of your machine will cause CheckM to use swap space (as with any program), which will substantially increase the time needed to process genomes.
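The batching described above can be sketched in shell. This is a minimal sketch, not part of CheckM: the function name `make_batches`, the `batch_NNNN` directory naming, and the `fna` default extension are assumptions made for illustration. It only creates directories of symbolic links, so the original genome files are left untouched.

```shell
# Hypothetical helper (not part of CheckM): split the genome files in one
# directory into batch directories that each hold at most <batch_size> symlinks.
# Usage: make_batches <genome_dir> <batch_root> <batch_size> [extension]
make_batches() {
    genome_dir=$1; batch_root=$2; batch_size=$3; ext=${4:-fna}
    # Resolve an absolute path so the symlinks stay valid from anywhere.
    abs_dir=$(cd "$genome_dir" && pwd) || return 1
    i=0
    for f in "$abs_dir"/*."$ext"; do
        [ -e "$f" ] || continue   # skip when the glob matches nothing
        batch_dir=$(printf '%s/batch_%04d' "$batch_root" "$(( i / batch_size ))")
        mkdir -p "$batch_dir"
        ln -s "$f" "$batch_dir/"
        i=$(( i + 1 ))
    done
}

# Example use (directory names are placeholders), one CheckM run per batch:
#   make_batches genomes batches 1000 fna
#   for d in batches/batch_*; do
#       checkm lineage_wf -t 40 -x fna "$d" "checkm_out/$(basename "$d")"
#   done
```

Running the batches one after another keeps peak memory close to that of a single batch; the batch size and thread count should be adjusted to the machine at hand.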
CheckM requires the following programs to be added to your system path:
- HMMER (>=3.1b1)
- prodigal (2.60 or >=2.6.1)
  - executable must be named prodigal and not prodigal.linux
- pplacer (>=1.1)
  - guppy, which is part of the pplacer package, must also be on your system path
  - pplacer binaries can be found on the pplacer GitHub page
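A quick way to confirm that these programs are visible on your system path is sketched below. The function name `check_on_path` is hypothetical, and `hmmsearch` is used here only as a representative HMMER executable; the check looks for each name on the path and does not verify versions.

```shell
# Hypothetical helper (not part of CheckM): report which of the named
# executables can be found on the current PATH.
# Returns 0 if all are found, 1 if any is missing.
check_on_path() {
    missing=0
    for prog in "$@"; do
        if command -v "$prog" >/dev/null 2>&1; then
            echo "found:   $prog"
        else
            echo "missing: $prog"
            missing=1
        fi
    done
    return "$missing"
}

# Example use:
#   check_on_path hmmsearch prodigal pplacer guppy
```

Any program reported as missing needs to be installed or its directory added to your path before running CheckM.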
CheckM >=1.1.0 is a Python 3.x program and can be installed through pip:
> pip3 install numpy
> pip3 install matplotlib
> pip3 install pysam
> pip3 install checkm-genome
This will install CheckM and all other required Python libraries. The bioinformatic tool dependencies need to be installed separately and placed on your system path.
A CheckM Conda environment can also be set up as follows:
conda create -n checkm python=3.9
conda activate checkm
conda install -c bioconda numpy matplotlib pysam
conda install -c bioconda hmmer prodigal pplacer
pip3 install checkm-genome
A full Conda package for CheckM is also available here, which has been generously put together and maintained by community members (if this is you, please let me know so I can acknowledge you here!).
CheckM relies on a number of precalculated data files which can be downloaded from either:
The reference data must be decompressed into a directory, and the path to that directory set using the CHECKM_DATA_PATH environmental variable, e.g.:
> export CHECKM_DATA_PATH=/path/to/my_checkm_data
Alternatively, the following command can be run to inform CheckM of where the files have been placed:
> checkm data setRoot <checkm_data_dir>
Note: CheckM defaults to the environmental variable CHECKM_DATA_PATH if it is set.
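Because the environmental variable takes precedence, it can be worth confirming that it points at a real, non-empty directory before running CheckM. The following is a minimal sketch; the function name `checkm_data_path_ok` is hypothetical, and it checks only that the directory exists and contains something, not that the reference files are complete.

```shell
# Hypothetical helper (not part of CheckM): basic sanity check of the
# CHECKM_DATA_PATH environment variable.
checkm_data_path_ok() {
    if [ -z "${CHECKM_DATA_PATH:-}" ]; then
        echo "CHECKM_DATA_PATH is not set" >&2
        return 1
    fi
    if [ ! -d "$CHECKM_DATA_PATH" ]; then
        echo "CHECKM_DATA_PATH is not a directory: $CHECKM_DATA_PATH" >&2
        return 1
    fi
    if [ -z "$(ls -A "$CHECKM_DATA_PATH")" ]; then
        echo "CHECKM_DATA_PATH is empty: $CHECKM_DATA_PATH" >&2
        return 1
    fi
    echo "CHECKM_DATA_PATH looks usable: $CHECKM_DATA_PATH"
}

# Example use:
#   export CHECKM_DATA_PATH=/path/to/my_checkm_data
#   checkm_data_path_ok
```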
CheckM is now ready to run. For a list of CheckM commands type:
> checkm
You can upgrade CheckM through pip:
> pip3 install checkm-genome --upgrade --no-deps
The CheckM reference database is not expected to change until CheckM v2.
If you wish to test your installation, you can run CheckM's unit tests. This isn't necessary and is primarily meant for development purposes. However, some system administrators may find this useful. A general test of CheckM, which will verify all third-party dependencies, can be run using:
> checkm test ~/checkm_test_results
This runs the E. coli K12-W3310 genome through the standard CheckM pipeline and verifies the resulting output files. The output directory can be removed once the test has run.
Additional unit tests are provided in the test directory. These are designed to aid in development and make use of nose.
The CheckM lineage workflow is available at KBase for those looking for a web-based solution.