We introduce aggregate MIcrobiome COVerage (micov), a bioinformatic tool that efficiently computes precise, optionally-aggregated, genomic coverage positions across numerous metagenomes and arbitrary sample types. Micov offers three key advantages over conventional tools: rapid sample type-specific cumulative coverage calculations, identification of mobile or polymorphic genetic elements, and detection of strain heterogeneity through coverage variations.
The primary input mapping structure for micov is SAM/BAM or BED (3-column).
Coverage data can be aggregated into Qiita-like coverage.tgz
files. Per-sample
coverages can be then be harvested from multiple coverage.tgz
files.
Why coverage.tgz
files? Qiita provides a rich set of already computed
coverage data in a BED3 compatible format. Rather than invent
yet-another-format, we opted to establish functionality on what is readily
available from that resource.
We currently recommend creating a separate conda environment, and installing into that
$ conda env create -f micov.yml
$ conda activate micov
$ pip install micov
To install the most up-to-date version of micov
$ conda create -n micov -c conda-forge polars matplotlib scipy click tqdm numba duckdb pyarrow
$ conda activate micov
$ git clone https://github.com/biocore/micov.git
$ cd micov
$ pip install -e .
See below for examples of running micov on SAM files.
First, activate the Conda environment where micov is installed:
conda activate micov
If you already have tgz format coverage files from Qitta, go to step 4.
micov currently only processes headerless SAM/BAM files. If your input files contain headers, remove them using samtools
before running micov:
samtools view -S input.sam > output.sam
Now, create an output directory and run micov on each SAM file:
mkdir -p "./example/coverages"
for file in ./example/samfiles/*.sam.xz; do
filename=$(basename "$file" .sam.xz)
echo "Processing $file..."
# Run micov compress
xzcat $file | micov compress | gzip > "./example/coverages/${filename}.cov.gz"
done
After extracting coverage data, consolidate the .cov
files into a compressed .tgz
archive. This requires a length mapping file (length.tsv
), which maps genome IDs to their corresponding genome lengths. An example length file can be found in ./example/metadata/length.tsv
. If this file is not available, it can be generated using seqkit
:
seqkit fx2tab --length --name --header-line foo.fasta > length.tsv
Now, consolidate the coverage files:
mkdir -p "./example/consolidate"
find "./example/coverages" -type f -name '*.cov.gz' > "./example/consolidate/paths.txt"
micov consolidate \
--lengths "./example/metadata/length.tsv" \
--paths "./example/consolidate/paths.txt" \
--output "./example/consolidate/consolidated.tgz"
micov provides functionality to convert TGZ-formatted coverage data into Parquet format for efficient querying and processing:
mkdir -p "./example/parquet"
micov qiita-to-parquet \
--qiita-coverages "./example/consolidate/consolidated.tgz" \
--output "./example/parquet/example" \
--lengths "./example/metadata/length.tsv"
micov can generate per-sample-group plots based on sample metadata and a feature metadata file. Ensure that feature_metadata.txt
has a header like feature_id
as the first line and no blank lines at the end. This produces non-cumulative, cumulative, scaled, and unscaled position plots for each genome in feature metadata.
mkdir -p "./example/plots/per_sample_groups"
micov per-sample-group \
--parquet-coverage "./example/parquet/example" \
--sample-metadata "./example/metadata/sample_metadata.txt" \
--sample-metadata-column "dog" \
--features-to-keep "./example/metadata/feature_metadata.txt" \
--output "./example/plots/per_sample_groups/example" \
--plot
For Monte Carlo-based sample group plots, run:
mkdir -p "./example/plots/per_sample_groups_monte"
micov per-sample-group \
--parquet-coverage "./example/parquet/example" \
--sample-metadata "./example/metadata/sample_metadata.txt" \
--sample-metadata-column "dog" \
--features-to-keep "./example/metadata/feature_metadata.txt" \
--output "./example/plots/per_sample_groups_monte/example" \
--plot \
--monte "unfocused" \
--monte-iters 100
Exising .SAM/.BAM can be compressed into a BED-like format by file or pipe. A pipe example is shown below:
$ xzcat some_data.sam.xz | micov compress > compressed.tsv
Aggregate genome coverages can be calculated using
$ xzcat some_data.sam.xz | micov compress --length length.tsv > coverages.tsv
or
$ micov compress \
--data input.sam \
--output compressed_output.tsv \
--lengths genome-lengths.tsv \
--taxonomy taxonomy.tsv