- New `kraken.py` utility in `scripts` to create custom kraken2 databases with MetaSBT taxonomic labels;
- The `tar` and `install` commands have been replaced with the `pack.sh` and `unpack.sh` utilities in `scripts`.
- Clusters' boundaries are now defined as the minimum and maximum Average Nucleotide Identity (ANI) between all the genomes under a specific cluster.
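
As a rough illustration of the boundary definition above, the following sketch takes pairwise ANI values for the genomes of one cluster and returns their minimum and maximum. The function name `cluster_boundaries` and the dictionary-of-pairs input format are assumptions made for this example; they are not taken from the MetaSBT code base, and the ANI values are made up.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Tuple


def cluster_boundaries(
    genomes: Iterable[str],
    ani: Dict[FrozenSet[str], float],
) -> Tuple[float, float]:
    """Return (min, max) ANI over all genome pairs in a cluster.

    `ani` maps an unordered pair of genome names to its ANI value
    (hypothetical input format, chosen for this sketch only).
    """
    pairs = combinations(sorted(set(genomes)), 2)
    values = [ani[frozenset(pair)] for pair in pairs]
    if not values:
        raise ValueError("at least two genomes are needed to define boundaries")
    return min(values), max(values)


# Toy ANI values, made up for illustration only
toy_ani = {
    frozenset(("g1", "g2")): 97.5,
    frozenset(("g1", "g3")): 95.2,
    frozenset(("g2", "g3")): 96.1,
}
boundaries = cluster_boundaries(["g1", "g2", "g3"], toy_ani)
print(boundaries)
```
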
- New option `--uniform-strand` available with the `index` and `update` modules for processing the input sequences all on the same strand. Mainly used for viral sequences;
- New option `--use-representatives` available with the `index` module to use only three representative genomes at the species level;
- New option `--resume` available with the `index` and `update` modules to resume the index and update processes in case of unexpected errors;
- New `expand_fasta.py` utility in `scripts` to expand input fasta files into multiple files, one fasta file for each read. Mainly used for viral sequences;
- New `fastcluster.py` utility in `scripts` to compute an average-linkage hierarchical clustering of a set of genomes based on their Mash distances;
- Both the `index` and `update` modules now display a warning message if the configuration file under `--resume` was previously generated with a different version of MetaSBT;
- Both the `index` and `update` modules now integrate `CheckV` and `EukCC` for assessing the quality of viruses and eukaryotes; `CheckM` has been upgraded to `CheckM2`;
- The `cluster()` function in `utils` now runs in parallel;
- The `howdesbt bfdistance` command for computing the distances between bloom filters now runs in parallel.
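
The idea behind expanding a fasta file into one file per record can be sketched as follows. This is a minimal sketch and not the actual `expand_fasta.py` script: the function name `expand_fasta`, the numbered output file names, and the `.fasta` extension are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def expand_fasta(fasta_path: Path, out_dir: Path) -> List[Path]:
    """Write each record of a multi-FASTA file to its own file in `out_dir`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    handle = None
    try:
        with open(fasta_path) as source:
            for line in source:
                if line.startswith(">"):
                    # A header line starts a new record: close the previous file
                    if handle is not None:
                        handle.close()
                    # Number the files so that duplicate record IDs cannot clash
                    out_path = out_dir / f"{fasta_path.stem}_{len(written) + 1}.fasta"
                    handle = open(out_path, "w")
                    written.append(out_path)
                # Lines that appear before the first header are skipped
                if handle is not None:
                    handle.write(line)
    finally:
        if handle is not None:
            handle.close()
    return written


# Small demonstration on a temporary two-record file (toy sequences)
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "viruses.fasta"
    fasta.write_text(">seq1\nACGT\nTTGA\n>seq2\nGGCC\n")
    files = expand_fasta(fasta, Path(tmp) / "expanded")
    names = [f.name for f in files]
    contents = [f.read_text() for f in files]
print(names)
```

Numbering the output files rather than naming them after record identifiers avoids problems with identifiers that contain characters not allowed in file names; the real utility may follow a different naming scheme.
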
- It now correctly checks for new framework versions when starting a new `metasbt` instance;
- Fixed genome quality filtering on completeness and contamination during the `update`;
- Improved docstrings by adopting the numpydoc documentation format.
First public stable release of MetaSBT.
It is composed of the following modules:
- `index`: build a MetaSBT database by building a series of Sequence Bloom Trees at different taxonomic levels;
- `boundaries`: define taxonomy-specific boundaries as the minimum and maximum number of kmers in common between all the genomes under a specific cluster;
- `profile`: taxonomically profile a genome by querying a MetaSBT database at different taxonomic levels;
- `report`: build a report table describing the content of a MetaSBT database;
- `update`: update a MetaSBT database with new genomes;
- `tar`: pack a MetaSBT database into a ready-to-be-distributed tarball;
- `install`: install a MetaSBT database tarball locally under a specific location of the file system.
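
To make the kmer-based boundary definition of the `boundaries` module more concrete, here is a minimal sketch that counts the kmers shared by every pair of genomes in a cluster and reports the smallest and largest count. It uses plain Python sets on toy sequences rather than bloom filters, and the names `kmers` and `kmer_boundaries` are hypothetical; the module itself may compute these values differently.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of kmers in a sequence (no canonicalization, for brevity)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def kmer_boundaries(genomes: Dict[str, str], k: int) -> Tuple[int, int]:
    """Return (min, max) number of shared kmers over all genome pairs."""
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    shared = [len(sets[a] & sets[b]) for a, b in combinations(sets, 2)]
    if not shared:
        raise ValueError("at least two genomes are needed to define boundaries")
    return min(shared), max(shared)


# Toy sequences, made up for illustration only
cluster = {
    "g1": "ACGTACGT",
    "g2": "ACGTTCGT",
    "g3": "TTTTACGT",
}
kmer_bounds = kmer_boundaries(cluster, k=4)
print(kmer_bounds)
```
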
The framework also comes with a set of utilities:
- `bf_sketch.py`: build minimal bloom filter sketches with cluster-specific marker kmers;
- `esearch_txid.sh`: retrieve GCAs from NCBI GenBank given a specific taxonomic ID;
- `get_ncbi_genomes.py`: retrieve reference genomes and metagenome-assembled genomes under a specific superkingdom and kingdom from NCBI GenBank;
- `howdesbt_index.sh`: index genomes with HowDeSBT;
- `uniform_inputs.sh`: make the file extensions of the input genomes uniform.