-
Notifications
You must be signed in to change notification settings - Fork 3
Available features
The MetaSBT
framework provides a set of subroutines (here called modules) for (i) building a taxonomically-organized database of microbial genomes, (ii) defining boundaries for each of the clusters in the database at different taxonomic levels to explore and define the intra-cluster genetic diversity, (iii) profiling genomes and metagenome-assembled genomes by querying the database and reporting the closest lineage and reference genome, (iv) updating a database with new reference genomes and metagenome-assembled genomes by also defining new and yet-to-be-named clusters, (v) and finally producing a report of the current status of a database.
It follows a detailed explanation of all the available modules.
The index
subroutine allows to automatically retrieve genomes from isolate sequencing from the NCBI GenBank and organise them in folders that reflect their taxonomic classification. It finally makes use of howdesbt
to rapidly index all the genomes at the species level and create a sequence bloom tree for each of the species. Lower taxonomic levels are indexed by building new sequence bloom trees considering only the root nodes of the trees at the upper taxonomic levels.
The following command will trigger the generation of the database with all the available bacterial genomes from isolate sequencing in NCBI GenBank:
metasbt index --db-dir ~/myindex \
--estimate-filter-size \
--increase-filter-size 5.0 \
--estimate-kmer-size \
--superkingdom Bacteria \
--dereplicate \
--similarity 100.0 \
--completeness 50.0 \
--contamination 5.0 \
--parallel 4 \
--nproc 2 \
--pplacer-threads 2 \
--jellyfish-threads 2 \
--tmp-dir ~/tmp \
--cleanup \
--verbose
In case you would like to create a database from a specific set of genomes available on your file system, you may use the --input-list
instead of the --superkingdom
option. It should point to a table file with the list of file paths to the input genomes on the first column and an optional second column with the full taxonomic labels of the input genomes in a tab-separated values format.
Note: you may want to get rid of the taxonomic organization of genomes by specifying the
--flat-structure
option that will consider all the input genomes together for the generation of a single sequence bloom tree. Please note that the--flat-structure
option is not compatible with theupdate
module for updating the database with new genomes. Thus, in case you will need to update the sequence bloom tree, there are no other options than building the database from scratch with the new set of genomes.
Option | Default | Mandatory | Description |
---|---|---|---|
--cluster-prefix |
MSBT |
Prefix of clusters numerical identifiers | |
--completeness |
0.0 |
Minimum completeness percentage allowed for input genomes | |
--contamination |
100.0 |
Maximum contamination percentage allowed for input genomes | |
--cleanup |
False |
Remove temporary data at the end of the pipeline | |
--closely-related |
False |
For closely related genomes use this flag | |
--db-dir |
⚑ | Database folder path | |
--dereplicate |
False |
Dereplicate input genomes | |
--estimate-filter-size |
False |
Estimate the bloom filter size with ntCard. It automatically overrides the --filter-size option |
|
--estimate-kmer-size |
False |
Estimate the optimal kmer size with kitsune. It automatically overrides the --kmer-len option |
|
--filter-size |
Bloom filter size | ||
--flat-structure |
Organize genomes without any taxonomic organization. This will lead to the creation of a single sequence bloom tree | ||
--help |
Print the list of arguments and exit | ||
--increase-filter-size |
0.0 |
Increase the estimated filter size by the specified percentage. This is used in conjunction with the --estimate-filter-size argument only. It is highly recommended to increase the filter size by a good percentage in case you are planning to update the index with new genomes |
|
--input-list |
Path to the input table with a list of genome file paths and an optional column with their taxonomic labels. Please note that the input genome files must be gz compressed with fna extension (i.e.: *.fna.gz) | ||
--jellyfish-threads |
1 |
Maximum number of threads for Jellyfish. This is required to maximise the kitsune performances | |
--kingdom |
Consider genomes whose lineage belongs to a specific kingdom. It is optional and must be provided in conjunction with --superkingdom | ||
--kmer-len |
This is the length of the kmers used for building bloom filters | ||
--limit-estimation-number |
Limit the number of genomes per group to be considered as input for kitsune and ntCard. It overrides --limit-estimation-percentage in case of a number > 0 | ||
--limit-estimation-percentage |
100.0 |
Percentage on the total number of genomes per group to be considered as input for kitsune and ntCard | |
--limit-genomes |
Inf |
Limit the number of genomes per species. This will remove the exceeding number of genomes randomly to cut the overall number of genomes per species to this number | |
--limit-kmer-size |
32 |
Limit the estimation of the optimal kmer size with kitsune to this value at most | |
--log |
Path to the log file | ||
--max-genomes |
Inf |
Consider species with this number of genomes at most | |
--min-genomes |
1 |
Consider species with a minimum number of genomes | |
--nproc |
1 |
This argument refers to the number of processors used for parallelizing the pipeline when possible | |
--parallel |
1 |
Maximum number of processors to process each NCBI tax ID in parallel | |
--pplacer-threads |
1 |
Maximum number of threads for pplacer. This is required to maximise the CheckM performances | |
--similarity |
100.0 |
Dereplicate genomes if they have a percentage of common kmers greater than or equals to the specified one. This is used exclusively in conjunction with the --dereplicate argument |
|
--superkingdom |
Consider genomes whose lineage belongs to a specific superkingdom if --input-list is not provided | ||
--tmp-dir |
⚑ | Path to the temporary folder | |
--verbose |
False |
Print results on screen | |
--version |
Print current module version and exit |
The boundaries
module is crucial for the definition of taxonomy-specific boundaries. It explots howdesbt
to build a table with the minimum and maximum amount of kmers in common between all the genomes in a particular taxonomic level. These boundaries are then used by the update
module in order to establish whether a new genome must be assigned to the closest cluster identified by the profile
module.
The following command will trigger the definition of the kmer boundaries for each taxonomic level in the database:
metasbt boundaries --db-dir ~/myindex \
--min-genomes 50 \
--output ~/boundaries.tsv \
--tmp-dir ~/tmp \
--nproc 4 \
--cleanup \
--verbose
Note: the
boundaries
module considers clusters with reference genomes only. These clusters can be considered for establishing boundaries depending on a minimum number of reference genomes that can be set with the--min-genomes
argument.
Option | Default | Mandatory | Description |
---|---|---|---|
--cleanup |
False |
Remove temporary data at the end of the pipeline | |
--db-dir |
⚑ | Database directory with the taxonomically organised sequence bloom trees | |
--flat-structure |
False |
Genomes in the database have been organized without a taxonomic structure | |
--help |
Print the list of arguments and exit | ||
--superkingdom |
Consider genomes whose lineage belongs to a specific superkingdom | ||
--log |
Path to the log file | ||
--max-genomes |
Maximum number of genomes per cluster to be considered for computing boundaries. Genomes are selected randomly in case the size of clusters is greater than this number. This must be greater than or equals to --min-genomes | ||
--min-genomes |
3 |
Consider clusters with a minimum number of genomes only | |
--nproc |
1 |
This argument refers to the number of processors used for parallelizing the pipeline when possible | |
--output |
⚑ | Output file with kmer boundaries for each of the taxonomic labels in the database | |
--tmp-dir |
⚑ | Path to the temporary folder | |
--verbose |
False |
Print results on screen | |
--version |
Print current module version and exit |
The profile
module allows to characterize an input genome according to the closest lineage in the database. It allows to process only one genome in input at a time:
metasbt profile --input-file ~/mymag.fna \
--input-id mymag \
--input-type genome \
--tree ~/myindex/k__Bacteria/index.detbrief.sbt \
--threshold 0.7 \
--expand \
--stop-at family \
--output-dir ~/profiles \
--output-prefix mymag \
--verbose
Note: with the example proposed above, we explicitly set the
--stop-at
argument tofamily
. This argument works in conjunction with the--expand
option only, and it will prevent epanding the query to all the taxonomic levels lower than the specified one. Also note that the--expand
argument expands the input query up to the species level by default, by also reporting the closest genome, without the need to use the--stop-at
argument.
Option | Default | Mandatory | Description |
---|---|---|---|
--expand |
False |
Expand the input query on all the taxonomic levels | |
--input-file |
⚑ | Path to the input query | |
--input-id |
Unique identifier of the input query | ||
--input-type |
⚑ | Accepted input types are genomes or files with one nucleotide sequence per line. In case of genomes, if they contain multiple sequences, results are merged together | |
--log |
Path to the log file | ||
--output-dir |
⚑ | Output folders with queries results | |
--output-prefix |
Prefix of the output files with query matches | ||
--stop-at |
Stop expanding queries at a specific taxonomic level. Please note that this argument works in conjunction with --expand only | ||
--threshold |
0.0 |
Fraction of query kmers that must be present in a leaf to be considered a match | |
--tree |
⚑ | This is the tree definition file | |
--verbose |
False |
Print results on screen | |
--version |
Print current module version and exit |
⚠️ This module is still under development. We strongly suggest to avoid mentioning results produced with this specific module in scientific manuscripts until a stable version will be released
This module can be used to add new reference genomes and metagenome-assembled genomes (MAGs) to the database generated with the index
module.
In case of new MAGs, it first try to profile them by comparing the input genomes with those present in the database. An input genome is assigned to the closest genome in the database if their set of kmers result similar enough. In case the profiler will not be able to characterize the input genome, the update
subroutine will exploit howdesbt
to automatically group the input genomes according to the taxonomy-specific boundaries identified with the boundaries
utility, and it finally generate new clusters of potentially novel and yet-to-be-named species.
In case of new reference genomes from isolate sequencing, the update
module simply add the new genomes to the corresponding species by rebuilding the species trees and all the trees at the lower taxonomic levels (please note that the taxonomic labels of the input reference genomes are known). If the taxonomy of an input reference genome is not in the database, the module will compare the input genome with all the cluster of species with no references before generating a new branch in the index database. If the input genome results very close to a cluster of unknown genomes, its taxonomic label will be inherited by all the genomes in the unknown cluster.
The update
module can be launched with the following command:
metasbt update --input-list ~/mygenomes.txt \
--taxa ~/taxonomies.tsv \
--boundaries ~/boundaries.tsv \
--db-dir ~/myindex \
--tmp-dir ~/tmp \
--dereplicate \
--similarity 100.0 \
--completeness 50.0 \
--contamination 5.0 \
--type references \
--extension fna.gz \
--parallel 4 \
--nproc 2 \
--pplacer-threads 2 \
--cleanup \
--verbose
Please note that MetaSBT
requires that all your input genomes have the same format and extension before running the pipeline. You can easily uniform your genome files extension by running the uniform_inputs.sh
script under the scripts
folder of this repository:
sh ./scripts/uniform_inputs.sh ~/mygenomes fa fna
The first argument is the path to the folder with the set of input genome files, while the second and third arguments are the current and new file extensions respectively. In this particular case, all the *.fa
files are converted to *.fna
files.
In case of files with multiple extensions, please run the same script as many times as you need to uniform all the input genome files extensions, until the number of files with the new extension matches the actual number of files in the input folder:
sh ./scripts/uniform_inputs.sh ~/mygenomes fa fna
sh ./scripts/uniform_inputs.sh ~/mygenomes fasta fna
sh ./scripts/uniform_inputs.sh ~/mygenomes fa.gz fna.gz
Option | Default | Mandatory | Description |
---|---|---|---|
--boundaries |
⚑ | Path to the output table produced by the boundaries module |
|
--boundary-uncertainty |
0.0 |
Define the percentage of kmers to enlarge and reduce boundaries | |
--cluster-prefix |
MSBT |
Prefix of clusters numerical identifiers | |
--completeness |
0.0 |
Minimum completeness percentage allowed for input genomes | |
--contamination |
100.0 |
Maximum contamination percentage allowed for input genomes | |
--cleanup |
False |
Remove temporary data at the end of the pipeline | |
--db-dir |
⚑ | Database folder path | |
--dereplicate |
False |
Dereplicate input genomes | |
--extension |
Specify the input genome files extension. All the input genomes must have the same file extension before running this module | ||
--input-list |
⚑ | This file contains the list of paths to the new genomes that will be added to the database | |
--log |
Path to the log file | ||
--nproc |
1 |
This argument refers to the number of processors used for parallelizing the pipeline when possible | |
--parallel |
1 |
Maximum number of processors to process each NCBI tax ID in parallel | |
--pplacer-threads |
1 |
Maximum number of threads for pplacer. This is required to maximise the CheckM performances | |
--similarity |
100.0 |
Dereplicate genomes if they have a percentage of common kmers greater than or equals to the specified one. This is used exclusively in conjunction with the --dereplicate argument |
|
--taxa |
⚑ | Input file with the mapping between input genome IDs and their taxonomic label. This is used in case of reference genomes only | |
--tmp-dir |
⚑ | Path to the temporary folder | |
--type |
⚑ | Define the nature of the input genomes | |
--verbose |
False |
Print results on screen | |
--version |
Print current module version and exit |
Once the database is built and updated with new MAGs and reference genomes, you can easily extract relevant information about all the species in your database by running the following command:
metasbt report --db-dir ~/myindex \
--output-file ~/report.tsv
The output file is a table that will contain the number of MAGs and reference genomes, in addition to the mean completeness, contamination, and strain heterogeneity percentages for each lineage in the database. Please note that lineages with no reference genomes correspond to newly defined clusters and potentially new and still-to-be-named species.
Option | Mandatory | Description |
---|---|---|
--db-dir |
⚑ | Database folder path |
--output-file |
⚑ | Path to the output table |
--version |
Print current module version and exit |
MetaSBT | Releases | Wiki | MetaSBT-DBs | License