For workstations with limited RAM we provide a script to split the set of all reference genomes into multiple smaller partitions. The partitions created by the script are directories with symlinks to the original genome files. With default settings the resulting MetaCache database built from a partition will be about the same size as the combined uncompressed FASTA file sizes within the partition.
To split the set of all reference genomes into partitions with a given maximum size run
metacache-partition-genomes <path to genomes> <partition size in MB>
Then run metacache build
for each of the partition directories.
Create 16 GB partitions from 40 GB of reference genomes (which results in 3 partitions) and build databases from them:
metacache-partition-genomes path/to/mygenomes 16000
metacache build mydb_1 path/to/mygenomes_1 ...
metacache build mydb_2 path/to/mygenomes_2 ...
metacache build mydb_3 path/to/mygenomes_3 ...
After database construction
- query your read dataset(s) against each database partition with
metacache query
- merge the individual query results with
metacache merge
to obtain a global result based on all reference genomes
metacache query mydb_1 myreads.fa -tophits -queryids -lowest species -out res1.txt
metacache query mydb_2 myreads.fa -tophits -queryids -lowest species -out res2.txt
metacache query mydb_3 myreads.fa -tophits -queryids -lowest species -out res3.txt
metacache merge res1.txt res2.txt res3.txt -taxonomy ncbi_taxonomy
In order to be mergable, queries must be run with command line options
-tophits -queryids -lowest species
and must NOT be run with options that suppress or alter the default output
like, e.g.: -no-map
, -no-summary
, -separator
, etc.