mash
The mash reference file /db/RefSeqSketchesDefaults.msh found in the staphb/mash:2.3 Docker image is from RefSeq version 77. There is nothing particularly wrong with this file, but RefSeq version 216 was released January 13, 2023. Over time, the names of organisms may change, as may species boundaries. RefSeq also continues to grow with each release, so it is not feasible to keep a current mash reference file in this repository or in a container.
A more-current mash reference file prepared for Grandeur has been uploaded to Zenodo and can be downloaded via a browser or from the command line with
wget https://zenodo.org/record/7348463/files/rep-genomes.msh
Then set the params.mash_db parameter to your new file on the command line or in a config file.
params.mash_db = "/path/to/rep-genomes.msh"
This file was created with mash and the NCBI datasets command-line tool, using Grandeur/bin/new_mash_ref.sh:
# getting the ids for representative genomes
datasets summary genome taxon bacteria --reference --as-json-lines | \
dataformat tsv genome --fields accession,assminfo-refseq-category,organism-name --elide-header | \
grep representative | \
tee representative_genomes.txt | \
cut -f 1 > genome_ids.txt
# downloading genomes
datasets download genome accession --inputfile genome_ids.txt --filename rep-genomes.zip
# extracting genomes
unzip rep-genomes.zip
# combining genomes
cat ncbi_dataset/data/*/*.fna | sed 's/ /_/g' | sed 's/,//g' > rep-genomes.fasta
# sketching genomes
mash sketch -i -p 20 rep-genomes.fasta -o rep-genomes
Grandeur parses the output file from mash dist and is particular about genus and species being at the beginning of the id line (i.e. > ${genus}_${species}_...), so there may be compatibility issues with other mash references.
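To see whether another mash reference is likely to be compatible, the ids can be checked against that pattern before running the workflow. The following is a minimal sketch, not Grandeur's actual parser: it assumes the standard tab-delimited mash dist output (reference id, query id, distance, p-value, shared hashes) and that genus and species are the first two underscore-separated fields of the reference id. The function name genus_species_from_dist is hypothetical.

```shell
# Hypothetical helper (not part of Grandeur): reads `mash dist` output on stdin
# and prints genus, species, and distance as tab-separated columns.
# Assumes column 1 is the reference id and starts with Genus_species_...
genus_species_from_dist() {
  awk -F'\t' -v OFS='\t' '{
    # split the reference id on underscores
    n = split($1, parts, "_")
    if (n >= 2) {
      print parts[1], parts[2], $3
    } else {
      # ids without an underscore cannot follow the expected pattern
      print "unparsed id: " $1 > "/dev/stderr"
    }
  }'
}
```

For example, `mash dist rep-genomes.msh sample.fasta | genus_species_from_dist | head` (where sample.fasta is a placeholder query) should print recognizable genus and species names. Accession numbers or other tokens in the first two columns suggest the reference uses a different id layout.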
More information about this and other RefSeq releases can be found at https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/.