Master | |
Dev |
Subtype microbial whole-genome sequencing (WGS) data using SNV targeting k-mer subtyping schemes.
Includes 33 bp k-mer SNV subtyping schemes for Salmonella enterica subsp. enterica serovars Heidelberg, Enteritidis, and Typhimurium genomes developed by Genevieve Labbe et al., and for S. ser Typhi adapted from Wong et al., Britto et al., Rahman et al., and Klemm et al..
Works on genome assemblies (FASTA files) or reads (FASTQ files)! Accepts Gzipped FASTA/FASTQ files as input!
Also includes a Mycobacterium tuberculosis lineage scheme adapted from Coll et al. by Daniel Kein.
If you find the biohansel
tool useful, please cite as:
Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel. Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash. bioRxiv 2020.01.10.902056; doi: https://doi.org/10.1101/2020.01.10.902056
More in-depth information on running and installing biohansel can be found on the biohansel readthedocs page.
Each new build of biohansel
is automatically tested on Linux using Continuous Integration. biohansel
has been confirmed to work on Mac OSX (versions 10.13.5 Beta and 10.12.6) when installed with Conda.
These are the dependencies required for biohansel
:
- Python (>=v3.6)
- numpy >=1.12.1
- pandas >=0.20.1
- pyahocorasick >=1.1.6
- attrs
With Conda
Install biohansel
from Bioconda with Conda (Conda installation instructions):
# setup Conda channels for Bioconda and Conda-Forge (https://bioconda.github.io/#set-up-channels)
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
# install biohansel
conda install bio_hansel
Install biohansel
from PyPI with pip:
pip install bio_hansel
With pip from Github
Or install the latest master branch version directly from Github:
pip install git+https://github.com/phac-nml/biohansel.git@master
Install into Galaxy (version >= 17.01)
Install biohansel
from the main Galaxy toolshed:
https://toolshed.g2.bx.psu.edu/view/nml/biohansel/ba6a0af656a6
If you run hansel -h
, you should see the following usage statement:
usage: hansel [-h] [-s SCHEME] [--scheme-name SCHEME_NAME] [-M SCHEME_METADATA] [-p forward_reads reverse_reads] [-i fasta_path genome_name] [-D INPUT_DIRECTORY] [-o OUTPUT_SUMMARY] [-O OUTPUT_KMER_RESULTS] [-S OUTPUT_SIMPLE_SUMMARY] [--force] [--json] [--min-kmer-freq MIN_KMER_FREQ] [--min-kmer-frac MIN_KMER_FRAC] [--max-kmer-freq MAX_KMER_FREQ] [--low-cov-depth-freq LOW_COV_DEPTH_FREQ] [--max-missing-kmers MAX_MISSING_KMERS] [--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS] [--low-cov-warning LOW_COV_WARNING] [--max-intermediate-kmers MAX_INTERMEDIATE_KMERS] [--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS] [-v] [-V] [F [F ...]] BioHansel version 2.5.1: Subtype microbial genomes using SNV targeting k-mer subtyping schemes. Built-in schemes: * heidelberg: Salmonella enterica spp. enterica serovar Heidelberg * enteritidis: Salmonella enterica spp. enterica serovar Enteritidis * typhimurium: Salmonella enterica spp. enterica serovar Typhimurium * typhi: Salmonella enterica spp. enterica serovar Typhi * tb_lineage: Mycobacterium tuberculosis Developed by Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash. positional arguments: F Input genome FASTA/FASTQ files (can be Gzipped) optional arguments: -h, --help show this help message and exit -s SCHEME, --scheme SCHEME Scheme to use for subtyping (built-in: "heidelberg", "enteritidis", "typhi", "typhimurium", "tb_lineage"; OR user-specified: /path/to/user/scheme) --scheme-name SCHEME_NAME Custom user-specified SNP substyping scheme name -M SCHEME_METADATA, --scheme-metadata SCHEME_METADATA Scheme subtype metadata table (tab-delimited file with ".tsv" or ".tab" extension or CSV with ".csv" extension format accepted; MUST contain column called "subtype") -p forward_reads reverse_reads, --paired-reads forward_reads reverse_reads FASTQ paired-end reads -i fasta_path genome_name, --input-fasta-genome-name fasta_path genome_name input fasta file path AND genome name -D INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY directory of input fasta files (.fasta|.fa|.fna) or FASTQ files (paired FASTQ should have same basename with "_\d\.(fastq|fq)" postfix to be automatically paired) (files can be Gzipped) -o OUTPUT_SUMMARY, --output-summary OUTPUT_SUMMARY Subtyping summary output path (tab-delimited) -O OUTPUT_KMER_RESULTS, --output-kmer-results OUTPUT_KMER_RESULTS Subtyping kmer matching output path (tab-delimited) -S OUTPUT_SIMPLE_SUMMARY, --output-simple-summary OUTPUT_SIMPLE_SUMMARY Subtyping simple summary output path --force Force existing output files to be overwritten --json Output JSON representation of output files --min-kmer-freq MIN_KMER_FREQ Min k-mer freq/coverage --min-kmer-frac MIN_KMER_FRAC Proportion of k-mer required for detection (0.0 - 1) --max-kmer-freq MAX_KMER_FREQ Max k-mer freq/coverage --low-cov-depth-freq LOW_COV_DEPTH_FREQ Frequencies below this coverage are considered low coverage --max-missing-kmers MAX_MISSING_KMERS Decimal proportion of maximum allowable missing kmers before being considered an error. (0.0 - 1.0) --min-ambiguous-kmers MIN_AMBIGUOUS_KMERS Minimum number of missing kmers to be considered an ambiguous result --low-cov-warning LOW_COV_WARNING Overall kmer coverage below this value will trigger a low coverage warning --max-intermediate-kmers MAX_INTERMEDIATE_KMERS Decimal proportion of maximum allowable missing kmers to be considered an intermediate subtype. (0.0 - 1.0) --max-degenerate-kmers MAX_DEGENERATE_KMERS Maximum number of scheme k-mers allowed before quitting with a usage warning. Default is 100000 -t THREADS, --threads THREADS Number of parallel threads to run analysis (default=1) -v, --verbose Logging verbosity level (-v == show warnings; -vvv == show debug info) -V, --version show program's version number and exit
hansel -s heidelberg -vv -o results.tab -O match_results.tab /path/to/SRR1002850.fasta
Contents of results.tab
:
sample scheme subtype all_subtypes kmers_matching_subtype are_subtypes_consistent inconsistent_subtypes n_kmers_matching_all n_kmers_matching_all_total n_kmers_matching_positive n_kmers_matching_positive_total n_kmers_matching_subtype n_kmers_matching_subtype_total file_path SRR1002850 heidelberg 2.2.2.2.1.4 2; 2.2; 2.2.2; 2.2.2.2; 2.2.2.2.1; 2.2.2.2.1.4 1037658-2.2.2.2.1.4; 2154958-2.2.2.2.1.4; 3785187-2.2.2.2.1.4 True 202 202 17 17 3 3 SRR1002850.fasta
Contents of match_results.tab
:
kmername stitle pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen seq coverage is_trunc refposition subtype is_pos_kmer sample file_path scheme 775920-2.2.2.2 NODE_2_length_512016_cov_46.4737_ID_3 100.0 33 0 0 1 33 474875 474907 2.0000000000000002e-11 62.1 33 512016 GTTCAGGTGCTACCGAGGATCGTTTTTGGTGCG 1.0 False 775920 2.2.2.2 True SRR1002850 SRR1002850.fasta heidelberg negative3305400-2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 276235 276267 2.0000000000000002e-11 62.1 33 427905 CATCGTGAAGCAGAACAGACGCGCATTCTTGCT 1.0 False negative3305400 2.1.1.1 False SRR1002850 SRR1002850.fasta heidelberg negative3200083-2.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 170918 170950 2.0000000000000002e-11 62.1 33 427905 ACCCGGTCTACCGCAAAATGGAAAGCGATATGC 1.0 False negative3200083 2.1 False SRR1002850 SRR1002850.fasta heidelberg negative3204925-2.2.3.1.5 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 175760 175792 2.0000000000000002e-11 62.1 33 427905 CTCGCTGGCAAGCAGTGCGGGTACTATCGGCGG 1.0 False negative3204925 2.2.3.1.5 False SRR1002850 SRR1002850.fasta heidelberg negative3230678-2.2.2.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 201513 201545 2.0000000000000002e-11 62.1 33 427905 AGCGGTGCGCCAAACCACCCGGAATGATGAGTG 1.0 False negative3230678 2.2.2.1.1.1 False SRR1002850 SRR1002850.fasta heidelberg negative3233869-2.1.1.1.1 NODE_3_length_427905_cov_48.1477_ID_5 100.0 33 0 0 1 33 204704 204736 2.0000000000000002e-11 62.1 33 427905 CAGCGCTGGTATGTGGCTGCACCATCGTCATTA 1.0 False [Next 196 lines omitted.]
hansel -s heidelberg -vv -t 4 -o results.tab -O match_results.tab -p SRR5646583_forward.fastqsanger SRR5646583_reverse.fastqsanger
Contents of results.tab
:
sample scheme subtype all_subtypes kmers_matching_subtype are_subtypes_consistent inconsistent_subtypes n_kmers_matching_all n_kmers_matching_all_total n_kmers_matching_positive n_kmers_matching_positive_total n_kmers_matching_subtype n_kmers_matching_subtype_total file_path SRR5646583 heidelberg 2.2.1.1.1.1 2; 2.2; 2.2.1; 2.2.1.1; 2.2.1.1.1; 2.2.1.1.1.1 1983064-2.2.1.1.1.1; 4211912-2.2.1.1.1.1 True 202 202 20 20 2 2 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger
Contents of match_results.tab
:
seq freq sample file_path kmername is_pos_kmer subtype refposition is_kmer_freq_okay scheme ACGGTAAAAGAGGACTTGACTGGCGCGATTTGC 68 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 21097-2.2.1.1.1 True 2.2.1.1.1 21097 True heidelberg AACCGGCGGTATTGGCTGCGGTAAAAGTACCGT 77 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 157792-2.2.1.1.1 True 2.2.1.1.1 157792 True heidelberg CCGCTGCTTTCTGAAATCGCGCGTCGTTTCAAC 67 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 293728-2.2.1.1 True 2.2.1.1 293728 True heidelberg GAATAACAGCAAAGTGATCATGATGCCGCTGGA 91 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 607438-2.2.1 True 2.2.1 607438 True heidelberg CAGTTTTACATCCTGCGAAATGCGCAGCGTCAA 87 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 691203-2.2.1.1 True 2.2.1.1 691203 True heidelberg CAGGAGAAAGGATGCCAGGGTCAACACGTAAAC 33 SRR5646583 SRR5646583_forward.fastqsanger; SRR5646583_reverse.fastqsanger 944885-2.2.1.1.1 True 2.2.1.1.1 944885 True heidelberg [Next 200 lines omitted.]
hansel -s heidelberg -vv --threads <n_cpu> -o results.tab -O match_results.tab -D /path/to/fastas_or_fastqs/
biohansel
will only attempt to analyze the FASTA/FASTQ files within the specified directory and will not descend into any subdirectories!
Add subtype metadata to your analysis results with -M your-subtype-metadata.tsv:
hansel -s heidelberg \
-M your-subtype-metadata.tsv \
-o results.tab \
-O match_results.tab \
-D ~/your-reads-directory/
Your metadata table must contain a field with the field name subtype, e.g.
subtype | host_association | geoloc | genotype_alternative |
---|---|---|---|
1 | human | Canada | A |
2 | cow | USA | B |
biohansel
accepts metadata table files with the following formats and extensions:
Format | Extension | Example Filename |
---|---|---|
Tab-delimited table/tab-separated values (TSV) | .tsv | my-metadata-table.tsv |
Tab-delimited table/tab-separated values (TSV) | .tab | my-metadata-table.tab |
Comma-separated values (CSV) | .csv | my-metadata-table.csv |
Get the latest development code using Git from GitHub:
git clone https://github.com/phac-nml/biohansel.git
cd biohansel/
git checkout development
# Create a virtual environment (virtualenv) for development
virtualenv -p python3 .venv
# Activate the newly created virtualenv
source .venv/bin/activate
# Install biohansel into the virtualenv in "editable" mode
pip install -e .
Run tests with pytest:
# In the biohansel/ root directory, install pytest for running tests
pip install pytest
# Run all tests in tests/ directory
pytest
# Or run a specific test module
pytest -s tests/test_qc.py
Copyright Government of Canada 2017
Written by: National Microbiology Laboratory, Public Health Agency of Canada
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Gary van Domselaar: [email protected]