Skip to content

Commit

Permalink
Merge pull request #144 from phac-nml/development
Browse files Browse the repository at this point in the history
Release v2.6.0
  • Loading branch information
peterk87 authored Mar 5, 2021
2 parents d20a00b + 28f8bef commit e7f5b14
Show file tree
Hide file tree
Showing 23 changed files with 840 additions and 752 deletions.
39 changes: 25 additions & 14 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -121,19 +121,27 @@ If you run ``hansel -h``, you should see the following usage statement:
[-i fasta_path genome_name] [-D INPUT_DIRECTORY]
[-o OUTPUT_SUMMARY] [-O OUTPUT_KMER_RESULTS]
[-S OUTPUT_SIMPLE_SUMMARY] [--force] [--json]
[--min-kmer-freq MIN_KMER_FREQ]
[--min-kmer-freq MIN_KMER_FREQ] [--min-kmer-frac MIN_KMER_FRAC]
[--max-kmer-freq MAX_KMER_FREQ]
[--low-cov-depth-freq LOW_COV_DEPTH_FREQ]
[--max-missing-kmers MAX_MISSING_KMERS]
[--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS]
[--low-cov-warning LOW_COV_WARNING]
[--max-intermediate-kmers MAX_INTERMEDIATE_KMERS]
[--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS]
[-v] [-V]
[--max-intermediate-kmers MAX_INTERMEDIATE_KMERS]
[--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS] [-v]
[-V]
[F [F ...]]
Subtype microbial genomes using SNV targeting k-mer subtyping schemes.
Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg, Enteritidis, Typhi, and Typhimurium subtyping. Also includes a Mycobacterium tuberculosis scheme called 'tb_lineage'.
BioHansel version 2.5.1: Subtype microbial genomes using SNV targeting k-mer subtyping schemes.
Built-in schemes:
* heidelberg: Salmonella enterica spp. enterica serovar Heidelberg
* enteritidis: Salmonella enterica spp. enterica serovar Enteritidis
* typhimurium: Salmonella enterica spp. enterica serovar Typhimurium
* typhi: Salmonella enterica spp. enterica serovar Typhi
* tb_lineage: Mycobacterium tuberculosis
Developed by Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash.
positional arguments:
Expand All @@ -143,17 +151,19 @@ If you run ``hansel -h``, you should see the following usage statement:
-h, --help show this help message and exit
-s SCHEME, --scheme SCHEME
Scheme to use for subtyping (built-in: "heidelberg",
"enteritidis", "typhi", "typhimurium", "tb_lineage"; OR user-specified:
/path/to/user/scheme)
"enteritidis", "typhi", "typhimurium", "tb_lineage";
OR user-specified: /path/to/user/scheme)
--scheme-name SCHEME_NAME
Custom user-specified SNP substyping scheme name
-M SCHEME_METADATA, --scheme-metadata scheme_metadata
Scheme subtype metadata table (.TSV format accepted;
must contain column called "subtype")
-M SCHEME_METADATA, --scheme-metadata SCHEME_METADATA
Scheme subtype metadata table (tab-delimited file with
".tsv" or ".tab" extension or CSV with ".csv"
extension format accepted; MUST contain column called
"subtype")
-p forward_reads reverse_reads, --paired-reads forward_reads reverse_reads
FASTQ paired-end reads
-i fasta_path genome_name, --input-fasta-genome-name fasta_path genome_name
fasta file path to genome name pair
input fasta file path AND genome name
-D INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY
directory of input fasta files (.fasta|.fa|.fna) or
FASTQ files (paired FASTQ should have same basename
Expand All @@ -169,6 +179,8 @@ If you run ``hansel -h``, you should see the following usage statement:
--json Output JSON representation of output files
--min-kmer-freq MIN_KMER_FREQ
Min k-mer freq/coverage
--min-kmer-frac MIN_KMER_FRAC
Proportion of k-mer required for detection (0.0 - 1)
--max-kmer-freq MAX_KMER_FREQ
Max k-mer freq/coverage
--low-cov-depth-freq LOW_COV_DEPTH_FREQ
Expand All @@ -188,7 +200,7 @@ If you run ``hansel -h``, you should see the following usage statement:
to be considered an intermediate subtype. (0.0 - 1.0)
--max-degenerate-kmers MAX_DEGENERATE_KMERS
Maximum number of scheme k-mers allowed before
quitting with a usage warning. Default is 100,000
quitting with a usage warning. Default is 100000
-t THREADS, --threads THREADS
Number of parallel threads to run analysis (default=1)
-v, --verbose Logging verbosity level (-v == show warnings; -vvv ==
Expand All @@ -197,7 +209,6 @@ If you run ``hansel -h``, you should see the following usage statement:
Example Usage
=============

Expand Down
21 changes: 15 additions & 6 deletions bio_hansel/__init__.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,18 @@
# -*- coding: utf-8 -*-

__version__ = '2.5.0'
__version__ = '2.6.0'
program_name = 'bio_hansel'
program_summary = 'Subtype microbial genomes using SNV targeting k-mer subtyping schemes.'
program_desc = program_summary + '''
Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg, Enteritidis, Typhi, and Typhimurium subtyping. Also includes a Mycobacterium tuberculosis scheme called 'tb_lineage'.
Developed by Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash.
'''
program_summary = f'BioHansel version {__version__}: Subtype microbial genomes using SNV targeting k-mer subtyping ' \
f'schemes. '
program_desc = (f'{program_summary}\n\n'
f'Built-in schemes:\n\n'
f'* heidelberg: Salmonella enterica spp. enterica serovar Heidelberg\n'
f'* enteritidis: Salmonella enterica spp. enterica serovar Enteritidis\n'
f'* typhimurium: Salmonella enterica spp. enterica serovar Typhimurium\n'
f'* typhi: Salmonella enterica spp. enterica serovar Typhi\n'
f'* tb_lineage: Mycobacterium tuberculosis\n\n'
'Developed by Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, '
'James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, '
'Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, '
'Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Kim Ziebell, '
'Anil Nichani, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash.')
74 changes: 12 additions & 62 deletions bio_hansel/aho_corasick/__init__.py
Original file line number Diff line number Diff line change
@@ -1,61 +1,12 @@
# -*- coding: utf-8 -*-

from collections import defaultdict
from itertools import product
import logging
import sys

from ahocorasick import Automaton
import pandas as pd
from ahocorasick import Automaton

from ..parsers import parse_fasta, parse_fastq
from ..utils import revcomp
from ..const import bases_dict
from ..subtyping_params import SubtypingParams

def expand_degenerate_bases(seq):
"""List all possible kmers for a scheme given a degenerate base
Args:
Scheme_kmers from SNV scheme fasta file
Returns:
List of all possible kmers given a degenerate base or not
"""

return list(map("".join, product(*map(bases_dict.get, seq))))

def check_total_kmers(scheme_fasta, subtyping_params):
'''Checks that the number of kmers about to be created is not at too high a computation or time cost
Args:
scheme_fasta: Kmer sequences from the SNV scheme
Subtyping parameter max_degenerate_kmers: The max kmers allowed by the scheme
Returns:
None if created kmers < max degenerate kmers argument
Exits code with warning if created kmers > max degenerate kmers argument
'''
kmer_number = 0
for header, sequence in parse_fasta(scheme_fasta):
value = 1
for char in sequence:
length_key = len(bases_dict[char])
value = value * length_key
kmer_number = kmer_number + value
if kmer_number*2 > subtyping_params.max_degenerate_kmers:
return logging.error(
'''
Your current scheme contains "{}" kmers which is over the current max-degenerate-kmers check of "{}" (Maximum recommended k-mers is 100000).
It is not advised to run this scheme due to the time and memory usage required to give an output with this many kmers loaded.
If you still want to run this scheme, add the command line check of "--max-degenerate-kmers {}" at the end of your previous command.
'''.format(
kmer_number*2,
subtyping_params.max_degenerate_kmers,
kmer_number*2+1
)), sys.exit()
else:
return None
from ..utils import revcomp, expand_degenerate_bases


def init_automaton(scheme_fasta):
Expand All @@ -77,29 +28,29 @@ def init_automaton(scheme_fasta):
return A


def find_in_fasta(A: Automaton, fasta: str) -> pd.DataFrame:
def find_in_fasta(automaton: Automaton, fasta: str) -> pd.DataFrame:
"""Find scheme kmers in input fasta file
Args:
A: Aho-Corasick Automaton with scheme SNV target kmers loaded
automaton: Aho-Corasick Automaton with scheme SNV target kmers loaded
fasta: Input fasta path
Returns:
Dataframe with any matches found in input fasta file
"""
res = []
for contig_header, sequence in parse_fasta(fasta):
for idx, (kmername, kmer_seq, is_revcomp) in A.iter(sequence):
for idx, (kmername, kmer_seq, is_revcomp) in automaton.iter(sequence):
res.append((kmername, kmer_seq, is_revcomp, contig_header, idx))
df = pd.DataFrame(res, columns=['kmername', 'seq', 'is_revcomp', 'contig_id', 'match_index'])
return df
columns = ['kmername', 'seq', 'is_revcomp', 'contig_id', 'match_index']
return pd.DataFrame(res, columns=columns)


def find_in_fastqs(A: Automaton, *fastqs):
def find_in_fastqs(automaton: Automaton, *fastqs):
"""Find scheme kmers in input fastq files
Args:
A: Aho-Corasick Automaton with scheme SNV target kmers loaded
automaton: Aho-Corasick Automaton with scheme SNV target kmers loaded
fastqs: Input fastq file paths
Returns:
Expand All @@ -108,11 +59,10 @@ def find_in_fastqs(A: Automaton, *fastqs):
kmer_seq_counts = defaultdict(int)
for fastq in fastqs:
for _, sequence in parse_fastq(fastq):
for idx, (_, kmer_seq, _) in A.iter(sequence):
for idx, (_, kmer_seq, _) in automaton.iter(sequence):
kmer_seq_counts[kmer_seq] += 1
res = []
for kmer_seq, freq in kmer_seq_counts.items():
kmername, sequence, _ = A.get(kmer_seq)
kmername, sequence, _ = automaton.get(kmer_seq)
res.append((kmername, kmer_seq, freq))
df = pd.DataFrame(res, columns=['kmername', 'seq', 'freq'])
return df
return pd.DataFrame(res, columns=['kmername', 'seq', 'freq'])
124 changes: 58 additions & 66 deletions bio_hansel/const.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# -*- coding: utf-8 -*-

import re

from pkg_resources import resource_filename

from bio_hansel import program_name
Expand All @@ -13,82 +14,73 @@
'version': '1.0.7',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
'typhi': {'file': resource_filename(program_name, 'data/typhi/kmers.fasta'),
'version': '1.3.0',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
'version': '1.3.0',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
'tb_lineage': {'file': resource_filename(program_name, 'data/tb_lineage/kmers.fasta'),
'version': '1.0.5',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
'version': '1.0.5',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)},
'typhimurium': {'file': resource_filename(program_name, 'data/typhimurium/kmers.fasta'),
'version': '0.5.5',
'subtyping_params': SubtypingParams(low_coverage_depth_freq=20)}}


bases_dict = {
'A': ['A'],
'C': ['C'],
'G': ['G'],
'T': ['T'],
'R': ['A', 'G'],
'Y': ['C', 'T'],
'S': ['G', 'C'],
'W': ['A', 'T'],
'K': ['G', 'T'],
'M': ['A', 'C'],
'B': ['C', 'G', 'T'],
'D': ['A', 'G', 'T'],
'H': ['A', 'C', 'T'],
'V': ['A', 'C', 'G'],
'N': ['A', 'C', 'G', 'T'],}

'A': ['A'],
'C': ['C'],
'G': ['G'],
'T': ['T'],
'R': ['A', 'G'],
'Y': ['C', 'T'],
'S': ['G', 'C'],
'W': ['A', 'T'],
'K': ['G', 'T'],
'M': ['A', 'C'],
'B': ['C', 'G', 'T'],
'D': ['A', 'G', 'T'],
'H': ['A', 'C', 'T'],
'V': ['A', 'C', 'G'],
'N': ['A', 'C', 'G', 'T'], }

COLUMNS_TO_REMOVE = '''
pident
length
mismatch
gapopen
qstart
qend
sstart
send
evalue
bitscore
qlen
slen
sseq
is_trunc
coverage
'''.strip().split('\n')
COLUMNS_TO_REMOVE = ['pident',
'length',
'mismatch',
'gapopen',
'qstart',
'qend',
'sstart',
'send',
'evalue',
'bitscore',
'qlen',
'slen',
'sseq',
'is_trunc',
'coverage', ]

# These are present within the subtype module.
SUBTYPE_SUMMARY_COLS = """
sample
scheme
scheme_version
subtype
all_subtypes
kmers_matching_subtype
are_subtypes_consistent
inconsistent_subtypes
n_kmers_matching_all
n_kmers_matching_all_expected
n_kmers_matching_positive
n_kmers_matching_positive_expected
n_kmers_matching_subtype
n_kmers_matching_subtype_expected
file_path
avg_kmer_coverage
qc_status
qc_message
""".strip().split('\n')

SUBTYPE_SUMMARY_COLS = ['sample',
'scheme',
'scheme_version',
'subtype',
'all_subtypes',
'kmers_matching_subtype',
'are_subtypes_consistent',
'inconsistent_subtypes',
'n_kmers_matching_all',
'n_kmers_matching_all_expected',
'n_kmers_matching_positive',
'n_kmers_matching_positive_expected',
'n_kmers_matching_subtype',
'n_kmers_matching_subtype_expected',
'file_path',
'avg_kmer_coverage',
'qc_status',
'qc_message', ]

SIMPLE_SUMMARY_COLS = """
sample
subtype
coverage
qc_status
qc_message
""".strip().split('\n')
SIMPLE_SUMMARY_COLS = ['sample',
'subtype',
'coverage',
'qc_status',
'qc_message', ]

REGEX_FASTQ = re.compile(r'^(.+)\.(fastq|fq|fastqsanger)(\.gz)?$')
REGEX_FASTA = re.compile(r'^.+\.(fasta|fa|fna|fas)(\.gz)?$')
Expand Down
Loading

0 comments on commit e7f5b14

Please sign in to comment.