Releases: phac-nml/ecoli_serotyping
v2.0.0: Pathotyping and Shiga toxin typing
- Updated species identification module, now based on GTDB plus a custom Escherichia and Shigella sketch, covering all known bacterial species
- Implemented pathotyping covering 7 DEC Escherichia coli pathotypes (DAEC, EAEC, EHEC, EIEC, EPEC, ETEC and STEC), supporting simultaneous presence of multiple signatures (e.g. ETEC/STEC). Note that EHEC is reported as EHEC-STEC, as this is a more severe subtype of STEC.
- Implemented Shiga toxin 1 and 2 typing, supporting multiple toxin signatures present in a single sample.
  - A total of 4 stx1 subtypes are supported: stx1a, stx1c, stx1d and stx1e.
  - A total of 15 stx2 subtypes are supported: stx2a, stx2b, stx2c, stx2d, stx2e, stx2f, stx2g, stx2h, stx2i, stx2j, stx2k, stx2l, stx2m, stx2n and stx2o.
- new database of pathotypes and toxins in a clear, transparent JSON format, composed of the key virulence factors based on both BioNumerics and literature sources
- support for gzip-compressed inputs (fastq.gz and fasta.gz), saving storage and increasing versatility
- other toxin typing covering enterohemolysin A (ehxA), hemolysin E (hlyE) and hemolysin A (hlyA)
- support for long raw reads, improving mapping capabilities of bowtie2
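As an illustration of the gzip-compressed input support, here is a minimal Python sketch of how a tool can accept both plain and gzip-compressed FASTA/FASTQ files through one code path. It is not taken from the ectyper code base: the helper names open_maybe_gzip and count_fasta_records are hypothetical, and detection by gzip magic bytes is an assumption about one reasonable approach.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if it is gzip-compressed.

    Hypothetical helper: detection relies on the two gzip magic bytes
    (0x1f 0x8b) rather than on the file extension.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fasta_records(path):
    """Count FASTA records (header lines starting with '>') in a file."""
    with open_maybe_gzip(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Checking the magic bytes instead of the .gz suffix means a renamed or mislabelled file is still read correctly.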
v1.0.0: E.coli serotyping with QC module and adaptive thresholding
Major improvements:
- Incorporation of a Quality Control module allowing easier interpretation of results and identification of any need for corrective measures (re-sequencing, wet-lab serotyping). Unique thresholding at the allele level determines whether a given allele and the query quality parameters (%identity and %coverage) are sufficient to resolve an antigen call unambiguously.
- Cluster-friendly behaviour supporting multiple instances via a .lock file, preventing race conditions and simultaneous database updates by several instances
- An updated database of alleles with the removal of duplicated or truncated alleles (e.g. O157 antigen)
- Improved species identification resolution for highly similar non-E. coli species such as Shigella and E. albertii. Species identification is now done only via the MASH NCBI RefSeq sketch (https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh)
- Users can add new alleles to an existing allele database and make serotype predictions via a custom allele database thanks to the --dbpath parameter
- Improved O and H antigen call rates and accuracy thanks to decoupling of the %identity and %coverage thresholds for each antigen. Global thresholds can now be specified separately. This is especially important if one of the antigen genes (e.g. wzx/wzy or fliC, etc.) is truncated or has low coverage
- Improved adaptive O antigen calling rates when only a single O antigen candidate is available in the preliminary BLAST results, allowing an accurate O antigen call even in poorly sequenced samples with minimal coverage.
- Addition of mixed O antigen calls for highly similar O antigens (e.g. O17/O77)
- Allele names/keys used to make antigen calls are also reported, making it easier to troubleshoot dubious alleles and clean the allele database
- More detailed error messages and support for 16 high similarity O-antigens (%identity > 99%) based on the reference publication PMID: 25428893
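The idea of decoupled %identity and %coverage thresholds per antigen can be sketched roughly as follows. This is an illustrative Python example only, not the tool's implementation: the Hit class, the passes function and all threshold values are hypothetical and do not reflect the actual defaults.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    allele: str        # allele name/key, e.g. "wzx-O157" (made-up label)
    antigen_type: str  # "O" or "H"
    identity: float    # percent identity of the alignment
    coverage: float    # percent of the allele length covered


# Hypothetical thresholds: each antigen type has its own identity and
# coverage cut-offs, so one can be relaxed without affecting the other.
THRESHOLDS = {
    "O": {"identity": 90.0, "coverage": 10.0},
    "H": {"identity": 95.0, "coverage": 50.0},
}


def passes(hit, thresholds=THRESHOLDS):
    """Return True if the hit meets both cut-offs for its antigen type."""
    limits = thresholds[hit.antigen_type]
    return (hit.identity >= limits["identity"]
            and hit.coverage >= limits["coverage"])
```

With separate cut-offs, a truncated O antigen gene with low coverage can still pass while a stricter coverage requirement is kept for H antigens.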
Minor bug fixes in species identification and increased robustness of the --verify switch
Merge pull request #78 from kbessonov1984/master Version 0.9.1 addressing minor issues on species identification and fasta files handling
E.coli serotyping with ability to differentiate between Shigella and other Escherichia cryptic species
- improved O-antigen serotyping coverage of complex samples that lack some O-antigen signatures
- better handling of complex cases and error recovery in cases of poor reference allele coverage
- improved O-antigen identification precision favoring the presence of both alleles (e.g. wzx and wzy) to support the final call. The sum of scores for both alleles of the same antigen is now used in ranking
- automatic download and update of RefSeq genome sketches every 6 months
- addition of Quality Control flags in the output (as an extra column in the results.tsv) for ease of results interpretation
- improved species identification for the FASTQ files. All raw reads are used for species identification
- query length coverage default threshold lowered from 50% to 10% to account for truncated alleles. This greatly improved the sensitivity of the tool without significantly changing specificity
- wrote additional unit tests to cover all aspects of the program
- file lock application when updating RefSeq sketch and assembly stats files
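The ranking by summed allele scores mentioned above can be illustrated with a short Python sketch. It is a simplified assumption of how such ranking might work, not the tool's actual code: the function name rank_o_antigens, the tuple layout and the example scores are all hypothetical, and scores are assumed to be non-negative.

```python
from collections import defaultdict


def rank_o_antigens(hits):
    """Rank O antigens by the sum of their best per-gene scores.

    hits: iterable of (antigen, gene, score) tuples, e.g. ("O157", "wzx", 0.9).
    Only the best score per gene is kept, so an antigen supported by both
    wzx and wzy outranks one supported by a single strong allele.
    """
    best = defaultdict(dict)
    for antigen, gene, score in hits:
        best[antigen][gene] = max(score, best[antigen].get(gene, 0.0))
    totals = {antigen: sum(genes.values()) for antigen, genes in best.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up example: O157 has both alleles, O26 has only wzx.
example_hits = [
    ("O157", "wzx", 0.9),
    ("O157", "wzy", 0.8),
    ("O26", "wzx", 0.95),
]
ranking = rank_o_antigens(example_hits)
```

In this example O157 ranks first because its two alleles together outscore the single, slightly better O26 allele.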