Releases · phac-nml/ecoli_serotyping

12 Dec 12:30

kbessonov1984

2.0.0

21f2cbd

v2.0.0: Pathotyping and Shiga toxin typing

Updated the species identification module, which is now based on a GTDB sketch plus a custom Escherichia and Shigella sketch and covers all known bacterial species
Implemented pathotyping covering 7 diarrheagenic Escherichia coli (DEC) pathotypes (DAEC, EAEC, EHEC, EIEC, EPEC, ETEC and STEC), with support for multiple signatures present simultaneously (e.g. ETEC/STEC). Note that EHEC is reported as EHEC-STEC because it is a more severe subtype of STEC.
Implemented Shiga toxin 1 and 2 (stx1 and stx2) typing, with support for multiple toxin signatures present in a single sample.
- A total of 4 stx1 subtypes are supported: stx1a, stx1c, stx1d and stx1e.
- A total of 15 stx2 subtypes are supported: stx2a, stx2b, stx2c, stx2d, stx2e, stx2f, stx2g, stx2h, stx2i, stx2j, stx2k, stx2l, stx2m, stx2n, stx2o.
New database of pathotypes and toxins in a clear, transparent JSON format, composed of the key virulence factors drawn from both BioNumerics and literature sources
Support for gzip-compressed inputs (fastq.gz and fasta.gz), saving storage and increasing versatility
Other toxin typing covering enterohemolysin A (ehxA), hemolysin E (hlyE) and hemolysin A (hlyA)
Support for long raw reads, improving the mapping capabilities of bowtie2
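
As an illustration of the gzip input support listed above, the following is a minimal Python sketch of how a FASTA or FASTQ file can be opened transparently whether or not it is compressed. It is not taken from the tool's source code; the function names `open_sequence_file` and `count_fasta_records` are hypothetical, and detecting gzip by its magic bytes rather than by file extension is an assumption made for this example.

```python
import gzip


def open_sequence_file(path):
    """Open a FASTA/FASTQ file in text mode, whether or not it is gzipped.

    Hypothetical helper for illustration only.
    """
    # gzip streams always start with the two magic bytes 0x1f 0x8b
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fasta_records(path):
    """Count FASTA records by counting header lines that start with '>'."""
    with open_sequence_file(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

With such a helper, `count_fasta_records("sample.fasta")` and `count_fasta_records("sample.fasta.gz")` go through the same code path, so downstream steps do not need to know whether the input was compressed.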

24 Apr 05:39

kbessonov1984

1.0.0

0aac51a

v1.0.0: E.coli serotyping with QC module and adaptive thresholding

Major improvements:

Incorporation of a Quality Control module that makes results easier to interpret and flags any need for corrective measures (re-sequencing, wet-lab serotyping). Unique thresholding at the allele level determines whether a given allele and its query quality parameters (%identity and %coverage) are sufficient to resolve an antigen call unambiguously.
Cluster-friendly behaviour: multiple instances are supported via a .lock file that prevents race conditions and simultaneous database updates by several instances
An updated database of alleles with the removal of duplicated or truncated alleles (e.g. O157 antigen)
Improved species identification resolution for highly similar non-E. coli species such as Shigella and E. albertii. Species identification is now done only via the MASH NCBI RefSeq sketch (https://gembox.cbcb.umd.edu/mash/refseq.genomes.k21s1000.msh)
Users can add new alleles to an existing allele database and make serotype predictions from a custom allele database via the --dbpath parameter
Improved O and H antigen call rates and accuracy thanks to decoupling of the %identity and %coverage thresholds for each antigen. Global thresholds can now be specified separately. This is especially important if one of the antigen genes (e.g. wzx/wzy or fliC) is truncated or has low coverage
Improved adaptive O antigen calling when only a single O antigen candidate is available in the preliminary BLAST results, allowing an accurate O antigen call even in poorly sequenced samples with minimal coverage
Addition of mixed O antigen calls for highly similar O antigens (e.g. O17/O77)
Allele names/keys used to make antigen calls are also reported, making it easier to troubleshoot dubious alleles and clean the allele database
More detailed error messages and support for 16 high similarity O-antigens (%identity > 99%) based on the reference publication PMID: 25428893
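
The .lock file behaviour described above can be sketched in Python as follows. This is an illustrative pattern only, not the tool's actual implementation: the context manager name `file_lock`, its `timeout` and `poll_interval` parameters, and the choice to store the process ID in the lock file are all assumptions. The sketch relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists, so only one instance can create the lock at a time.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path, timeout=60.0, poll_interval=0.5):
    """Hold an exclusive lock file for the duration of a with-block.

    Hypothetical helper: waits up to `timeout` seconds for another
    instance to release the lock, then raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock {lock_path}")
            time.sleep(poll_interval)
    try:
        # Record the owning process ID to help diagnose stale locks
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))
        yield
    finally:
        # Release the lock so that waiting instances can proceed
        os.remove(lock_path)
```

A database update would then be wrapped as `with file_lock(".lock"): ...`, so that a second instance started on the same cluster waits instead of writing to the database at the same time. A lock left behind by a crashed process is not handled in this sketch.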

07 Dec 15:05

kbessonov1984

0.9.1

a7e67b6

Minor bug fixes in species identification and increased robustness of the --verify switch

Merge pull request #78 from kbessonov1984/master

Version 0.9.1 addressing minor issues with species identification and FASTA file handling

05 Oct 23:38

kbessonov1984

0.9.0

bcc0e6a

E. coli serotyping with the ability to differentiate between Shigella and other cryptic Escherichia species

improved O-antigen serotyping coverage of complex samples that lack some O-antigen signatures
better handling of complex cases and error recovery in cases of poor reference allele coverage
improved O-antigen identification precision favoring the presence of both alleles (e.g. wzx and wzy) to support the final call. The sum of scores for both alleles of the same antigen is now used in ranking
automatic download and update of RefSeq genome sketches every 6 months
addition of Quality Control flags in the output (as an extra column in the results.tsv) for ease of results interpretation
improved species identification for FASTQ files. All raw reads are used for species identification
query length coverage default threshold lowered from 50% to 10% to account for truncated alleles. This greatly improved the sensitivity of the tool without significantly changing specificity
additional unit tests written to cover all aspects of the program
file lock applied when updating the RefSeq sketch and assembly stats files
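
The ranking by summed allele scores mentioned above can be illustrated with a short Python sketch. The release notes do not specify how scores are computed or stored, so the `(antigen, gene, score)` tuple layout, the function name `rank_o_antigens`, and the assumption that scores are positive numbers are all hypothetical choices made for this example.

```python
from collections import defaultdict


def rank_o_antigens(allele_hits):
    """Rank O-antigen candidates by the summed score of their alleles.

    allele_hits: iterable of (antigen, gene, score) tuples, for example
    ("O157", "wzx", 0.9). Scores are assumed to be positive (hypothetical).
    """
    best = defaultdict(dict)
    for antigen, gene, score in allele_hits:
        # Keep only the best-scoring hit per gene for each antigen
        if score > best[antigen].get(gene, 0.0):
            best[antigen][gene] = score
    # An antigen supported by both wzx and wzy accumulates two scores
    totals = {antigen: sum(genes.values()) for antigen, genes in best.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Under this scheme a candidate supported by both wzx and wzy outranks a candidate with a single, slightly better-scoring allele, which is the effect of favoring the presence of both alleles.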