Updated README.md to follow DAAD format and also updated url to Species ID MASH database now hosted in Zenodo.org
kbessonov1984 committed Sep 12, 2024
1 parent d70a025 commit 1d3ba78
Showing 5 changed files with 149 additions and 68 deletions.
176 changes: 129 additions & 47 deletions README.md
Expand Up @@ -9,8 +9,77 @@
`ECTyper` is a standalone versatile serotyping module for _Escherichia coli_. It supports both _fasta_ (assembled) and _fastq_ (raw reads) file formats.
The tool provides convenient species identification coupled to a quality control module, giving a complete and transparent report on E.coli serotyping that is suitable for reference laboratories.

# Introduction
*Escherichia coli* is a priority foodborne pathogen of public health concern and a popular model organism. Phenotypic characterization such as serotyping, toxin typing and pathotyping provides critical information for surveillance, outbreak detection and research activities, including source attribution, outbreak cluster assignment, assessment of pathogenicity potential, risk assessment and others.

# Dependencies:
`ECTyper` uses whole-genome sequencing (WGS) data for E.coli characterization including species identification, *in silico* serotyping covering O and H antigens, Shiga toxin typing and DEC pathotyping. It is a versatile, scalable, easy-to-use tool that provides key information on E.coli from both raw and assembled inputs.

As WGS becomes standard within public health and research laboratories, it is important to harness the high throughput and resolution of this technology to provide accurate and rapid typing of E.coli at scale in public health, clinical and research contexts.

## Citation
Bessonov, Kyrylo, Chad Laing, James Robertson, Irene Yong, Kim Ziebell, Victor PJ Gannon, Anil Nichani, Gitanjali Arya, John HE Nash, and Sara Christianson. "ECTyper: in silico Escherichia coli serotype and species prediction from raw and assembled whole-genome sequence data." Microbial genomics 7, no. 12 (2021): 000728. [https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000728](https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000728)

## Contact
For any questions, issues or comments please open a GitHub issue or reach out to [Kyrylo Bessonov]([email protected]).

# Installation
Multiple installation options are available depending on the user context and needs. The most convenient option is the `conda` package, as it installs all required dependencies.

### Images
Docker and Singularity images are also available from [https://biocontainers.pro/tools/ectyper](https://biocontainers.pro/tools/ectyper). They can be useful for Nextflow workflows or hassle-free deployment.

### Databases
ECTyper uses multiple databases:
- the species identification database is available from [https://zenodo.org/records/10211569](https://zenodo.org/records/10211569)
- the O and H antigen allele sequences are stored in [ectyper_alleles_db.json](ectyper/Data/ectyper_alleles_db.json)
- the toxin and pathotype signature marker sequences are stored in [ectyper_patho_stx_toxin_typing_database.json](ectyper/Data/ectyper_patho_stx_toxin_typing_database.json)

## Option 1: As a conda package
Optionally if you do not have a conda environment, get and install `miniconda` or `anaconda`:

```
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p $HOME/miniconda
echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc
source ~/.bashrc
```

Install the latest `ectyper` conda package from the `bioconda` channel:

```
conda install -c bioconda ectyper
```

## Option 2: Install using pip
Install using the `pip3` utility. This installs the Python dependencies but not the [non-python dependencies](#dependencies), which must be installed separately.
```
pip3 install ectyper
```
## Option 3: From source code
The third option is to install from source, which gives maximum control over the installation process.

Install the non-python dependencies. On Ubuntu run:
```
apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk
```

Install python dependencies via `pip`:
```
pip3 install pandas biopython
```
Clone the repository and, optionally, check out a particular release (e.g. `v1.0.0`, `v2.0.0` etc.):
```
git clone https://github.com/phac-nml/ecoli_serotyping.git
cd ecoli_serotyping # enter the cloned repository before running further git commands
git checkout v1.0.0 # optionally check out a specific release version
```

Finally, install ectyper
```
python3 setup.py install # option 1
pip3 install . # option 2
```
## Compatibility
### Dependencies:
- python >= 3.5
- bcftools >= 1.8
- blast == 2.7.1
Expand All @@ -19,58 +88,26 @@ The tool provides convenient species identification coupled to quality control m
- bowtie2 >= 2.3.4.1
- mash >= 2.0

# Python packages:
### Python packages:
- biopython >= 1.70
- pandas >= 0.23.1
- requests >= 2.0
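
Before running `ectyper` from a `pip` or source installation, it can help to confirm that the external tools are visible on `PATH`. The following is a minimal sketch, not part of ECTyper itself. The executable names are assumptions derived from the `apt install` line in this README (BLAST+ is assumed to be detectable through its `blastn` binary), and the helper name `find_missing_tools` is hypothetical. Version requirements are not checked.

```python
import shutil

# Executable names are assumptions based on the packages named in this README;
# BLAST+ is assumed to be detected through its `blastn` binary.
REQUIRED_TOOLS = ["samtools", "bowtie2", "mash", "bcftools", "blastn", "seqtk"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external dependencies: " + ", ".join(missing))
    else:
        print("All external dependencies were found on PATH")
```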


# Installation

## Option 1: As a conda package
1. If you do not have conda environment, get and install `miniconda` or `anaconda`:

```wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh -b -p $HOME/miniconda
echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc
source ~/.bashrc```
2. Install conda package from `bioconda` channel
```conda install -c bioconda ectyper```
## Option 2: From the source directly
Second option is to install from the source.
1. Install dependencies. On Ubuntu distro run
```
apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk
```
1. Install python dependencies via `pip`:

```
pip3 install pandas biopython
```

1. Clone the repository or checkout a particular release (e.g v1.0.0, etc.):

```
git clone https://github.com/phac-nml/ecoli_serotyping.git
git checkout v1.0.0 #optionally checkout release version
```

1. Install ectyper: `python3 setup.py install`

# Basic Usage
# Getting started
## Basic Usage
1. Put the fasta/fastq files for serotyping analyses in one folder (concatenate paired raw reads files if you would like them to be considered a single entity)
1. `ectyper -i [file path] -o [output_dir]`
1. View the results on the console or with `cat [output folder]/output.tsv`

# Example Usage
* `ectyper -i ecoliA.fasta` for a single file
* `ectyper -i ecoliA.fasta -o output_dir` for a single file, results stored in `output_dir`
* `ectyper -i ecoliA.fasta,ecoliB.fastq,ecoliC.fna` for multiple files (comma-delimited)
* `ectyper -i ecoli_folder` for a folder (all files in the folder will be checked by the tool)
## Example Input Scenarios
* `ectyper -i ecoliA.fasta` for a single file (the output folder will be named using `ectyper_<date>_<time>` pattern)
* `ectyper -i ecoliA.fasta -o output_dir` for a single file, results stored in `output_dir` folder
* `ectyper -i ecoliA.fasta ecoliB.fastq ecoli_folder/` for multiple files and a directory separated by spaces
* `ectyper -i ecoliA.fasta ecoliB.fastq,ecoliC.fna`
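* `ectyper -i ecoli_folder` scans for input files in a folder (all files in the folder will be checked by the tool; the number of subdirectory levels searched is controlled by `--maxdirdepth`)
* `ectyper -i ecoli_folder/*.fasta` scans for FASTA input files in the folder only (subdirectories are not searched when the path contains the `*` wildcard)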

# Advanced Usage
## Advanced Usage
```
usage: ectyper [-h] [-V] -i INPUT [-c CORES] [-opid PERCENTIDENTITYOTYPE]
[-hpid PERCENTIDENTITYHTYPE] [-oplen PERCENTLENGTHOTYPE]
Expand Down Expand Up @@ -112,7 +149,8 @@ optional arguments:
Data/ectyper_database.json for more information
```

# Fine-tunning parameters

## Configuration and fine-tuning parameters
`ECTyper` requires minimal options to run (`-i` and `-o`) but allows for extensive configuration to accommodate a wide variety of typing scenarios

| Parameter| Explanation | Usage scenario |
Expand All @@ -125,8 +163,23 @@ optional arguments:
| `-r` | Specify a custom MASH sketch of reference genomes that will be used for species inference | User has a new assembled genome that is not available in the NCBI RefSeq database. Make sure to add metadata to `assembly_summary_refseq.txt` and provide a custom accession number that starts with the `GCF_` prefix|
|`--dbpath`| Provide custom appended database of O and H antigen reference alleles in JSON format following structure and field names as default database `ectyper_alleles_db.json` | User wants to add new alleles to the alleles database to improve typing performance |

# Data Input
Both raw and assembled reads are accepted in FASTA and FASTQ formats from any sequencing platform. The tool was designed for single sample inputs, but was shown to work on multi-taxa metagenomic raw reads FASTQ inputs.

# Quality Control (QC) module
# Data Output
The output of the tool is stored in text files, with the main report stored in the `output.tsv` tab-delimited text file.

The BLASTN hits against the O and H antigen database are stored in the `blastn_output_alleles.txt` tab-delimited file.

The log messages are stored in the `ectyper.log` text file.
```
{out folder name}
├── blastn_output_alleles.txt
├── ectyper.log
└── output.tsv
```
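
Because the main report is a plain tab-delimited file, it can be loaded with the Python standard library. The sketch below is illustrative only: the helper name `read_report` is hypothetical, and the only column name it relies on in the usage note is `Name`, which is the first column listed in the report format of this README.

```python
import csv
import os


def read_report(output_dir):
    """Load output.tsv from an ECTyper output folder as a list of dicts keyed by column name."""
    report_path = os.path.join(output_dir, "output.tsv")
    with open(report_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, `[row["Name"] for row in read_report("ectyper_results")]` would list the sample names, assuming the run wrote its results to an `ectyper_results` folder.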

## Quality Control (QC) module
To provide an easier interpretation of the results and typing metrics, the following QC codes were developed.
These codes allow quick filtering of "reportable" and "non-reportable" samples. The QC module is tightly linked to the ECTyper allele database, specifically the `MinPident` and `MinPcov` fields.
For each reference allele minimum `%identity` and `%coverage` values were determined as a function of potential "cross-talk" between antigens (i.e. multiple potential antigen calls at a given setting).
Expand All @@ -144,7 +197,7 @@ The QC module covers the following serotyping scenarios. More scenarios might be
|WARNING (H NON-REPORT)|H antigen alleles do not meet min %id or %cov thresholds|
|WARNING (O and H NON-REPORT)| Both O and H antigen alleles do not meet min %identity or %coverage thresholds|
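
The QC codes lend themselves to simple downstream filtering. The snippet below is a minimal sketch under one loudly stated assumption: only codes beginning with `PASS` are treated as reportable, which matches the codes visible in this README but may not cover every QC scenario. The function name `is_reportable` is hypothetical.

```python
def is_reportable(qc_code):
    """Return True only for QC codes that start with PASS, e.g. 'PASS (REPORTABLE)'.

    Assumption: every non-PASS code (such as the WARNING codes) is non-reportable.
    """
    return qc_code.strip().upper().startswith("PASS")


qc_codes = [
    "PASS (REPORTABLE)",
    "WARNING (H NON-REPORT)",
    "WARNING (O and H NON-REPORT)",
]
reportable = [code for code in qc_codes if is_reportable(code)]
print(reportable)  # ['PASS (REPORTABLE)']
```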

# Report format
## Report format
`ECTyper` capitalizes on a concise minimum output coupled to easy results interpretation and reporting. `ECTyper v1.0` serotyping results are available in a tab-delimited `output.tsv` file consisting of the 16 columns listed below:

1. **Name**: Sample name (usually a unique identifier)
Expand Down Expand Up @@ -173,6 +226,24 @@ Selected columns from the `ECTyper` typical report are shown below.
EC20151709|Escherichia coli|O157:H43|Based on 3 allele(s)|PASS (REPORTABLE)|wzx:1;wzy:0.999;fliC:1|O157-5-wzx-origin;O157-9-wzy-origin;H43-1-fliC-origin;|100;99.916;99.934; | 100;100;100; | contig00002;contig00002;contig00003; | 62558-63949;64651-65835;59962-61467; | 1392;1185;1506; |v1.0 (2020-05-07) | - |


FAQs

## FAQ

**Can ECTyper be run on multiple samples in a directory?**

ECTyper provides flexible ways to specify inputs stored in different locations. One can provide multiple paths to several directories separated by spaces. In addition, one can specify the file type to look for in a given directory or directories. Note that a path containing the star `*` symbol only matches files in the specified directory and does not look in subdirectories. For example,

- Process all files in `folder1` and `folder2` directories and file `sample.fasta` located in `folder3`

`ectyper -i folder1/ folder2/ folder3/sample.fasta -o ectyper_results`
- Process all fasta files in `folder1` and all fastq files in `folder2`. All sub-directories in those 2 folders will be ignored. To process those sub-folders either specify path to them or provide paths to directories without the `*` wildcard symbol.

`ectyper -i folder1/*.fasta folder2/*.fastq`
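
The difference between a wildcard path and a plain directory path can be illustrated outside of ECTyper. The sketch below is not ECTyper code: it emulates wildcard expansion with Python's `glob` and directory scanning with `os.walk` on a throwaway folder tree, and the helper names are hypothetical. In ECTyper itself, how far a directory scan descends is additionally bounded by `--maxdirdepth`.

```python
import glob
import os
import tempfile


def files_from_wildcard(pattern):
    """Files matched by a wildcard pattern such as folder1/*.fasta (no subdirectories)."""
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


def files_from_directory(directory):
    """All files below a directory, including those in subdirectories."""
    found = []
    for root, _dirs, files in os.walk(directory):
        found.extend(os.path.join(root, name) for name in files)
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    # Build folder1/a.fasta and folder1/sub/b.fasta as empty placeholder files
    os.makedirs(os.path.join(tmp, "folder1", "sub"))
    open(os.path.join(tmp, "folder1", "a.fasta"), "w").close()
    open(os.path.join(tmp, "folder1", "sub", "b.fasta"), "w").close()

    wildcard_hits = files_from_wildcard(os.path.join(tmp, "folder1", "*.fasta"))
    directory_hits = files_from_directory(os.path.join(tmp, "folder1"))
    print(len(wildcard_hits), len(directory_hits))  # 1 2
```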

**Why does ECTyper sometimes report O-antigen results separated by a forward slash `/`?**

Some O-antigens display a very high degree of homology and are very hard to discern even with wet-lab agglutination assays. Even when both the `wzx` and `wzy` genes are used, it is not possible to reliably resolve those O-antigens. The 16 high similarity groups were identified by [Joensen, Katrine G., et al.](https://journals.asm.org/doi/full/10.1128/jcm.00008-15). Thus, if a given O-antigen is a member of any of those high similarity groups, all potential O-antigens are reported separated by `/`, such as group 9 reported as `O17/O44/O73/O77/O106`.
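
The slash-separated reporting can be sketched as a simple group lookup. This is an illustration rather than ECTyper's actual implementation: only three of the 16 groups are included (group 9 from the text above, groups 15 and 16 as they appear in `ectyper/definitions.py` in this commit), and the names `HIGH_SIMILARITY_GROUPS` and `report_o_antigen` are hypothetical.

```python
# Partial, illustrative table: only 3 of the 16 high similarity groups are listed.
HIGH_SIMILARITY_GROUPS = {
    "9": ["O17", "O44", "O73", "O77", "O106"],
    "15": ["O89", "O101", "O162"],
    "16": ["O169", "O183"],
}


def report_o_antigen(o_antigen, groups=HIGH_SIMILARITY_GROUPS):
    """Return the slash-joined group if the antigen belongs to one, else the antigen itself."""
    for members in groups.values():
        if o_antigen in members:
            return "/".join(members)
    return o_antigen


print(report_o_antigen("O44"))   # O17/O44/O73/O77/O106
print(report_o_antigen("O157"))  # O157
```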


# Availability
Expand All @@ -188,3 +259,14 @@ EC20151709|Escherichia coli|O157:H43|Based on 3 allele(s)|PASS (REPORTABLE)|wzx:
|[Galaxy Europe](https://usegalaxy.eu/root?tool_id=ectyper)| Galaxy public server to execute your analysis from anywhere|Web-based|
|[IRIDA plugin](https://github.com/phac-nml/irida-plugin-ectyper)| IRIDA instances could easily install additional pipeline|Web-based|

# Legal and Compliance Information

Copyright Government of Canada 2024

Written by: National Microbiology Laboratory, Public Health Agency of Canada

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
2 changes: 1 addition & 1 deletion ectyper/commandLineOptions.py
Expand Up @@ -63,7 +63,7 @@ def checkdbversion():

parser.add_argument(
"--maxdirdepth",
help="Maximum number of directories to descend when searching an input directory of files",
help="Maximum number of directories to descend when searching an input directory of files [default %(default)s levels]. Only works on path inputs not containing '*' wildcard",
default=0,
type=int,
required=False
Expand Down
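
For readers unfamiliar with the `%(default)s` placeholder used in the new help string: `argparse` substitutes it with the argument's default value when the help text is rendered. A minimal standalone sketch follows. The parser name `demo` and the shortened help text are assumptions made for illustration and are not ECTyper's actual parser.

```python
import argparse

# Standalone demo parser; not the ECTyper command-line parser.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument(
    "--maxdirdepth",
    help="Directory levels to descend [default %(default)s]",
    default=0,
    type=int,
    required=False,
)

# format_help() renders the help text with %(default)s replaced by 0
help_text = parser.format_help()
print(help_text)

# An explicit value overrides the default
args = parser.parse_args(["--maxdirdepth", "2"])
print(args.maxdirdepth)
```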
2 changes: 1 addition & 1 deletion ectyper/definitions.py
Expand Up @@ -32,7 +32,7 @@
'15':['O89','O101','O162'],
'16':['O169','O183']
}
MASH_URLS = ["https://drive.usercontent.google.com/download?id=1p0XVb7PuiApYk5ndjLksIc3RcDmUwi6L&export=download&confirm=f"]
MASH_URLS = ["https://zenodo.org/records/10211569/files/EnteroRef_GTDBSketch_20231003_V2.msh?download=1"]

HIGH_SIMILARITY_THRESHOLD_O = 0.00771 # alleles that are 99.23% apart will be reported as mixed call ~ 8 nt difference on average
MIN_O_IDENTITY_LS = 95 #low similarity group O antigen min identity threshold to pre-filter BLAST output (identical to global threshold)
Expand Down
20 changes: 11 additions & 9 deletions ectyper/ectyper.py
Expand Up @@ -66,7 +66,7 @@ def run_program():
args = commandLineOptions.parse_command_line()


output_directory = create_output_directory(args.output)
output_directory = create_output_directory(args)

# Create a file handler for log messages in the output directory for the root thread
fh = logging.FileHandler(os.path.join(output_directory, 'ectyper.log'), 'w', 'utf-8')
Expand Down Expand Up @@ -121,6 +121,7 @@ def run_program():
os.makedirs(temp_dir, exist_ok=True)

LOG.info("Gathering genome files list ...")

input_files_list = genomeFunctions.get_files_as_list(args.input, args.maxdirdepth)
raw_genome_files = decompress_gunzip_files(input_files_list, temp_dir)

Expand Down Expand Up @@ -256,9 +257,9 @@ def getOantigenHighSimilarGroup(final_predictions, sample):



def create_output_directory(output_dir):
def create_output_directory(args):
"""
Create the output directory for ectyper
Create the output directory for ectyper if does not exist already
:param output_dir: The user-specified output directory, if any
:return: The output directory
Expand All @@ -267,26 +268,27 @@ def create_output_directory(output_dir):



if output_dir is None:
if args.output is None:
date_dir = ''.join([
'ectyper_',
str(datetime.datetime.now().date()),
'_',
str(datetime.datetime.now().time()).replace(':', '.')
])
out_dir = os.path.join(definitions.WORKPLACE_DIR, date_dir)
args.output = out_dir
else:
if os.path.isabs(output_dir):
out_dir = output_dir
if os.path.isabs(args.output):
out_dir = args.output
else:
out_dir = os.path.join(definitions.WORKPLACE_DIR, output_dir)
out_dir = os.path.join(definitions.WORKPLACE_DIR, args.output)

if not os.path.exists(out_dir):
os.makedirs(out_dir)

# clean previous ECTyper output files if the directory was used in previous runs
for file in definitions.OUTPUT_FILES_LIST:
path2file = os.path.join(output_dir,file)
path2file = os.path.join(out_dir,file)
if os.path.exists(path2file):
LOG.info(f"Cleaning ECTyper previous files. Removing previously generated {path2file} ...")
os.remove(path2file)
Expand Down
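
The path resolution performed by `create_output_directory` in the hunk above can be summarized in a side-effect-free sketch. This is a simplified illustration, not the function from the commit: `resolve_output_directory` is a hypothetical name, `base_dir` stands in for `definitions.WORKPLACE_DIR` (whose value is not shown in this diff), and directory creation and cleanup of previous output files are left out.

```python
import datetime
import os


def resolve_output_directory(output, base_dir):
    """Return the output path for the three cases handled in the diff.

    output is None      -> base_dir/ectyper_<date>_<time>
    output is absolute  -> used unchanged
    output is relative  -> joined onto base_dir
    """
    if output is None:
        now = datetime.datetime.now()
        # Same naming pattern as the diff: ':' in the time is replaced by '.'
        name = "ectyper_{}_{}".format(now.date(), str(now.time()).replace(":", "."))
        return os.path.join(base_dir, name)
    if os.path.isabs(output):
        return output
    return os.path.join(base_dir, output)
```

The bug fixed in this commit sits one step later: the cleanup loop previously joined file names onto the raw `output_dir` argument rather than the resolved `out_dir`, so a relative or missing `-o` value pointed the cleanup at the wrong location.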
17 changes: 7 additions & 10 deletions ectyper/genomeFunctions.py
Expand Up @@ -5,7 +5,7 @@
'''

import logging
import os
import os, glob
import tempfile
from tarfile import is_tarfile
from Bio import SeqIO
Expand All @@ -28,7 +28,7 @@ def get_files_as_list(files_or_directories, max_depth_level):
directory specified (where each file name is its absolute path).
Args:
file_or_directory (str): file or directory name given on commandline
file_or_directory (str): file or directory name given on command line
Returns:
files_list (list(str)): List of all the files found.
Expand All @@ -38,11 +38,9 @@ def get_files_as_list(files_or_directories, max_depth_level):


init_min_dir_level = min([os.path.abspath(p).count(os.sep)+1 if os.path.isdir(p) else os.path.abspath(p).count(os.sep) for p in files_or_directories])

for file_or_directory in sorted([os.path.abspath(p) for p in files_or_directories if len(p) != 0]):

dir_level_current = get_relative_directory_level(file_or_directory, init_min_dir_level)

if dir_level_current > max_depth_level:
LOG.info(f"Directory level exceeded ({dir_level_current} > {max_depth_level}), skipping {file_or_directory} ...")
continue
Expand All @@ -53,9 +51,9 @@ def get_files_as_list(files_or_directories, max_depth_level):
# Create a list containing the file names
for root, dirs, files in os.walk(os.path.abspath(file_or_directory)):
dir_level = get_relative_directory_level(root, init_min_dir_level)
LOG.info(f"In '{root}' level {dir_level} identified {len(dirs)} sub-directory(ies) and {len(files)} file(s) ...")
if dir_level > max_depth_level:
continue
LOG.info(f"In '{root}' level {dir_level} identified {len(dirs)} sub-directory(ies) and {len(files)} file(s) ...")
for filename in files:
files_list.append(os.path.join(root, filename))
# check if input is concatenated file locations separated by , (comma)
Expand All @@ -73,7 +71,6 @@ def get_files_as_list(files_or_directories, max_depth_level):
LOG.info(f"Total of {len(files_list)} files identified with a valid path and {missing_inputs_count} are missing ...")
# a path to a file is specified
else:
LOG.info("Checking existence of file " + file_or_directory)
input_abs_file_path = os.path.abspath(file_or_directory)
if os.path.exists(input_abs_file_path):
files_list.append(os.path.abspath(input_abs_file_path))
Expand All @@ -82,9 +79,9 @@ def get_files_as_list(files_or_directories, max_depth_level):


if not files_list:
LOG.critical("No files were found for the ectyper run")
LOG.critical("No files were found for the ectyper to run on")
raise FileNotFoundError("No files were found to run on")
LOG.info(f"Overall identified {len(files_list)} file(s) to process ...");
LOG.info(f"Overall identified {len(files_list)} file(s) ({','.join([os.path.basename(f) for f in files_list])}) to process ...");
sorted_files = sorted(list(set(files_list)))
LOG.debug(sorted_files)
return sorted_files
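
The depth-limited directory scan in `get_files_as_list` can be illustrated with a smaller standalone sketch. Note the differences, which are deliberate simplifications: the hypothetical `walk_with_depth_limit` below measures depth relative to a single top directory and prunes `os.walk` in place, whereas the code in the diff computes levels relative to the shallowest input path and skips deeper directories with `continue`.

```python
import os


def walk_with_depth_limit(top, max_depth):
    """Collect files under `top`, descending at most `max_depth` levels below it."""
    top = os.path.abspath(top)
    base_level = top.count(os.sep)
    collected = []
    for root, dirs, files in os.walk(top):
        depth = root.count(os.sep) - base_level
        if depth >= max_depth:
            dirs[:] = []  # prune in place so os.walk does not descend further
        collected.extend(os.path.join(root, name) for name in files)
    return sorted(collected)
```

With `max_depth=0` only files directly inside `top` are returned, which mirrors the default of `--maxdirdepth` shown in `commandLineOptions.py`.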
Expand Down Expand Up @@ -402,7 +399,7 @@ def create_combined_alleles_and_markers_file(alleles_fasta, temp_dir, pathotype)
"""

combined_file = os.path.join(temp_dir, 'combined_ident_serotype.fasta')
LOG.info("Creating combined serotype and identification fasta file")
LOG.info(f"Creating combined reference database fasta file at {combined_file} ...")

with open(combined_file, 'w') as ofh:
#with open(definitions.ECOLI_MARKERS, 'r') as mfh:
Expand Down
