Skip to content

Commit

Permalink
Merge branch 'main' into main
Browse files Browse the repository at this point in the history
  • Loading branch information
jennahamlin authored Oct 5, 2023
2 parents 9568b0d + b5c6791 commit 8b1363f
Show file tree
Hide file tree
Showing 6 changed files with 306 additions and 63 deletions.
59 changes: 51 additions & 8 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# el_gato
**E**pidemiology of ***L**egionella* : **G**enome-b**A**sed **T**yping:

[Add two sentence summary of what el gato is]
El_gato is a bioinformatics tool that utilizes either a genome assembly (.fasta) or Illumina paired-end reads (.fastq) to replicate *Legionella pneumophila* Sequence Based Typing (SBT). From the input, 7 loci (*flaA*, *pilE*, *asd*, *mip*, *mompS*, *proA*, *neuA/neuAh*) are identified and compared to a database of sequence types. The sequence type provided for each input sample is based on the unique combination of the allelic identities of the 7 target loci.

* [Installation](#installation)
* [Method 1: using conda](#method-1-using-conda)
Expand All @@ -21,8 +21,10 @@
* [identified_alleles.fna](#identified_allelesfna)
* [run.log](#runlog)
* [reads_vs_all_ref_filt_sorted.bam](#reads_vs_all_ref_filt_sortedbam-reads-only)
* [report.json](#reportjson)
* [How does el_gato work?](#approach)
* [Using Nextflow](#using-nextflow)
* [Reporting Module](#reporting-module)

Codebase stage: development
Developers and maintainers, Testers: [Andrew Conley](https://github.com/abconley), [Lavanya Rishishwar](https://github.com/lavanyarishishwar), [Emily T. Norris](https://github.com/norriset), [Anna Gaines](https://github.com/annagaines), [Will Overholt](https://github.com/waoverholt/), [Dev Mashruwala](https://github.com/dmashruwala), [Alan Collins](https://github.com/Alan-Collins)
Expand Down Expand Up @@ -72,6 +74,7 @@ Usage information printed when running el_gato.py with `-h` or `--help`.
usage: el_gato.py [--read1 Read 1 file] [--read2 Read 2 file] [--assembly Assembly file] [--help]
[--threads THREADS] [--depth DEPTH] [--out OUT] [--sample SAMPLE] [--overwrite] [--sbt SBT]
[--suffix SUFFIX] [--profile PROFILE] [--verbose] [--header] [--length LENGTH] [--sequence SEQUENCE]
[--samfile SAMFILE]
Legionella in silico sequence-based typing (SBT) script.
Requires paired-end reads files or a genome assembly.
Expand All @@ -96,7 +99,7 @@ Optional arguments:
--threads THREADS -t THREADS
Number of threads to run the programs (default: 1)
--depth DEPTH -d DEPTH
Variant read depth cutoff (default: 3)
Variant read depth cutoff (default: 10)
--out OUT -o OUT
Output folder name (default: out)
--sample SAMPLE -n SAMPLE
Expand All @@ -123,6 +126,10 @@ Optional arguments:
Specify BLAST hit percent identity threshold for
identifying multiple loci in an assembly
(default: 95.0)
--samfile SAMFILE -m SAMFILE
Allows the user to include the reads_vs_all_ref_filt.sam
file to be included in the output directory
(default: False)
```

# Input and Output
Expand Down Expand Up @@ -151,13 +158,15 @@ The ST column can contain two kinds of values. If the identified ST corresponds

The corresponding allele number is reported for each gene if an exact allele match is found in the database. Alternatively, el_gato may also note the following symbols:

[Table needs to be updated, but output does not match table in ppt]

| Symbol | Meaning |
|:------:|:---------|
|Novel ST | Novel Sequence Type: All 7 target genes were found, but not present in the profile - most likely a novel sequence type. |
|Novel ST* | Novel Sequence Type due to novel allele: One or multiple target genes have a novel allele found. |
|MD- | Missing Data: ST is unidentifiable as a result of or more of the target genes that are unidentifiable. |
|MA? | Multiple Alleles: ST is ambiguous due to multiple alleles that could not be resolved. |
| NAT | Novel Allele Type: BLAST cannot find an exact allele match - most likely a new allele. |
| - | Missing Data: Both percent and length identities are too low to return a match or N's in sequence. |
| ? | Multiple alleles: More than one allele was found and could not be resolved. |
| ? | Multiple Alleles: More than one allele was found and could not be resolved. |

If symbols are present in the ST profile, the other output files produced by el_gato will provide additional information to understand what is being communicated.

Expand All @@ -179,7 +188,7 @@ Headers are included in outputs for the samtools coverage command and blast resu
|:-------------:|:---------------------------------------------------:|
| rname | Locus name |
| numreads | Number reads aligned to the region (after filtering)|
| covbases | Number of covered bases with depth >= 1 |
| covbases | Number of covered bases with depth >= 10 |
| coverage | Percentage of covered bases [0..100] |
| meandepth | Mean depth of coverage |
| meanbaseq | Mean baseQ in covered region |
Expand Down Expand Up @@ -216,6 +225,17 @@ el_gato maps the provided reads to [a set of reference sequences in the el_gato

**Note:** A SAM file is also present, which has the same information as in the BAM file.

### report-json
Each sample outputs a json file that contains relevant information about the run that will be included in the report PDF.

Summary page metadata: Complete MLST profile of the sample and the abbreviation key for the symbols.

Run-specific data:

Paired-end reads: Locus coverage information and *mompS* primer information.

Assembly: BLAST hit length and sequence identity thresholds and locus location information.

# Approach

At its core, el_gato uses BLAST to identify the closest match to each allele in your input data. For the loci *flaA*, *pilE*, *asd*, *mip*, and *proA*, this process is straight forward. Whereas loci *mompS* and *neuA/neuAh* require more involved processing, with neuA/neuAh being an issue only when processing reads—the specifics of these loci are discussed in the corresponding sections below.
Expand Down Expand Up @@ -264,7 +284,7 @@ The sequence of the two copies of *mompS* and the identity of the correct allele

a. Only one allele has associated reads with primer *mompS*-1116R correctly oriented.

b. One allele has more than three times [still valid?] as many reads with correctly oriented primer as the other.
b. One allele has more than three times as many reads with correctly oriented primer as the other.

c. One allele has no associated reads with the primer *mompS*-1116R in either orientation, but the other allele has associated reads with the primer in only the wrong orientation. In this case, the allele with no associated reads with the primer in either orientation is considered the primary locus by the process of elimination.

Expand All @@ -288,4 +308,27 @@ nextflow run_el_gato.nf --reads_dir <path/to/reads/directory> --threads <threads
nextflow run_el_gato.nf --assembly_dir <path/to/assemblies/directory> --threads <threads> --out <path/to/output/directory> -profile singularity -c nextflow.config
```

**Note:** To run nextflow without the singularity container, uncomment conda environment installation on line 10 and line 47 of the run_el_gato.nf file.
**Note:** To run nextflow without the singularity container, uncomment conda environment installation on line 10 and line 47 of the run_el_gato.nf file and use the following commands:

```
# Reads
nextflow run_el_gato.nf --reads_dir <path/to/reads/directory> --threads <threads> --out <path/to/output/directory>
# Assemblies
nextflow run_el_gato.nf --assembly_dir <path/to/assemblies/directory> --threads <threads> --out <path/to/output/directory>
```
## Output files for Nextflow
At the completion of a run, the specified output directory (default: el_gato_out/) will contain a file named "all_mlst.txt" (the MLST profile of each sample) and one directory for each sample processed. Each sub-directory is named with a sample name and contains output files specific to that sample. These files include the el_gato log file and files providing more details about the sequences identified in the sample.

Additionally, the specified output directory will contain a combined json file (report.json) that contains all of the data from the individual sample-level json files and the report PDF (report.pdf).

# Reporting Module
## Dependencies
* [fpdf2](https://github.com/py-pdf/fpdf2)

We provide a script that generates a PDF report of each el_gato run using the report.json file generated in the output folder for each sample.
This report generated by Nextflow by default, but must be run manually if running el_gato for individual samples. If run manually, the report PDF
will output to the current working directory.
```
elgato_report.py <path/to/output/directory>/report.json
```
7 changes: 4 additions & 3 deletions conda-recipe/meta.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
{% set name = "elgato" %}
{% set version = "1.11.1" %}
{% set version = "1.14.4" %}

package:
name: {{ name|lower }}
Expand All @@ -17,16 +17,17 @@ requirements:
- skip: True # [win]

host:
- python >=3.8
- python >=3.8,<3.12
- pip

run:
- python >=3.8
- python >=3.8,<3.12
- minimap2==2.24
- samtools==1.15.1
- blast==2.13.0
- ispcr==33.0
- nextflow
- fpdf2

# build:
# noarch: python
Expand Down
Loading

0 comments on commit 8b1363f

Please sign in to comment.