From 98c47b08c6df891ca377d06588c211364eb9653b Mon Sep 17 00:00:00 2001 From: avantonder Date: Wed, 15 Nov 2023 17:33:26 +0000 Subject: [PATCH] update 151123 --- materials/02-know_your_bug.md | 6 + materials/03-nextflow.md | 2 - materials/05-intro_tb.md | 2 +- materials/07-bacqc.md | 12 + materials/09-bactmap.md | 31 -- materials/17-intro_staph.md | 4 + materials/18-de_novo_assembly.md | 1 - materials/21-assembly_qc.md | 61 ++-- materials/29-pathogenwatch.md | 12 +- materials/31-intro_amr.md | 387 ++++++++++++++++++++- materials/appendices/02-course_software.md | 8 + 11 files changed, 460 insertions(+), 66 deletions(-) diff --git a/materials/02-know_your_bug.md b/materials/02-know_your_bug.md index e68ce8b..d49f9f0 100644 --- a/materials/02-know_your_bug.md +++ b/materials/02-know_your_bug.md @@ -13,6 +13,12 @@ Before starting your analysis, it's important to understand the characteristics ![Flowchart showing different approaches to building a bacterial phylogenetic tree](images/know_your_bug.jpeg) +This week, we're going to work with three different bacterial species that each require a different analysis path through the diagram above. + +- _Mycobacterium tuberculosis_ - a monoclonal (i.e. relatively little genomic diversity) organism that doesn't recombine so we'll build a phylogenetic tree by mapping to a reference. +- _Staphylococcus aureus_ - a diverse organism with low levels of recombination. We're going to be working with a dataset consisting of more than a single lineage so we'll take the approach of building a core gene alignment to use to infer our phylogenetic tree. +- _Streptococcus pneumoniae_ - a diverse organism that frequently recombines. As the dataset consists of isolates from a single lineage/serotype, we're going to map to a reference and identify and remove recombinant regions before building our phylogenetic tree. + ## Summary ::: {.callout-tip} diff --git a/materials/03-nextflow.md b/materials/03-nextflow.md index e1fd9c2..5053054 100644 --- a/materials/03-nextflow.md +++ b/materials/03-nextflow.md @@ -100,8 +100,6 @@ Nextflow provides a separation between the pipeline’s functional logic and the Nextflow provides out-of-the-box support for major batch schedulers and cloud platforms such as Sun Grid Engine, SLURM job scheduler, AWS Batch service and Kubernetes. A full list can be found [here](https://www.nextflow.io/docs/latest/executor.html). -## Nextflow Tower/Seqera - ## Snakemake In this tutorial we've focused on Nextflow but many people in the bioinformatics community use Snakemake. Similar to Nextflow, the Snakemake workflow management system is a tool for creating reproducible and scalable data analyses. The main difference is that workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment. diff --git a/materials/05-intro_tb.md b/materials/05-intro_tb.md index d03c001..d6ec220 100644 --- a/materials/05-intro_tb.md +++ b/materials/05-intro_tb.md @@ -21,7 +21,7 @@ Increasingly, *M. tuberculosis* is resistant to many of the frontline antimicrob ## Course dataset -We will be analysing a dataset of XX Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course and for time, we're only going to analyse 50 genomes from the dataset. +We will be analysing a dataset of Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course, mainly to save time, we're only going to analyse 50 genomes from the dataset. ## MTBC ancestral reference sequence diff --git a/materials/07-bacqc.md b/materials/07-bacqc.md index e18c671..386a468 100644 --- a/materials/07-bacqc.md +++ b/materials/07-bacqc.md @@ -304,6 +304,18 @@ With respect to potential species contamination, no samples contained more than ::: ::: +:::{.callout-warning} + +**The Nextflow work directory** + +Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory. This means that if, for whatever reason, the pipeline doesn't finish successfully you can resume it. However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it: + +```bash +rm -rf work +``` + +::: + ## Summary ::: {.callout-tip} diff --git a/materials/09-bactmap.md b/materials/09-bactmap.md index dd2bd13..0008b21 100644 --- a/materials/09-bactmap.md +++ b/materials/09-bactmap.md @@ -161,29 +161,6 @@ This section of the report shows the software run as part of nf-core/bactmap and ![nf-core/bactmap MultiQC software versions](images/bactmap_software_versions.png) -### The pseudogenomes directory - -This directory contains the files that are most useful for our downstream analyses. Change to the directory and list the files: - -```bash -cd bactmap_results/pseudogenomes - -ls -``` - -You will see the pseudogenome fasta files (a version of the reference file where the sample variants have been inserted and low-quality or missing data has been masked) for each sample, an alignment of all the samples and the reference sequence (`aligned_pseudogenomes.fas`), and a tsv file of genomes removed from the complete alignment due to poor mapping (`low_quality_pseudogenomes.tsv`): - -```bash -aligned_pseudogenomes.fas -low_quality_pseudogenomes.tsv -``` - -Let's check to see if any of our samples were removed from our alignment: - -```bash -cat low_quality_pseudogenomes.tsv -``` -This returns an empty file; all of our samples mapped well enough to the reference to be included in our complete alignment. That means we can proceed to the next step of our analysis: phylogenetic tree inference. ## Check how much of the reference was mapped @@ -218,14 +195,6 @@ remove_blocks_from_aln.py -a aligned_pseudogenomes.fas -t ../../../resources/mas bash scripts/04-mask_pseudogenome.sh ``` -### The work directory - -Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory. This means that if, for whatever reason, the pipeline doesn't finish successfully you can resume it. However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it: - -```bash -rm -rf work -``` - ## Summary ::: {.callout-tip} diff --git a/materials/17-intro_staph.md b/materials/17-intro_staph.md index 5f69cc4..01af774 100644 --- a/materials/17-intro_staph.md +++ b/materials/17-intro_staph.md @@ -9,6 +9,10 @@ title: "Introduction to *Staphylococcus aureus*" ::: +## Course dataset + +van Tonder 2023 + ## Summary ::: {.callout-tip} diff --git a/materials/18-de_novo_assembly.md b/materials/18-de_novo_assembly.md index ea612d6..f5321bb 100644 --- a/materials/18-de_novo_assembly.md +++ b/materials/18-de_novo_assembly.md @@ -6,7 +6,6 @@ title: "de novo Assembly" ## Learning Objectives - Understand what *de novo* genome assembly is and how it differs from reference-based assembly. -- Assemble sequence data with `assembleBAC` ::: diff --git a/materials/21-assembly_qc.md b/materials/21-assembly_qc.md index b02f7fa..c3e4aa5 100644 --- a/materials/21-assembly_qc.md +++ b/materials/21-assembly_qc.md @@ -37,18 +37,17 @@ Let's now turn to some of the other metrics to help us assess our assemblies' qu ## Contiguity -One of the outputs from running assembleBAC is a summary file containing the `Quast` outputs for each sample. This file can be found in `results/assembleBAC/metadata/transposed_report.csv`. +One of the outputs from running assembleBAC is a summary file containing the `Quast` outputs for each sample. This file can be found in `results/assembleBAC/metadata/transposed_report.tsv`. -You can open it with a spreadsheet software such as _Excel_ from our file browser . -Alternatively you can print it directly on the terminal in a nice format using the `column` command, as shown here for the samples we've been using as an example ((for brevity, we only show some columns of most interest):): - -```bash -column -t -s "," results/assembleBAC/metadata/transposed_report.csv -``` +You can open it with a spreadsheet software such as _Excel_ from our file browser (for brevity, we only show the columns of most interest): ``` Assembly # contigs (>= 0 bp) # contigs (>= 1000 bp) Total length (>= 0 bp) Largest contig N50 -CTMA_1432 2 2 4103466 3057934 3057934 +ERX3876932_ERR3864879_T1_contigs 77 40 2848635 484893 109559 +ERX3876949_ERR3864896_T1_contigs 70 18 2683262 493468 251833 +ERX3876930_ERR3864877_T1_contigs 84 20 2729135 557153 251805 +ERX3876908_ERR3864855_T1_contigs 45 13 2717933 792936 707553 +ERX3876945_ERR3864892_T1_contigs 30 9 2670961 1351816 1351816 ``` The columns are: @@ -61,10 +60,10 @@ The columns are: - **N50** - a metric indicating the length of the shortest fragment, from the group of fragments that together represent at least 50% of the total genome. A higher N50 value suggests better contig lengths. To interpret these statistics, it helps to compare them with other well-assembled _Staphylococcus aureus_ genomes. -For example, let's take the first MRSA genome that was sequenced, [XXXX]('NCBI link'), as our reference for comparison. -This genome is X Mb long and is composed of a single chromosome. +For example, let's take the first MRSA genome that was sequenced, [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/), as our reference for comparison. +This genome is 2.8 Mb long and is composed of a single chromosome. -We can see that all of our assemblies reached a total length of around XX to XX Mb, which matches the expected length from our reference genome. +We can see that all of our assemblies reached a total length of around 2.7 to 2.9 Mb, which matches the expected length from our reference genome. This indicates that we managed to assemble most of the expected genome. However, we can see that there is a variation in the number of fragments in the final assemblies (i.e. their contiguity). For instance, isolates X and X were assembled to a small number of fragments each, suggesting good assemblies. For several other isolates our assemblies were more fragmented, in particular X and X, which had hundreds of fragments. @@ -72,11 +71,15 @@ This indicates less contiguous sequences. ## Completeness -We now turn to assessing **genome completeness**, i.e. whether we managed to recover most of the known _Staphylococcus aureus_ genome, or whether we have large fractions missing. We can assess this by using [`CheckM2`](https://github.com/chklovski/CheckM2) which was run as part of the assembleBAC pipeline. This tool assesses the completeness of the assembled genomes based on other similar organisms in public databases, in addition to contamination scores. The output file from assembleBAC summarising the `CheckM2` results is a tab-delimited file called `checkm2_summary.tsv`. This file can be found in `results/assembleBAC/metadata/` and can be opened in a spreadsheet software such as _Excel_. Here is an example result (for brevity, we only show some columns of most interest): +We now turn to assessing **genome completeness**, i.e. whether we managed to recover most of the known _Staphylococcus aureus_ genome, or whether we have large fractions missing. We can assess this by using [`CheckM2`](https://github.com/chklovski/CheckM2) which was run as part of the assembleBAC pipeline. This tool assesses the completeness of the assembled genomes based on other similar organisms in public databases, in addition to contamination scores. The output file from assembleBAC summarising the `CheckM2` results is a tab-delimited file called `checkm2_summary.tsv`. This file can be found in `results/assembleBAC/metadata/` and can be opened in a spreadsheet software such as _Excel_. Here is an example result (for brevity, we only show the columns of most interest): ``` -Name Completeness Contamination Genome_Size GC_Content Total_Coding_Sequences -isolate01 92.51 5.19 4192535 0.48 5375 +Name Completeness Contamination Genome_Size GC_Content Total_Coding_Sequences +ERX3876905_ERR3864852_T1_contigs 100 0.22 2743298 0.33 2585 +ERX3876907_ERR3864854_T1_contigs 100 0.49 2900162 0.33 2765 +ERX3876908_ERR3864855_T1_contigs 100 0.07 2717933 0.33 2534 +ERX3876909_ERR3864856_T1_contigs 100 0.17 2854371 0.33 2711 +ERX3876929_ERR3864876_T1_contigs 100 0.1 2675834 0.33 2464 ``` These columns indicate: @@ -84,15 +87,13 @@ These columns indicate: - **Completeness** - how complete our genome was inferred to be as a percentage; this is based on the machine learning models used and the organisms present in the database. - **Contamination** - the percentage of the assembly estimated to be contaminated with other organisms (indicating our assembly isn't "pure"). - **Genome_Size** - how big the genome is estimated to be, based on other similar genomes present in the database. - The [XXXX]('NCBI link') is X Mb in total, so these values make sense. + The [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/) is 2.8 Mb in total, so these values make sense. - **GC_Content** - the percentage of G's and C's in the genome, which is relatively constant within a species. - The _S. aureus_ GC content is X%, so again these values make sense. + The _S. aureus_ GC content is approximately 33%, so again these values make sense. - **Total_Coding_Sequences** - the total number of coding sequences (genes) that were identified by _CheckM2_. - The [reference genome]('NCBI link') indicates annotated genes, so the values obtained could be overestimated. - -From this analysis, we can assess that our genome assemblies are good quality, with around 90% of the genome assembled. -Despite earlier assessing that some of the isolates have more fragmented assemblies (e.g. isolates 5 and 9), they still have >85% completeness according to _CheckM2_. + The [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/) indicates annotated genes, so the values obtained could be overestimated. +From this analysis, we can assess that our genome assemblies are good quality, with 100% of the genome assembled for all our isolates. It is worth noting that the assessment from _CheckM2_ is an approximation based on other genomes in the database. In diverse species such as _S. aureus_ the completeness may be underestimated, because each individual strain will only carry part of the _pangenome_ for that species. @@ -123,14 +124,16 @@ We opened this table in _Excel_ and this is what we obtained: ``` Assembly # contigs (>= 0 bp) # contigs (>= 1000 bp) # contigs (>= 5000 bp) # contigs (>= 10000 bp) # contigs (>= 25000 bp) # contigs (>= 50000 bp) Total length (>= 0 bp) Total length (>= 1000 bp) Total length (>= 5000 bp) Total length (>= 10000 bp) Total length (>= 25000 bp) Total length (>= 50000 bp) # contigs Largest contig Total length GC (%) N50 N90 auN L50 L90 # N's per 100 kbp -CTMA_1432 2 2 2 2 2 2 4103466 4103466 4103466 4103466 4103466 4103466 2 3057934 4103466 47.50 3057934 1045532 2545189.2 1 2 0.00 +ERX3876932_ERR3864879_T1_contigs 77 40 36 29 25 20 2848635 2836870 2827630 2778943 2698417 2521114 47 484893 2841571 32.73 109559 41891 191514.9 6 21 0 +ERX3876949_ERR3864896_T1_contigs 70 18 16 15 14 11 2683262 2671100 2666432 2659228 2636490 2507149 21 493468 2673387 32.69 251833 78361 283091.3 4 10 0 +ERX3876930_ERR3864877_T1_contigs 84 20 20 18 15 11 2729135 2714469 2714469 2701567 2644913 2501830 25 557153 2717749 32.7 251805 52910 319559.1 4 10 0 ``` -The assembly lengths obtained were all around 2 Mb, which is expected for this species. +The assembly lengths obtained were all around 2.7 to 2.8 Mb, which is expected for this species. -The contiguity of our assemblies seemed excellent, with only up to XX fragments. Some samples even had X fragments assembled, suggesting we managed to mostly reassemble our genomes. +The contiguity of our assemblies seemed excellent, with only up to 205 fragments. Some samples even had fewer than 50 fragments assembled, suggesting we managed to mostly reassemble our genomes. -Isolates XX and XX have more than X contigs. We may want to consider removing these isolates as they may affect our pan-genome calculation in the next section. +Isolate ERX3876939_ERR3864886_T1_contigs has more than 200 contigs. A this is higher than the number obtained for the rest of our assembled genomeWe may want to consider removing this isolate as it may affect our pan-genome calculation in the next section. However, as the rest of the metrics for this genome look alright, let's leave it in for now. ::: ::: @@ -151,14 +154,18 @@ We opened the file `results/assembleBAC/metadata/checkm2_summary.tsv` in _Excel_ ``` Name Completeness Contamination Completeness_Model_Used Translation_Table_Used Coding_Density Contig_N50 Average_Gene_Length Genome_Size GC_Content Total_Coding_Sequences Additional_Notes -CTMA_1402 99.99 1.16 Neural Network (Specific Model) 11 0.865 3058794 310.65996343692865 4118479 0.47 3829 None +ERX3876905_ERR3864852_T1_contigs 100 0.22 Neural Network (Specific Model) 11 0.841 188199 297.9439072 2743298 0.33 2585 None +ERX3876907_ERR3864854_T1_contigs 100 0.49 Neural Network (Specific Model) 11 0.839 174940 293.8538879 2900162 0.33 2765 None +ERX3876908_ERR3864855_T1_contigs 100 0.07 Neural Network (Specific Model) 11 0.841 707553 301.2190213 2717933 0.33 2534 None +ERX3876909_ERR3864856_T1_contigs 100 0.17 Neural Network (Specific Model) 11 0.84 181628 295.3832534 2854371 0.33 2711 None +ERX3876929_ERR3864876_T1_contigs 100 0.1 Neural Network (Specific Model) 11 0.841 605766 304.9176136 2675834 0.33 2464 None ``` We can see from this that: -- We achieved ~90% completeness (according to _CheckM2_'s database) for all our samples. +- We achieved 100% completeness (according to _CheckM2_'s database) for all our samples. - There was no evidence of strong contamination affecting our assemblies (all ~5% or below). -- The GC content was XX%, which is what is expected for _Staphylococcus aureus_. +- The GC content was 33%, which is what is expected for _Staphylococcus aureus_. ::: ::: diff --git a/materials/29-pathogenwatch.md b/materials/29-pathogenwatch.md index e2f0a19..3787627 100644 --- a/materials/29-pathogenwatch.md +++ b/materials/29-pathogenwatch.md @@ -9,7 +9,17 @@ title: "Pathogenwatch 2" ::: -Assemblies provided in `results/assemblebac/assemblies` +:::{.callout-exercise} +#### Analysing Pneumococcal genomes with Pathogenwatch + +Whilst you can upload FASTQ files to Pathogenwatch, it's quicker if we work with already assembled genomes. We've provided pre-processed results for the _S. pneumoniae_ data generated with AssembleBAC in the `preprocessed/` directory which you can use to upload to Pathogenwatch in this exercise. + +:::{.callout-answer} + + +::: + + ## Summary diff --git a/materials/31-intro_amr.md b/materials/31-intro_amr.md index 1a951c5..22534b0 100644 --- a/materials/31-intro_amr.md +++ b/materials/31-intro_amr.md @@ -3,15 +3,396 @@ title: "Introduction to Antimicrobial resistance" --- ::: {.callout-tip} -## Learning Objectives +#### Learning Objectives -- Understand what AMR is. +- Recognise the threats posed by antimicrobial resistance (AMR) to public health globally. +- Apply both command line and web applications to find potential AMR genes in a set of genomes. +- Recognise the limitations of computational AMR prediction and the importance of comparing results across multiple tools. ::: +## Antimicrobial Resistance (AMR) analysis + +Antimicrobial resistance (AMR) is a phenomenon where microorganisms, such as bacteria, evolve in a way that reduces the effectiveness of antimicrobial drugs, including antibiotics. +This occurs due to the overuse and misuse of these drugs, which exerts selective pressure on the microorganisms. +As a result, bacteria may develop or acquire genetic changes that enable them to **survive exposure to antimicrobial agents**, making the drugs less effective or entirely ineffective. +AMR poses a significant **global health threat**, as it can lead to infections that are challenging to treat, potentially causing increased morbidity and mortality. +Efforts to combat AMR include responsible antibiotic use, developing new drugs, and enhancing infection prevention and control measures. + +According to the [WHO](https://www.who.int/publications/i/item/9789241564748), antimicrobial resistance (AMR) has evolved into a global concern for public health. +This stems from various harmful bacterial strains developing resistance to antimicrobial medications, including antibiotics. +As part of our analysis, we will now focus on identifying AMR patterns connected to our _S. pneumoniae_ isolates. + +Numerous software tools have been created to predict the presence of genes linked to AMR in genome sequences. +Estimating the function of a gene or protein solely from its sequence is complex, leading to varying outcomes across different software tools. +It is advisable to employ multiple tools and compare their findings, thus increasing our confidence in identifying which antimicrobial drugs might be more effective for treating patients infected with the strains we're studying. + +In this section we will introduce a workflow aimed at combining the results from various AMR tools into a unified analysis. +We will compare its results with AMR analysis performed by _Pathogenwatch_. + +## _Funcscan_ workflow {#sec-funcscan} + +Here, we introduce an automated workflow called **[`nf-core/funcscan`](https://nf-co.re/funcscan/1.1.2)** (@fig-funcscan), which uses _Nextflow_ to manage all the software and analysis steps (see information box above). +This pipeline uses five different AMR screening tools: **[ABRicate](https://github.com/tseemann/abricate)**, **[AMRFinderPlus (NCBI Antimicrobial Resistance Gene Finder)](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/)**, **[fARGene (Fragmented Antibiotic Resistance Gene idENntifiEr)](https://github.com/fannyhb/fargene)**, **[RGI (Resistance Gene Identifier)](https://card.mcmaster.ca/analyze/rgi)**, and **[DeepARG](https://readthedocs.org/projects/deeparg/)**. +This is convenient, as we can obtain the results from multiple approaches in one step. + +![Overview of the `nf-core/funcscan` workflow. In our case we will run the "Antimicrobial Resistance Genes (ARGs)" analysis, shown in yellow. Image source: https://nf-co.re/funcscan/1.1.2](https://raw.githubusercontent.com/nf-core/funcscan/1.1.2/docs/images/funcscan_metro_workflow.png){#fig-funcscan} + +This pipeline requires us to prepare a samplesheet CSV file with information about the samples we want to analyse. +Two columns are required: + +- `sample` --> a sample name of our choice (we will use the same name that we used for the assembly). +- `fasta` --> the path to the FASTA file corresponding to that sample. + +You can create this file using a spreadsheet software such as _Excel_, making sure to save the file as a CSV. +Here is an example of our samplesheet, which we saved in a file called `samplesheet_funcscan.csv`: + +``` +sample,fasta +ERX1265396_ERR1192012_T1,preprocessed/assemblebac/assemblies/ERX1265396_ERR1192012_T1_contigs.fa +ERX1265488_ERR1192104_T1,preprocessed/assemblebac/assemblies/ERX1265488_ERR1192104_T1_contigs.fa +ERX1501202_ERR1430824_T1,preprocessed/assemblebac/assemblies/ERX1501202_ERR1430824_T1_contigs.fa +ERX1501203_ERR1430825_T1,preprocessed/assemblebac/assemblies/ERX1501203_ERR1430825_T1_contigs.fa +ERX1501204_ERR1430826_T1,preprocessed/assemblebac/assemblies/ERX1501204_ERR1430826_T1_contigs.fa +ERX1501205_ERR1430827_T1,preprocessed/assemblebac/assemblies/ERX1501205_ERR1430827_T1_contigs.fa +ERX1501206_ERR1430828_T1,preprocessed/assemblebac/assemblies/ERX1501206_ERR1430828_T1_contigs.fa +ERX1501207_ERR1430829_T1,preprocessed/assemblebac/assemblies/ERX1501207_ERR1430829_T1_contigs.fa +ERX1501208_ERR1430830_T1,preprocessed/assemblebac/assemblies/ERX1501208_ERR1430830_T1_contigs.fa +ERX1501212_ERR1430834_T1,preprocessed/assemblebac/assemblies/ERX1501212_ERR1430834_T1_contigs.fa +``` + +Once we have the samplesheet ready, we can run the `nf-core/funcscan` workflow using the following commands: + +```bash +# activate the environment +mamba activate nextflow + +# create output directory +mkdir results/funcscan + +# run the pipeline +nextflow run nf-core/funcscan -profile singularity \ + --max_memory 16.GB --max_cpus 8 \ + --input samplesheet_funcscan.csv \ + --outdir results/funcscan \ + --run_arg_screening \ + --arg_skip_deeparg +``` + +The options we used are: + +- `-profile singularity` - indicates we want to use the _Singularity_ program to manage all the software required by the pipeline (another option is to use `docker`). See [Data & Setup](../../setup.md) for details about their installation. +- `--max_memory` and `--max_cpus` - sets the available RAM memory and CPUs. You can check this with the commands `free -h` and `nproc --all`, respectively. +- `--input` - the samplesheet with the input files, as explained above. +- `--outdir` - the output directory for the results. +- `--run_arg_screening` - indicates we want to run the "antimicrobial resistance gene screening tools". There are also options to run antimicrobial peptide and biosynthetic gene cluster screening ([see documentation](https://nf-co.re/funcscan/1.1.2/parameters#screening-type-activation)). +- `--arg_skip_deeparg` - this skips a step in the analysis which uses the software _DeepARG_. We did this simply because this software takes a very long time to run. But in a real analysis you may want to leave this option on. + +While the pipeline runs, you will get a progress printed on the screen, and then a message once it finishes. +Here is an example from our samples: + +``` +* Software dependencies + https://github.com/nf-core/funcscan/blob/master/CITATIONS.md +------------------------------------------------------ +executor > slurm (1371) +[cf/4d7565] process > NFCORE_FUNCSCAN:FUNCSCAN:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet_funcscan.csv) [100%] 1 of 1 ✔ +[- ] process > NFCORE_FUNCSCAN:FUNCSCAN:GUNZIP_FASTA_PREP - +[f5/5767a9] process > NFCORE_FUNCSCAN:FUNCSCAN:BIOAWK (ERX1501218_ERR1430840_T1) [100%] 50 of 50 ✔ +[a9/164c9c] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:AMRFINDERPLUS_UPDATE (update) [100%] 1 of 1, cached: 1 ✔ +[8f/6d023e] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:AMRFINDERPLUS_RUN (ERX1501260_ERR1430882_T1) [100%] 50 of 50 ✔ +[6a/c9f0ce] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_AMRFINDERPLUS (ERX1501260_ERR1430882_T1) [100%] 50 of 50 ✔ +[da/67f664] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:FARGENE (ERX1501226_ERR1430848_T1) [100%] 500 of 500 ✔ +[79/73f7b3] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_FARGENE (ERX1501226_ERR1430848_T1) [ 99%] 519 of 520 +[0f/d47de8] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:RGI_MAIN (ERX1501263_ERR1430885_T1) [100%] 50 of 50 ✔ +[21/06df72] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_RGI (ERX1501263_ERR1430885_T1) [100%] 50 of 50 ✔ +[eb/b67eca] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:ABRICATE_RUN (ERX1501218_ERR1430840_T1) [100%] 50 of 50 ✔ +[d6/e5fdd4] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_ABRICATE (ERX1501218_ERR1430840_T1) [100%] 50 of 50 ✔ +[- ] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_SUMMARIZE - +[- ] process > NFCORE_FUNCSCAN:FUNCSCAN:CUSTOM_DUMPSOFTWAREVERSIONS - +[- ] process > NFCORE_FUNCSCAN:FUNCSCAN:MULTIQC - +``` + + +### `funcscan` outputs + +The main output of interest from this pipeline is a CSV file, which contains a summary of the results from all the AMR tools used +This summary is produced by a software called [_hAMRonization_](https://github.com/pha4ge/hAMRonization) and the corresponding CSV file is saved in `results/funcscan/reports/hamronization_summarize/hamronization_combined_report.tsv`. +You can open this file using any standard spreadsheet software such as _Excel_ (@fig-hamronization). + +This file is quite large, containing many columns and rows (we detail these columns in the information box below). +The easiest way to query this table is to filter the table based on the column "antimicrobial_agent" to remove rows where no AMR gene was detected (@fig-hamronization). +This way you are left with only the results which were positive for the AMR analysis. + +![To analyse the table output by _hAMRonization_ in _Excel_ you can go to "Data" --> "Filter". Then, select the dropdown button on the "antimicrobial_agent" column and untick the box "blank". This will only show the genes associated with resistance to antimicrobial drugs.](images/amr_hamronization.svg){#fig-hamronization} + +:::{.callout-note collapse=true} +#### _hAMRonization_ report columns (click to expand) + +TODO + +::: + + +You can also look at the detailed results of each individual tool, which can be found in the directory `results/funcscan/arg`. +This directory contains sub-directories for each of the 5 AMR tools used (in our case only 4 folders, because we skipped the _DeepARG_ step): + +```bash +ls results/funcscan/arg +``` + +``` +abricate amrfinderplus fargene hamronization rgi +``` + +For each individual tool's output folder shown above, there is a report, which is associated with the predicted AMRs for each of our samples. +In most cases, the report is in tab-delimited TSV format, which can be opened in a standard spreadsheet software such as _Excel_. +For instance, the AMR report from _Abricate_ for one of our samples looks like this: + +```bash +less -S results/funcscan/arg/abricate/isolate02/isolate02.txt +``` + +``` +#FILE SEQUENCE START END STRAND GENE COVERAGE COVERAGE_MAP GAPS %COVERAGE %IDENTITY DATABASE ACCESSION PRODUCT RESISTANCE +isolate02.fasta contig_2 1696 2623 - blaPER-7 1-927/927 ========/====== 1/1 100.00 99.89 ncbi NG_049966.1 class A extended-spectrum beta-lactamase PER-7 CEPHALOSPORIN +isolate02.fasta contig_2 4895 5738 - sul1 1-840/840 ========/====== 4/4 100.00 98.93 ncbi NG_048091.1 sulfonamide-resistant dihydropteroate synthase Sul1 SULFONAMIDE +isolate02.fasta contig_2 6243 7036 - aadA2 1-792/792 ========/====== 2/2 100.00 99.50 ncbi NG_047343.1 ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA2 STREPTOMYCIN +isolate02.fasta contig_3 966452 967081 + catB9 1-630/630 =============== 0/0 100.00 99.84 ncbi NG_047621.1 type B-5 chloramphenicol O-acetyltransferase CatB9 CHLORAMPHENICOL +isolate02.fasta contig_4 778899 780023 + varG 1-1125/1125 =============== 0/0 100.00 100.00 ncbi NG_057468.1 VarG family subclass B1-like metallo-beta-lactamase CARBAPENEM +isolate02.fasta contig_4 2573875 2574348 - dfrA1 1-474/474 =============== 0/0 100.00 100.00 ncbi NG_047676.1 trimethoprim-resistant dihydrofolate reductase DfrA1 TRIMETHOPRIM +isolate02.fasta contig_7 4178 5099 - mph(A) 1-921/921 ========/====== 1/1 100.00 99.35 ncbi NG_047986.1 Mph(A) family macrolide 2'-phosphotransferase MACROLIDE +isolate02.fasta contig_7 6594 8069 + msr(E) 1-1476/1476 =============== 0/0 100.00 100.00 ncbi NG_048007.1 ABC-F type ribosomal protection protein Msr(E) MACROLIDE +isolate02.fasta contig_7 8125 9009 + mph(E) 1-885/885 =============== 0/0 100.00 100.00 ncbi NG_064660.1 Mph(E) family macrolide 2'-phosphotransferase MACROLIDE +isolate02.fasta contig_7 131405 132197 + aadA2 1-792/792 ========/====== 1/1 100.00 99.62 ncbi NG_047343.1 ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA2 STREPTOMYCIN +``` + +For this sample there were several putative AMR genes detected by _Abricate_, with their associated drugs. +These genes were identified based on their similarity with annotated sequences from the NCBI database. +For example, the gene _varG_ was detected in our sample, matching the NCBI accession [NG_057468.1](https://www.ncbi.nlm.nih.gov/nuccore/NG_057468.1). +This is annotated as as a reference for antimicrobial resistance, in this case to the drug "CARBAPENEM". + +:::{.callout-note} +#### Command line trick + +Here is a trick using standard commands to count how many times each drug was identified by `funcscan`: + +```bash +cat results/funcscan/reports/hamronization_summarize/hamronization_combined_report.tsv | cut -f 10 | sort | uniq -c +``` + +- `cat` prints the content of the file +- `cut` extracts the 10th column from the file +- `sort` and `uniq -c` are used in combination to count unique output values + +The result of the above command is: + +``` + 9 CARBAPENEM + 8 CEPHALOSPORIN + 8 CHLORAMPHENICOL +27 MACROLIDE +13 QUATERNARY AMMONIUM +10 STREPTOMYCIN + 1 SULFONAMIDE +10 TRIMETHOPRIM +``` +::: + + +## AMR with _Pathogenwatch_ + +_Pathogenwatch_ also performs AMR prediction using its own [algorithm and curated gene sequences](https://cgps.gitbook.io/pathogenwatch/technical-descriptions/antimicrobial-resistance-prediction/pw-amr). +The results from this analysis can be seen from the individual sample report, or summarised in the collection view. + +![AMR analysis from _Pathogenwatch_. The summary table (top) can be accessed from the sample collections view, by selecting "Antibiotics" from the drop-down on the top-left. The table summarises resistance to a range of antibiotics (red = resistant; yellow = intermediate). More detailed results can be viewed for each individual sample by clicking on its name and opening the sample report (bottom).](images/amr_pathogenwatch.png){#fig-amr_pathogenwatch} + + +## Which AMR do my isolates have? + +At this stage you may notice that different tools will give you a different answer to this question and it is therefore recommended to **compare the results across multiple tools**. +For example, _Pathogenwatch_ generally detects AMR for comparatively more antimicrobial drugs compared to the `funcscan` analysis. +However, some of the drugs detected by `funcscan` were either not reported by _Pathogenwatch_ (possibly because they are not part of its database) or have a disagreeing result. + +Let's take a specific example. +_Pathogenwatch_ determined that none of our isolates were resistant to Streptomycin. +However, in the _hAMRonization_ summary table (output by `funcscan`) we can see that this drug was reported for several of our samples. +Upon closer inspection, however, we can see that we only had partial matches to the reference NCBI sequence ([WP_001206356.1](https://www.ncbi.nlm.nih.gov/protein/WP_001206356.1)), or in the case of one sample with a higher match the sequence identity was less than 100% (table below, showing some of the columns from the _hAMRonization_ table). + +``` +input_file_name gene_symbol reference_accession antimicrobial_agent coverage_percentage sequence_identity +isolate01.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate02.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate02.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate04.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate05.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate06.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate07.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate08.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate09.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 68.44 100 +isolate10.tsv.amrfinderplus aadA2 WP_001206356.1 STREPTOMYCIN 100 92.05 +``` + +It is also important to take into consideration our earlier [assembly quality assessments](../02-assembly/04-assembly_quality.md) as they may result in **false negative results**. +For example, we can see that "isolate05" has the lowest AMR detection of all samples. +However, this was the sample with the lowest genome coverage (only 21x) and with a resulting highly fragmented genome (229 fragments). +Therefore, it is very possible that we missed parts of its genome during assembly, and that some of those contained AMR genes or plasmids. + +In conclusion, always be critical of the analysis of your results at this stage, comparing the output from different tools as well as considering the quality of your assemblies. +Ultimately, the safest way to assess AMR is with **experimental validation**, by testing those strains against the relevant antimicrobial agents in the lab. +However, computational analysis such as what we did can help inform these experiments and treatment decisions. + + +## Exercises + + +For these exercises, you can either use the dataset we provide in [**Data & Setup**](../../setup.md), or your own data. +You also need to have completed the genome assembly exercise in @sec-ex-assembly. + +:::{.callout-exercise} +#### Funcscan workflow + +Run the `nf-core/funcscan` workflow on the assembled genomes for your samples. + +- Using _Excel_, create a samplesheet CSV file for your samples, required as input for this pipeline. See @sec-funcscan if you need to revise the format of this samplesheet. +- Activate the software environment: `mamba activate nextflow`. +- Fix the script provided in `scripts/07-amr.sh`. +- Run the script using `bash scripts/07-amr.sh`. + +Once the workflow is running it will print a progress message on the screen. +This will take a while to run, so you can do the next exercise, and then continue with this one. + +Once the analysis finishes, open the output file `results/funcscan/reports/hamronization_summarize/hamronization_combined_report.tsv` in _Excel_. +Answer the following questions: + +- Did this analysis find evidence for antimicrobial resistance to any drugs? +- Did all your samples show evidence for AMR? + + +:::{.callout-answer} + +We created a samplesheet for our samples in _Excel_, making sure to "Save As..." CSV format. +The raw file looks like this: + +``` +sample,fasta +CTMA_1402,results/assemblies/CTMA_1402.fasta +CTMA_1421,results/assemblies/CTMA_1421.fasta +CTMA_1427,results/assemblies/CTMA_1427.fasta +CTMA_1432,results/assemblies/CTMA_1432.fasta +CTMA_1473,results/assemblies/CTMA_1473.fasta +``` + +The fixed script is: + +```bash +#!/bin/bash + +# create output directory +mkdir results/funcscan + +# run the pipeline +nextflow run nf-core/funcscan -profile singularity \ + --max_memory 16.GB --max_cpus 8 \ + --input samplesheet_funcscan.csv \ + --outdir results/funcscan \ + --run_arg_screening \ + --arg_skip_deeparg +``` + +While the script was running we got a progress of the analysis printed on the screen. +Once it finished we got a message like this (yours will look slightly different): + +``` +Completed at: 12-Aug-2023 12:24:03 +Duration : 44m 54s +CPU hours : 3.0 +Succeeded : 277 +``` + +We opened the _hAMRonization_ output report file in _Excel_ and filtered it for the column "antimicrobial_agent". +We identified the following (only a few columns shown for simplicity): + +``` +input_file_name gene_symbol antimicrobial_agent +CTMA_1402.tsv.amrfinderplus aph(6)-Id STREPTOMYCIN +CTMA_1402.tsv.amrfinderplus catB9 CHLORAMPHENICOL +CTMA_1402.tsv.amrfinderplus dfrA1 TRIMETHOPRIM +CTMA_1402.tsv.amrfinderplus floR CHLORAMPHENICOL/FLORFENICOL +CTMA_1402.tsv.amrfinderplus sul2 SULFONAMIDE +CTMA_1402.tsv.amrfinderplus varG CARBAPENEM +CTMA_1421.tsv.amrfinderplus aph(3'')-Ib STREPTOMYCIN +CTMA_1421.tsv.amrfinderplus aph(6)-Id STREPTOMYCIN +CTMA_1421.tsv.amrfinderplus catB9 CHLORAMPHENICOL +CTMA_1421.tsv.amrfinderplus dfrA1 TRIMETHOPRIM +CTMA_1421.tsv.amrfinderplus floR CHLORAMPHENICOL/FLORFENICOL +CTMA_1421.tsv.amrfinderplus sul2 SULFONAMIDE +CTMA_1421.tsv.amrfinderplus varG CARBAPENEM +CTMA_1427.tsv.amrfinderplus aph(3'')-Ib STREPTOMYCIN +CTMA_1427.tsv.amrfinderplus aph(6)-Id STREPTOMYCIN +CTMA_1427.tsv.amrfinderplus catB9 CHLORAMPHENICOL +CTMA_1427.tsv.amrfinderplus dfrA1 TRIMETHOPRIM +CTMA_1427.tsv.amrfinderplus floR CHLORAMPHENICOL/FLORFENICOL +CTMA_1427.tsv.amrfinderplus sul2 SULFONAMIDE +CTMA_1427.tsv.amrfinderplus varG CARBAPENEM +CTMA_1432.tsv.amrfinderplus aph(3'')-Ib STREPTOMYCIN +CTMA_1432.tsv.amrfinderplus aph(6)-Id STREPTOMYCIN +CTMA_1432.tsv.amrfinderplus catB9 CHLORAMPHENICOL +CTMA_1432.tsv.amrfinderplus dfrA1 TRIMETHOPRIM +CTMA_1432.tsv.amrfinderplus floR CHLORAMPHENICOL/FLORFENICOL +CTMA_1432.tsv.amrfinderplus sul2 SULFONAMIDE +CTMA_1432.tsv.amrfinderplus varG CARBAPENEM +CTMA_1473.tsv.amrfinderplus aph(3'')-Ib STREPTOMYCIN +CTMA_1473.tsv.amrfinderplus aph(6)-Id STREPTOMYCIN +CTMA_1473.tsv.amrfinderplus catB9 CHLORAMPHENICOL +CTMA_1473.tsv.amrfinderplus dfrA1 TRIMETHOPRIM +CTMA_1473.tsv.amrfinderplus floR CHLORAMPHENICOL/FLORFENICOL +CTMA_1473.tsv.amrfinderplus sul2 SULFONAMIDE +CTMA_1473.tsv.amrfinderplus varG CARBAPENEM +``` + +All 5 of our samples show evidence of AMR to different antimicrobial drugs. +All of them are quite similar, with resistance to a similar set of drugs. + +::: +::: + +:::{.callout-exercise} +#### AMR with _Pathogenwatch_ + +Following from the _Pathogenwatch_ exercise in @sec-ex-pathogenwatch, open the "Ambroise 2023" collection that you created and answer the following questions: + +- Open the antibiotics summary table. +- Do all your samples have evidence for antibiotic resistance? +- If any samples have resistance to much fewer antibiotics compared to the others, do you think this could be related to assembly quality? +- How do the results from _Pathogenwatch_ compare to those from `nf-core/funcscan`? + +How do the results compare with Pathogenwatch? + +:::{.callout-answer} + +We can open the "Antibiotics" table from the top-left dropdown menu, as shown in the image below. + +![](images/pathogenwatch-ambroise04.png) + +We can see that _Pathogenwatch_ identified resistance to several antibiotics. +All samples are similar, except "1432" doesn't have resistance to furazolidone and nitrofurantoin. +This sample had high completeness, according to _CheckM2_. +However, it was also the sample with the lowest sequencing coverage (from our analysis during the genome assembly in @sec-ex-assembly), so this could be a "false negative" result due to the inability to cover some of the genes that confer AMR. + +All of the drugs identified by _funcscan_ were also identified by _Pathogenwatch_ (note that sulfamethoxazole and sulfisoxazole identified by _Pathogenwatch_ are both sulfonamide-derived drugs, reported by funcscan). +However, _Pathogenwatch_ identified resistance to several other drugs: ampicilin, ceftazidime, cephalosporins, ceftriazone, cefepime, furazolidone and nitrofurantoin. +::: +::: + ## Summary ::: {.callout-tip} -## Key Points +#### Key Points +- AMR poses significant global public health threats by diminishing the effectiveness of antibiotics, making it challenging to treat infectious diseases effectively. +- AMR software aims to identify specific genes or mutations known to confer resistance to antimicrobial agents. These tools compare input genetic sequences to known resistance genes or patterns in their associated databases. +- The `nf-core/funcscan` workflow performs AMR analysis using several software tools and producing a summary of their results as a CSV file. +- _Pathogenwatch_ is a more user-friendly application, which performs AMR using its own curated database. +- AMR prediction can result in false results (either false positives or false negatives). One way to overcome this limitation is to compare the results from multiple tools and, whenever possible, complement it with validation assays in the lab. ::: \ No newline at end of file diff --git a/materials/appendices/02-course_software.md b/materials/appendices/02-course_software.md index 828f50f..09e2ed6 100644 --- a/materials/appendices/02-course_software.md +++ b/materials/appendices/02-course_software.md @@ -60,6 +60,10 @@ Sequence Type assignment summarises and creates visualizations for outputs from `fastQC`, `fastp` and `MultiQC` +### [Nextflow](https://www.nextflow.io/) + +### [pairsnp](https://github.com/gtonkinhill/pairsnp) + ### [Panaroo](https://gtonkinhill.github.io/panaroo/) ### [Quast](https://quast.sourceforge.net/) @@ -70,10 +74,14 @@ assembly metrics downsamples fastq files to 100X by default (Optional) +### [remove_blocks_from_aln.py](https://github.com/sanger-pathogens/remove_blocks_from_aln) + ### [SAMtools](https://sourceforge.net/projects/samtools/files/samtools/) sorts and indexes alignments +### [seqtk](https://github.com/lh3/seqtk) + ### [Shovill](https://github.com/tseemann/shovill) *de novo* genome assembly