From 98c47b08c6df891ca377d06588c211364eb9653b Mon Sep 17 00:00:00 2001
From: avantonder <ajv37@cam.ac.uk>
Date: Wed, 15 Nov 2023 17:33:26 +0000
Subject: [PATCH] update 151123

---
 materials/02-know_your_bug.md              |   6 +
 materials/03-nextflow.md                   |   2 -
 materials/05-intro_tb.md                   |   2 +-
 materials/07-bacqc.md                      |  12 +
 materials/09-bactmap.md                    |  31 --
 materials/17-intro_staph.md                |   4 +
 materials/18-de_novo_assembly.md           |   1 -
 materials/21-assembly_qc.md                |  61 ++--
 materials/29-pathogenwatch.md              |  12 +-
 materials/31-intro_amr.md                  | 387 ++++++++++++++++++++-
 materials/appendices/02-course_software.md |   8 +
 11 files changed, 460 insertions(+), 66 deletions(-)

diff --git a/materials/02-know_your_bug.md b/materials/02-know_your_bug.md
index e68ce8b..d49f9f0 100644
--- a/materials/02-know_your_bug.md
+++ b/materials/02-know_your_bug.md
@@ -13,6 +13,12 @@ Before starting your analysis, it's important to understand the characteristics
 
 ![Flowchart showing different approaches to building a bacterial phylogenetic tree](images/know_your_bug.jpeg)
 
+This week, we're going to work with three different bacterial species that each require a different analysis path through the diagram above.
+
+- _Mycobacterium tuberculosis_ - a monoclonal (i.e. relatively little genomic diversity) organism that doesn't recombine so we'll build a phylogenetic tree by mapping to a reference.
+- _Staphylococcus aureus_ - a diverse organism with low levels of recombination. We're going to be working with a dataset consisting of more than a single lineage so we'll take the approach of building a core gene alignment to use to infer our phylogenetic tree.
+- _Streptococcus pneumoniae_ - a diverse organism that frequently recombines. As the dataset consists of isolates from a single lineage/serotype, we're going to map to a reference and identify and remove recombinant regions before building our phylogenetic tree.
+
 ## Summary
 
 ::: {.callout-tip}
diff --git a/materials/03-nextflow.md b/materials/03-nextflow.md
index e1fd9c2..5053054 100644
--- a/materials/03-nextflow.md
+++ b/materials/03-nextflow.md
@@ -100,8 +100,6 @@ Nextflow provides a separation between the pipeline’s functional logic and the
 
 Nextflow provides out-of-the-box support for major batch schedulers and cloud platforms such as Sun Grid Engine, SLURM job scheduler, AWS Batch service and Kubernetes. A full list can be found [here](https://www.nextflow.io/docs/latest/executor.html).
 
-## Nextflow Tower/Seqera
-
 ## Snakemake
 
 In this tutorial we've focused on Nextflow but many people in the bioinformatics community use Snakemake.  Similar to Nextflow, the Snakemake workflow management system is a tool for creating reproducible and scalable data analyses. The main difference is that workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
diff --git a/materials/05-intro_tb.md b/materials/05-intro_tb.md
index d03c001..d6ec220 100644
--- a/materials/05-intro_tb.md
+++ b/materials/05-intro_tb.md
@@ -21,7 +21,7 @@ Increasingly, *M. tuberculosis* is resistant to many of the frontline antimicrob
 
 ## Course dataset
 
-We will be analysing a dataset of XX Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course and for time, we're only going to analyse 50 genomes from the dataset.
+We will be analysing a dataset of Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course, mainly to save time,  we're only going to analyse 50 genomes from the dataset.
 
 ## MTBC ancestral reference sequence
 
diff --git a/materials/07-bacqc.md b/materials/07-bacqc.md
index e18c671..386a468 100644
--- a/materials/07-bacqc.md
+++ b/materials/07-bacqc.md
@@ -304,6 +304,18 @@ With respect to potential species contamination, no samples contained more than
 :::
 :::
 
+:::{.callout-warning}
+
+**The Nextflow work directory**
+
+Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory.  This means that if, for whatever reason, the pipeline doesn't finish successfully you can resume it.  However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it:
+
+```bash
+rm -rf work
+```
+
+:::
+
 ## Summary
 
 ::: {.callout-tip}
diff --git a/materials/09-bactmap.md b/materials/09-bactmap.md
index dd2bd13..0008b21 100644
--- a/materials/09-bactmap.md
+++ b/materials/09-bactmap.md
@@ -161,29 +161,6 @@ This section of the report shows the software run as part of nf-core/bactmap and
 
 ![nf-core/bactmap MultiQC software versions](images/bactmap_software_versions.png)
 
-### The pseudogenomes directory
-
-This directory contains the files that are most useful for our downstream analyses.  Change to the directory and list the files:
-
-```bash
-cd bactmap_results/pseudogenomes
-
-ls
-```
-
-You will see the pseudogenome fasta files (a version of the reference file where the sample variants have been inserted and low-quality or missing data has been masked) for each sample, an alignment of all the samples  and the reference sequence (`aligned_pseudogenomes.fas`), and a tsv file of genomes removed from the complete alignment due to poor mapping (`low_quality_pseudogenomes.tsv`):
-
-```bash
-aligned_pseudogenomes.fas
-low_quality_pseudogenomes.tsv
-```
-
-Let's check to see if any of our samples were removed from our alignment:
-
-```bash
-cat low_quality_pseudogenomes.tsv
-```
-This returns an empty file; all of our samples mapped well enough to the reference to be included in our complete alignment.  That means we can proceed to the next step of our analysis: phylogenetic tree inference.
 
 ## Check how much of the reference was mapped
 
@@ -218,14 +195,6 @@ remove_blocks_from_aln.py -a aligned_pseudogenomes.fas -t ../../../resources/mas
 
 bash scripts/04-mask_pseudogenome.sh
 ```
-### The work directory
-
-Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory.  This means that if, for whatever reason, the pipeline doesn't finish successfully you can resume it.  However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it:
-
-```bash
-rm -rf work
-```
-
 ## Summary
 
 ::: {.callout-tip}
diff --git a/materials/17-intro_staph.md b/materials/17-intro_staph.md
index 5f69cc4..01af774 100644
--- a/materials/17-intro_staph.md
+++ b/materials/17-intro_staph.md
@@ -9,6 +9,10 @@ title: "Introduction to *Staphylococcus aureus*"
 
 :::
 
+## Course dataset
+
+van Tonder 2023
+
 ## Summary
 
 ::: {.callout-tip}
diff --git a/materials/18-de_novo_assembly.md b/materials/18-de_novo_assembly.md
index ea612d6..f5321bb 100644
--- a/materials/18-de_novo_assembly.md
+++ b/materials/18-de_novo_assembly.md
@@ -6,7 +6,6 @@ title: "de novo Assembly"
 ## Learning Objectives
 
 - Understand what *de novo* genome assembly is and how it differs from reference-based assembly.
-- Assemble sequence data with `assembleBAC`
 
 :::
 
diff --git a/materials/21-assembly_qc.md b/materials/21-assembly_qc.md
index b02f7fa..c3e4aa5 100644
--- a/materials/21-assembly_qc.md
+++ b/materials/21-assembly_qc.md
@@ -37,18 +37,17 @@ Let's now turn to some of the other metrics to help us assess our assemblies' qu
 
 ## Contiguity
 
-One of the outputs from running assembleBAC is a summary file containing the `Quast` outputs for each sample.  This file can be found in `results/assembleBAC/metadata/transposed_report.csv`.
+One of the outputs from running assembleBAC is a summary file containing the `Quast` outputs for each sample.  This file can be found in `results/assembleBAC/metadata/transposed_report.tsv`.
 
-You can open it with a spreadsheet software such as _Excel_ from our file browser <i class="fa-solid fa-folder"></i>. 
-Alternatively you can print it directly on the terminal in a nice format using the `column` command, as shown here for the samples we've been using as an example ((for brevity, we only show some columns of most interest):): 
-
-```bash
-column -t -s "," results/assembleBAC/metadata/transposed_report.csv
-```
+You can open it with a spreadsheet software such as _Excel_ from our file browser <i class="fa-solid fa-folder"></i> (for brevity, we only show the columns of most interest): 
 
 ```
 Assembly	# contigs (>= 0 bp)	# contigs (>= 1000 bp)	Total length (>= 0 bp)	Largest contig	N50
-CTMA_1432	2	2	4103466	3057934	3057934
+ERX3876932_ERR3864879_T1_contigs	77	40	2848635	484893	109559
+ERX3876949_ERR3864896_T1_contigs	70	18	2683262	493468	251833
+ERX3876930_ERR3864877_T1_contigs	84	20	2729135	557153	251805
+ERX3876908_ERR3864855_T1_contigs	45	13	2717933	792936	707553
+ERX3876945_ERR3864892_T1_contigs	30	9	2670961	1351816	1351816
 ```
 
 The columns are: 
@@ -61,10 +60,10 @@ The columns are:
 - **N50** - a metric indicating the length of the shortest fragment, from the group of fragments that together represent at least 50% of the total genome. A higher N50 value suggests better contig lengths.
 
 To interpret these statistics, it helps to compare them with other well-assembled _Staphylococcus aureus_ genomes. 
-For example, let's take the first MRSA genome that was sequenced, [XXXX]('NCBI link'), as our reference for comparison. 
-This genome is X Mb long and is composed of a single chromosome.
+For example, let's take the first MRSA genome that was sequenced, [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/), as our reference for comparison. 
+This genome is 2.8 Mb long and is composed of a single chromosome.
 
-We can see that all of our assemblies reached a total length of around XX to XX Mb, which matches the expected length from our reference genome.
+We can see that all of our assemblies reached a total length of around 2.7 to 2.9 Mb, which matches the expected length from our reference genome.
 This indicates that we managed to assemble most of the expected genome. 
 However, we can see that there is a variation in the number of fragments in the final assemblies (i.e. their contiguity). 
 For instance, isolates X and X were assembled to a small number of fragments each, suggesting good assemblies. For several other isolates our assemblies were more fragmented, in particular X and X, which had hundreds of fragments. 
@@ -72,11 +71,15 @@ This indicates less contiguous sequences.
 
 ## Completeness
 
-We now turn to assessing **genome completeness**, i.e. whether we managed to recover most of the known _Staphylococcus aureus_ genome, or whether we have large fractions missing. We can assess this by using [`CheckM2`](https://github.com/chklovski/CheckM2) which was run as part of the assembleBAC pipeline. This tool assesses the completeness of the assembled genomes based on other similar organisms in public databases, in addition to contamination scores. The output file from assembleBAC summarising the `CheckM2` results is a tab-delimited file called `checkm2_summary.tsv`. This file can be found in `results/assembleBAC/metadata/` and can be opened in a spreadsheet software such as _Excel_. Here is an example result (for brevity, we only show some columns of most interest):
+We now turn to assessing **genome completeness**, i.e. whether we managed to recover most of the known _Staphylococcus aureus_ genome, or whether we have large fractions missing. We can assess this by using [`CheckM2`](https://github.com/chklovski/CheckM2) which was run as part of the assembleBAC pipeline. This tool assesses the completeness of the assembled genomes based on other similar organisms in public databases, in addition to contamination scores. The output file from assembleBAC summarising the `CheckM2` results is a tab-delimited file called `checkm2_summary.tsv`. This file can be found in `results/assembleBAC/metadata/` and can be opened in a spreadsheet software such as _Excel_. Here is an example result (for brevity, we only show the columns of most interest):
 
 ```
-Name       Completeness  Contamination  Genome_Size  GC_Content  Total_Coding_Sequences
-isolate01  92.51         5.19           4192535      0.48        5375
+Name	Completeness	Contamination	Genome_Size	GC_Content	Total_Coding_Sequences
+ERX3876905_ERR3864852_T1_contigs	100	0.22	2743298	0.33	2585
+ERX3876907_ERR3864854_T1_contigs	100	0.49	2900162	0.33	2765
+ERX3876908_ERR3864855_T1_contigs	100	0.07	2717933	0.33	2534
+ERX3876909_ERR3864856_T1_contigs	100	0.17	2854371	0.33	2711
+ERX3876929_ERR3864876_T1_contigs	100	0.1	2675834	0.33	2464
 ```
 These columns indicate:
 
@@ -84,15 +87,13 @@ These columns indicate:
 - **Completeness** - how complete our genome was inferred to be as a percentage; this is based on the machine learning models used and the organisms present in the database.
 - **Contamination** - the percentage of the assembly estimated to be contaminated with other organisms (indicating our assembly isn't "pure"). 
 - **Genome_Size** - how big the genome is estimated to be, based on other similar genomes present in the database. 
-  The [XXXX]('NCBI link') is X Mb in total, so these values make sense. 
+  The [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/) is 2.8 Mb in total, so these values make sense. 
 - **GC_Content** - the percentage of G's and C's in the genome, which is relatively constant within a species. 
-  The _S. aureus_ GC content is X%, so again these values make sense.
+  The _S. aureus_ GC content is approximately 33%, so again these values make sense.
 - **Total_Coding_Sequences** - the total number of coding sequences (genes) that were identified by _CheckM2_. 
-  The [reference genome]('NCBI link') indicates  annotated genes, so the values obtained could be overestimated.
-
-From this analysis, we can assess that our genome assemblies are good quality, with around 90% of the genome assembled. 
-Despite earlier assessing that some of the isolates have more fragmented assemblies (e.g. isolates 5 and 9), they still have >85% completeness according to _CheckM2_. 
+  The [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/) indicates  annotated genes, so the values obtained could be overestimated.
 
+From this analysis, we can assess that our genome assemblies are good quality, with 100% of the genome assembled for all our isolates. 
 It is worth noting that the assessment from _CheckM2_ is an approximation based on other genomes in the database. 
 In diverse species such as _S. aureus_ the completeness may be underestimated, because each individual strain will only carry part of the _pangenome_ for that species.
 
@@ -123,14 +124,16 @@ We opened this table in _Excel_ and this is what we obtained:
 
 ```
 Assembly	# contigs (>= 0 bp)	# contigs (>= 1000 bp)	# contigs (>= 5000 bp)	# contigs (>= 10000 bp)	# contigs (>= 25000 bp)	# contigs (>= 50000 bp)	Total length (>= 0 bp)	Total length (>= 1000 bp)	Total length (>= 5000 bp)	Total length (>= 10000 bp)	Total length (>= 25000 bp)	Total length (>= 50000 bp)	# contigs	Largest contig	Total length	GC (%)	N50	N90	auN	L50	L90	# N's per 100 kbp
-CTMA_1432	2	2	2	2	2	2	4103466	4103466	4103466	4103466	4103466	4103466	2	3057934	4103466	47.50	3057934	1045532	2545189.2	1	2	0.00
+ERX3876932_ERR3864879_T1_contigs	77	40	36	29	25	20	2848635	2836870	2827630	2778943	2698417	2521114	47	484893	2841571	32.73	109559	41891	191514.9	6	21	0
+ERX3876949_ERR3864896_T1_contigs	70	18	16	15	14	11	2683262	2671100	2666432	2659228	2636490	2507149	21	493468	2673387	32.69	251833	78361	283091.3	4	10	0
+ERX3876930_ERR3864877_T1_contigs	84	20	20	18	15	11	2729135	2714469	2714469	2701567	2644913	2501830	25	557153	2717749	32.7	251805	52910	319559.1	4	10	0
 ```
 
-The assembly lengths obtained were all around 2 Mb, which is expected for this species. 
+The assembly lengths obtained were all around 2.7 to 2.8 Mb, which is expected for this species. 
 
-The contiguity of our assemblies seemed excellent, with only up to XX fragments. Some samples even had X fragments assembled, suggesting we managed to mostly reassemble our genomes. 
+The contiguity of our assemblies seemed excellent, with only up to 205 fragments. Some samples even had fewer than 50 fragments assembled, suggesting we managed to mostly reassemble our genomes. 
 
-Isolates XX and XX have more than X contigs. We may want to consider removing these isolates as they may affect our pan-genome calculation in the next section.
+Isolate ERX3876939_ERR3864886_T1_contigs has more than 200 contigs. A this is higher than the number obtained for the rest of our assembled genomeWe may want to consider removing this isolate as it may affect our pan-genome calculation in the next section. However, as the rest of the metrics for this genome look alright, let's leave it in for now.
 
 :::
 :::
@@ -151,14 +154,18 @@ We opened the file `results/assembleBAC/metadata/checkm2_summary.tsv` in _Excel_
 
 ```
 Name	Completeness	Contamination	Completeness_Model_Used	Translation_Table_Used	Coding_Density	Contig_N50	Average_Gene_Length	Genome_Size	GC_Content	Total_Coding_Sequences	Additional_Notes
-CTMA_1402	99.99	1.16	Neural Network (Specific Model)	11	0.865	3058794	310.65996343692865	4118479	0.47	3829	None
+ERX3876905_ERR3864852_T1_contigs	100	0.22	Neural Network (Specific Model)	11	0.841	188199	297.9439072	2743298	0.33	2585	None
+ERX3876907_ERR3864854_T1_contigs	100	0.49	Neural Network (Specific Model)	11	0.839	174940	293.8538879	2900162	0.33	2765	None
+ERX3876908_ERR3864855_T1_contigs	100	0.07	Neural Network (Specific Model)	11	0.841	707553	301.2190213	2717933	0.33	2534	None
+ERX3876909_ERR3864856_T1_contigs	100	0.17	Neural Network (Specific Model)	11	0.84	181628	295.3832534	2854371	0.33	2711	None
+ERX3876929_ERR3864876_T1_contigs	100	0.1	Neural Network (Specific Model)	11	0.841	605766	304.9176136	2675834	0.33	2464	None
 ```
 
 We can see from this that: 
 
-- We achieved ~90% completeness (according to _CheckM2_'s database) for all our samples. 
+- We achieved 100% completeness (according to _CheckM2_'s database) for all our samples. 
 - There was no evidence of strong contamination affecting our assemblies (all ~5% or below).
-- The GC content was XX%, which is what is expected for _Staphylococcus aureus_. 
+- The GC content was 33%, which is what is expected for _Staphylococcus aureus_. 
 :::
 :::
 
diff --git a/materials/29-pathogenwatch.md b/materials/29-pathogenwatch.md
index e2f0a19..3787627 100644
--- a/materials/29-pathogenwatch.md
+++ b/materials/29-pathogenwatch.md
@@ -9,7 +9,17 @@ title: "Pathogenwatch 2"
 
 :::
 
-Assemblies provided in `results/assemblebac/assemblies`
+:::{.callout-exercise}
+#### Analysing Pneumococcal genomes with Pathogenwatch
+
+Whilst you can upload FASTQ files to Pathogenwatch, it's quicker if we work with already assembled genomes.  We've provided pre-processed results for the _S. pneumoniae_ data generated with AssembleBAC in the `preprocessed/` directory which you can use to upload to Pathogenwatch in this exercise.
+
+:::{.callout-answer}
+
+
+:::
+
+
 
 ## Summary
 
diff --git a/materials/31-intro_amr.md b/materials/31-intro_amr.md
index 1a951c5..22534b0 100644
--- a/materials/31-intro_amr.md
+++ b/materials/31-intro_amr.md
@@ -3,15 +3,396 @@ title: "Introduction to Antimicrobial resistance"
 ---
 
 ::: {.callout-tip}
-## Learning Objectives
+#### Learning Objectives
 
-- Understand what AMR is.
+- Recognise the threats posed by antimicrobial resistance (AMR) to public health globally.
+- Apply both command line and web applications to find potential AMR genes in a set of genomes.
+- Recognise the limitations of computational AMR prediction and the importance of comparing results across multiple tools.
 
 :::
 
+## Antimicrobial Resistance (AMR) analysis
+
+Antimicrobial resistance (AMR) is a phenomenon where microorganisms, such as bacteria, evolve in a way that reduces the effectiveness of antimicrobial drugs, including antibiotics.
+This occurs due to the overuse and misuse of these drugs, which exerts selective pressure on the microorganisms. 
+As a result, bacteria may develop or acquire genetic changes that enable them to **survive exposure to antimicrobial agents**, making the drugs less effective or entirely ineffective. 
+AMR poses a significant **global health threat**, as it can lead to infections that are challenging to treat, potentially causing increased morbidity and mortality. 
+Efforts to combat AMR include responsible antibiotic use, developing new drugs, and enhancing infection prevention and control measures.
+
+According to the [WHO](https://www.who.int/publications/i/item/9789241564748), antimicrobial resistance (AMR) has evolved into a global concern for public health. 
+This stems from various harmful bacterial strains developing resistance to antimicrobial medications, including antibiotics. 
+As part of our analysis, we will now focus on identifying AMR patterns connected to our _S. pneumoniae_ isolates.
+
+Numerous software tools have been created to predict the presence of genes linked to AMR in genome sequences. 
+Estimating the function of a gene or protein solely from its sequence is complex, leading to varying outcomes across different software tools. 
+It is advisable to employ multiple tools and compare their findings, thus increasing our confidence in identifying which antimicrobial drugs might be more effective for treating patients infected with the strains we're studying.
+
+In this section we will introduce a workflow aimed at combining the results from various AMR tools into a unified analysis.
+We will compare its results with AMR analysis performed by _Pathogenwatch_.
+
+## _Funcscan_ workflow {#sec-funcscan}
+
+Here, we introduce an automated workflow called **[`nf-core/funcscan`](https://nf-co.re/funcscan/1.1.2)** (@fig-funcscan), which uses _Nextflow_ to manage all the software and analysis steps (see information box above).
+This pipeline uses five different AMR screening tools: **[ABRicate](https://github.com/tseemann/abricate)**, **[AMRFinderPlus (NCBI Antimicrobial Resistance Gene Finder)](https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/AMRFinder/)**, **[fARGene (Fragmented Antibiotic Resistance Gene idENntifiEr)](https://github.com/fannyhb/fargene)**, **[RGI (Resistance Gene Identifier)](https://card.mcmaster.ca/analyze/rgi)**, and **[DeepARG](https://readthedocs.org/projects/deeparg/)**.
+This is convenient, as we can obtain the results from multiple approaches in one step. 
+
+![Overview of the `nf-core/funcscan` workflow. In our case we will run the "Antimicrobial Resistance Genes (ARGs)" analysis, shown in yellow. Image source: https://nf-co.re/funcscan/1.1.2](https://raw.githubusercontent.com/nf-core/funcscan/1.1.2/docs/images/funcscan_metro_workflow.png){#fig-funcscan}
+
+This pipeline requires us to prepare a samplesheet CSV file with information about the samples we want to analyse. 
+Two columns are required: 
+
+- `sample` --> a sample name of our choice (we will use the same name that we used for the assembly).
+- `fasta` --> the path to the FASTA file corresponding to that sample.
+
+You can create this file using a spreadsheet software such as _Excel_, making sure to save the file as a CSV.
+Here is an example of our samplesheet, which we saved in a file called `samplesheet_funcscan.csv`: 
+
+```
+sample,fasta
+ERX1265396_ERR1192012_T1,preprocessed/assemblebac/assemblies/ERX1265396_ERR1192012_T1_contigs.fa
+ERX1265488_ERR1192104_T1,preprocessed/assemblebac/assemblies/ERX1265488_ERR1192104_T1_contigs.fa
+ERX1501202_ERR1430824_T1,preprocessed/assemblebac/assemblies/ERX1501202_ERR1430824_T1_contigs.fa
+ERX1501203_ERR1430825_T1,preprocessed/assemblebac/assemblies/ERX1501203_ERR1430825_T1_contigs.fa
+ERX1501204_ERR1430826_T1,preprocessed/assemblebac/assemblies/ERX1501204_ERR1430826_T1_contigs.fa
+ERX1501205_ERR1430827_T1,preprocessed/assemblebac/assemblies/ERX1501205_ERR1430827_T1_contigs.fa
+ERX1501206_ERR1430828_T1,preprocessed/assemblebac/assemblies/ERX1501206_ERR1430828_T1_contigs.fa
+ERX1501207_ERR1430829_T1,preprocessed/assemblebac/assemblies/ERX1501207_ERR1430829_T1_contigs.fa
+ERX1501208_ERR1430830_T1,preprocessed/assemblebac/assemblies/ERX1501208_ERR1430830_T1_contigs.fa
+ERX1501212_ERR1430834_T1,preprocessed/assemblebac/assemblies/ERX1501212_ERR1430834_T1_contigs.fa
+```
+
+Once we have the samplesheet ready, we can run the `nf-core/funcscan` workflow using the following commands:
+
+```bash
+# activate the environment
+mamba activate nextflow
+
+# create output directory
+mkdir results/funcscan
+
+# run the pipeline
+nextflow run nf-core/funcscan -profile singularity \
+  --max_memory 16.GB --max_cpus 8 \
+  --input samplesheet_funcscan.csv \
+  --outdir results/funcscan \
+  --run_arg_screening \
+  --arg_skip_deeparg
+```
+
+The options we used are: 
+
+- `-profile singularity` - indicates we want to use the _Singularity_ program to manage all the software required by the pipeline (another option is to use `docker`). See [Data & Setup](../../setup.md) for details about their installation.
+- `--max_memory` and `--max_cpus` - sets the available RAM memory and CPUs. You can check this with the commands `free -h` and `nproc --all`, respectively.
+- `--input` - the samplesheet with the input files, as explained above.
+- `--outdir` - the output directory for the results. 
+- `--run_arg_screening` - indicates we want to run the "antimicrobial resistance gene screening tools". There are also options to run antimicrobial peptide and biosynthetic gene cluster screening ([see documentation](https://nf-co.re/funcscan/1.1.2/parameters#screening-type-activation)).
+- `--arg_skip_deeparg` - this skips a step in the analysis which uses the software _DeepARG_. We did this simply because this software takes a very long time to run. But in a real analysis you may want to leave this option on. 
+
+While the pipeline runs, you will get a progress printed on the screen, and then a message once it finishes. 
+Here is an example from our samples:  
+
+```
+* Software dependencies
+  https://github.com/nf-core/funcscan/blob/master/CITATIONS.md
+------------------------------------------------------
+executor >  slurm (1371)
+[cf/4d7565] process > NFCORE_FUNCSCAN:FUNCSCAN:INPUT_CHECK:SAMPLESHEET_CHECK (samplesheet_funcscan.csv)   [100%] 1 of 1 ✔
+[-        ] process > NFCORE_FUNCSCAN:FUNCSCAN:GUNZIP_FASTA_PREP                                          -
+[f5/5767a9] process > NFCORE_FUNCSCAN:FUNCSCAN:BIOAWK (ERX1501218_ERR1430840_T1)                          [100%] 50 of 50 ✔
+[a9/164c9c] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:AMRFINDERPLUS_UPDATE (update)                          [100%] 1 of 1, cached: 1 ✔
+[8f/6d023e] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:AMRFINDERPLUS_RUN (ERX1501260_ERR1430882_T1)           [100%] 50 of 50 ✔
+[6a/c9f0ce] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_AMRFINDERPLUS (ERX1501260_ERR1430882_T1) [100%] 50 of 50 ✔
+[da/67f664] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:FARGENE (ERX1501226_ERR1430848_T1)                     [100%] 500 of 500 ✔
+[79/73f7b3] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_FARGENE (ERX1501226_ERR1430848_T1)       [ 99%] 519 of 520
+[0f/d47de8] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:RGI_MAIN (ERX1501263_ERR1430885_T1)                    [100%] 50 of 50 ✔
+[21/06df72] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_RGI (ERX1501263_ERR1430885_T1)           [100%] 50 of 50 ✔
+[eb/b67eca] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:ABRICATE_RUN (ERX1501218_ERR1430840_T1)                [100%] 50 of 50 ✔
+[d6/e5fdd4] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_ABRICATE (ERX1501218_ERR1430840_T1)      [100%] 50 of 50 ✔
+[-        ] process > NFCORE_FUNCSCAN:FUNCSCAN:ARG:HAMRONIZATION_SUMMARIZE                                -
+[-        ] process > NFCORE_FUNCSCAN:FUNCSCAN:CUSTOM_DUMPSOFTWAREVERSIONS                                -
+[-        ] process > NFCORE_FUNCSCAN:FUNCSCAN:MULTIQC                                                    -
+```
+
+
+### `funcscan` outputs
+
+The main output of interest from this pipeline is a CSV file, which contains a summary of the results from all the AMR tools used 
+This summary is produced by a software called [_hAMRonization_](https://github.com/pha4ge/hAMRonization) and the corresponding CSV file is saved in `results/funcscan/reports/hamronization_summarize/hamronization_combined_report.tsv`. 
+You can open this file using any standard spreadsheet software such as _Excel_ (@fig-hamronization). 
+
+This file is quite large, containing many columns and rows (we detail these columns in the information box below). 
+The easiest way to query this table is to filter the table based on the column "antimicrobial_agent" to remove rows where no AMR gene was detected (@fig-hamronization). 
+This way you are left with only the results which were positive for the AMR analysis. 
+
+![To analyse the table output by _hAMRonization_ in _Excel_ you can go to "Data" --> "Filter". Then, select the dropdown button on the "antimicrobial_agent" column and untick the box "blank". This will only show the genes associated with resistance to antimicrobial drugs.](images/amr_hamronization.svg){#fig-hamronization}
+
+:::{.callout-note collapse=true}
+#### _hAMRonization_ report columns (click to expand)
+
+TODO
+
+:::
+
+
+You can also look at the detailed results of each individual tool, which can be found in the directory `results/funcscan/arg`. 
+This directory contains sub-directories for each of the 5 AMR tools used (in our case only 4 folders, because we skipped the _DeepARG_ step):
+
+```bash
+ls results/funcscan/arg
+```
+
+```
+abricate  amrfinderplus  fargene  hamronization  rgi
+```
+
+For each individual tool's output folder shown above, there is a report, which is associated with the predicted AMRs for each of our samples. 
+In most cases, the report is in tab-delimited TSV format, which can be opened in a standard spreadsheet software such as _Excel_. 
+For instance, the AMR report from _Abricate_ for one of our samples looks like this: 
+
+```bash
+less -S results/funcscan/arg/abricate/isolate02/isolate02.txt
+```
+
+```
+#FILE            SEQUENCE  START    END      STRAND  GENE      COVERAGE     COVERAGE_MAP     GAPS  %COVERAGE  %IDENTITY  DATABASE  ACCESSION    PRODUCT                                                         RESISTANCE
+isolate02.fasta  contig_2  1696     2623     -       blaPER-7  1-927/927    ========/======  1/1   100.00     99.89      ncbi      NG_049966.1  class A extended-spectrum beta-lactamase PER-7                  CEPHALOSPORIN
+isolate02.fasta  contig_2  4895     5738     -       sul1      1-840/840    ========/======  4/4   100.00     98.93      ncbi      NG_048091.1  sulfonamide-resistant dihydropteroate synthase Sul1             SULFONAMIDE
+isolate02.fasta  contig_2  6243     7036     -       aadA2     1-792/792    ========/======  2/2   100.00     99.50      ncbi      NG_047343.1  ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA2  STREPTOMYCIN
+isolate02.fasta  contig_3  966452   967081   +       catB9     1-630/630    ===============  0/0   100.00     99.84      ncbi      NG_047621.1  type B-5 chloramphenicol O-acetyltransferase CatB9              CHLORAMPHENICOL
+isolate02.fasta  contig_4  778899   780023   +       varG      1-1125/1125  ===============  0/0   100.00     100.00     ncbi      NG_057468.1  VarG family subclass B1-like metallo-beta-lactamase             CARBAPENEM
+isolate02.fasta  contig_4  2573875  2574348  -       dfrA1     1-474/474    ===============  0/0   100.00     100.00     ncbi      NG_047676.1  trimethoprim-resistant dihydrofolate reductase DfrA1            TRIMETHOPRIM
+isolate02.fasta  contig_7  4178     5099     -       mph(A)    1-921/921    ========/======  1/1   100.00     99.35      ncbi      NG_047986.1  Mph(A) family macrolide 2'-phosphotransferase                   MACROLIDE
+isolate02.fasta  contig_7  6594     8069     +       msr(E)    1-1476/1476  ===============  0/0   100.00     100.00     ncbi      NG_048007.1  ABC-F type ribosomal protection protein Msr(E)                  MACROLIDE
+isolate02.fasta  contig_7  8125     9009     +       mph(E)    1-885/885    ===============  0/0   100.00     100.00     ncbi      NG_064660.1  Mph(E) family macrolide 2'-phosphotransferase                   MACROLIDE
+isolate02.fasta  contig_7  131405   132197   +       aadA2     1-792/792    ========/======  1/1   100.00     99.62      ncbi      NG_047343.1  ANT(3'')-Ia family aminoglycoside nucleotidyltransferase AadA2  STREPTOMYCIN
+```
+
+For this sample there were several putative AMR genes detected by _Abricate_, with their associated drugs. 
+These genes were identified based on their similarity with annotated sequences from the NCBI database.
+For example, the gene _varG_ was detected in our sample, matching the NCBI accession [NG_057468.1](https://www.ncbi.nlm.nih.gov/nuccore/NG_057468.1).
+This is annotated as as a reference for antimicrobial resistance, in this case to the drug "CARBAPENEM".
+
+:::{.callout-note}
+#### Command line trick <i class="fa-solid fa-wand-magic-sparkles"></i>
+
+Here is a trick using standard commands to count how many times each drug was identified by `funcscan`:
+
+```bash
+cat results/funcscan/reports/hamronization_summarize/hamronization_combined_report.tsv | cut -f 10 | sort | uniq -c
+```
+
+- `cat` prints the content of the file
+- `cut` extracts the 10th column from the file
+- `sort` and `uniq -c` are used in combination to count unique output values
+
+The result of the above command is: 
+
+```
+ 9 CARBAPENEM
+ 8 CEPHALOSPORIN
+ 8 CHLORAMPHENICOL
+27 MACROLIDE
+13 QUATERNARY AMMONIUM
+10 STREPTOMYCIN
+ 1 SULFONAMIDE
+10 TRIMETHOPRIM
+```
+:::
+
+
+## AMR with _Pathogenwatch_
+
+_Pathogenwatch_ also performs AMR prediction using its own [algorithm and curated gene sequences](https://cgps.gitbook.io/pathogenwatch/technical-descriptions/antimicrobial-resistance-prediction/pw-amr). 
+The results from this analysis can be seen from the individual sample report, or summarised in the collection view.
+
+![AMR analysis from _Pathogenwatch_. The summary table (top) can be accessed from the sample collections view, by selecting "Antibiotics" from the drop-down on the top-left. The table summarises resistance to a range of antibiotics (red = resistant; yellow = intermediate). More detailed results can be viewed for each individual sample by clicking on its name and opening the sample report (bottom).](images/amr_pathogenwatch.png){#fig-amr_pathogenwatch}
+
+
+## Which AMR do my isolates have?
+
+At this stage you may notice that different tools will give you a different answer to this question and it is therefore recommended to **compare the results across multiple tools**.
+For example, _Pathogenwatch_ generally detects AMR for comparatively more antimicrobial drugs compared to the `funcscan` analysis. 
+However, some of the drugs detected by `funcscan` were either not reported by _Pathogenwatch_ (possibly because they are not part of its database) or have a disagreeing result. 
+
+Let's take a specific example. 
+_Pathogenwatch_ determined that none of our isolates were resistant to Streptomycin. 
+However, in the _hAMRonization_ summary table (output by `funcscan`) we can see that this drug was reported for several of our samples. 
+Upon closer inspection, however, we can see that we only had partial matches to the reference NCBI sequence ([WP_001206356.1](https://www.ncbi.nlm.nih.gov/protein/WP_001206356.1)), or in the case of one sample with a higher match the sequence identity was less than 100% (table below, showing some of the columns from the _hAMRonization_ table).
+
+```
+input_file_name               gene_symbol   reference_accession   antimicrobial_agent   coverage_percentage   sequence_identity  
+isolate01.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate02.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate02.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate04.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate05.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate06.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate07.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate08.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate09.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          68.44                 100                
+isolate10.tsv.amrfinderplus   aadA2         WP_001206356.1        STREPTOMYCIN          100                   92.05              
+```
+
+It is also important to take into consideration our earlier [assembly quality assessments](../02-assembly/04-assembly_quality.md) as they may result in **false negative results**.
+For example, we can see that "isolate05" has the lowest AMR detection of all samples. 
+However, this was the sample with the lowest genome coverage (only 21x) and with a resulting highly fragmented genome (229 fragments). 
+Therefore, it is very possible that we missed parts of its genome during assembly, and that some of those contained AMR genes or plasmids. 
+
+In conclusion, always be critical of the analysis of your results at this stage, comparing the output from different tools as well as considering the quality of your assemblies. 
+Ultimately, the safest way to assess AMR is with **experimental validation**, by testing those strains against the relevant antimicrobial agents in the lab. 
+However, computational analysis such as what we did can help inform these experiments and treatment decisions.
+
+
+## Exercises
+
+<i class="fa-solid fa-triangle-exclamation" style="color: #1e3050;"></i> 
+For these exercises, you can either use the dataset we provide in [**Data & Setup**](../../setup.md), or your own data. 
+You also need to have completed the genome assembly exercise in @sec-ex-assembly.
+
+:::{.callout-exercise}
+#### Funcscan workflow
+
+Run the `nf-core/funcscan` workflow on the assembled genomes for your samples. 
+
+- Using _Excel_, create a samplesheet CSV file for your samples, required as input for this pipeline. See @sec-funcscan if you need to revise the format of this samplesheet.
+- Activate the software environment: `mamba activate nextflow`.
+- Fix the script provided in `scripts/07-amr.sh`.
+- Run the script using `bash scripts/07-amr.sh`.
+
+Once the workflow is running it will print a progress message on the screen. 
+This will take a while to run, so you can do the next exercise, and then continue with this one. 
+
+Once the analysis finishes, open the output file `results/funcscan/reports/hamronization_summarize/hamronization_combined_report.tsv` in _Excel_. 
+Answer the following questions: 
+
+- Did this analysis find evidence for antimicrobial resistance to any drugs?
+- Did all your samples show evidence for AMR?
+
+
+:::{.callout-answer}
+
+We created a samplesheet for our samples in _Excel_, making sure to "Save As..." CSV format. 
+The raw file looks like this: 
+
+```
+sample,fasta
+CTMA_1402,results/assemblies/CTMA_1402.fasta
+CTMA_1421,results/assemblies/CTMA_1421.fasta
+CTMA_1427,results/assemblies/CTMA_1427.fasta
+CTMA_1432,results/assemblies/CTMA_1432.fasta
+CTMA_1473,results/assemblies/CTMA_1473.fasta
+```
+
+The fixed script is: 
+
+```bash
+#!/bin/bash
+
+# create output directory
+mkdir results/funcscan
+
+# run the pipeline
+nextflow run nf-core/funcscan -profile singularity \
+  --max_memory 16.GB --max_cpus 8 \
+  --input samplesheet_funcscan.csv \
+  --outdir results/funcscan \
+  --run_arg_screening \
+  --arg_skip_deeparg
+```
+
+While the script was running we got a progress of the analysis printed on the screen. 
+Once it finished we got a message like this (yours will look slightly different): 
+
+```
+Completed at: 12-Aug-2023 12:24:03
+Duration    : 44m 54s
+CPU hours   : 3.0
+Succeeded   : 277
+```
+
+We opened the _hAMRonization_ output report file in _Excel_ and filtered it for the column "antimicrobial_agent". 
+We identified the following (only a few columns shown for simplicity):
+
+```
+input_file_name              gene_symbol  antimicrobial_agent
+CTMA_1402.tsv.amrfinderplus  aph(6)-Id    STREPTOMYCIN
+CTMA_1402.tsv.amrfinderplus  catB9        CHLORAMPHENICOL
+CTMA_1402.tsv.amrfinderplus  dfrA1        TRIMETHOPRIM
+CTMA_1402.tsv.amrfinderplus  floR         CHLORAMPHENICOL/FLORFENICOL
+CTMA_1402.tsv.amrfinderplus  sul2         SULFONAMIDE
+CTMA_1402.tsv.amrfinderplus  varG         CARBAPENEM
+CTMA_1421.tsv.amrfinderplus  aph(3'')-Ib  STREPTOMYCIN
+CTMA_1421.tsv.amrfinderplus  aph(6)-Id    STREPTOMYCIN
+CTMA_1421.tsv.amrfinderplus  catB9        CHLORAMPHENICOL
+CTMA_1421.tsv.amrfinderplus  dfrA1        TRIMETHOPRIM
+CTMA_1421.tsv.amrfinderplus  floR         CHLORAMPHENICOL/FLORFENICOL
+CTMA_1421.tsv.amrfinderplus  sul2         SULFONAMIDE
+CTMA_1421.tsv.amrfinderplus  varG         CARBAPENEM
+CTMA_1427.tsv.amrfinderplus  aph(3'')-Ib  STREPTOMYCIN
+CTMA_1427.tsv.amrfinderplus  aph(6)-Id    STREPTOMYCIN
+CTMA_1427.tsv.amrfinderplus  catB9        CHLORAMPHENICOL
+CTMA_1427.tsv.amrfinderplus  dfrA1        TRIMETHOPRIM
+CTMA_1427.tsv.amrfinderplus  floR         CHLORAMPHENICOL/FLORFENICOL
+CTMA_1427.tsv.amrfinderplus  sul2         SULFONAMIDE
+CTMA_1427.tsv.amrfinderplus  varG         CARBAPENEM
+CTMA_1432.tsv.amrfinderplus  aph(3'')-Ib  STREPTOMYCIN
+CTMA_1432.tsv.amrfinderplus  aph(6)-Id    STREPTOMYCIN
+CTMA_1432.tsv.amrfinderplus  catB9        CHLORAMPHENICOL
+CTMA_1432.tsv.amrfinderplus  dfrA1        TRIMETHOPRIM
+CTMA_1432.tsv.amrfinderplus  floR         CHLORAMPHENICOL/FLORFENICOL
+CTMA_1432.tsv.amrfinderplus  sul2         SULFONAMIDE
+CTMA_1432.tsv.amrfinderplus  varG         CARBAPENEM
+CTMA_1473.tsv.amrfinderplus  aph(3'')-Ib  STREPTOMYCIN
+CTMA_1473.tsv.amrfinderplus  aph(6)-Id    STREPTOMYCIN
+CTMA_1473.tsv.amrfinderplus  catB9        CHLORAMPHENICOL
+CTMA_1473.tsv.amrfinderplus  dfrA1        TRIMETHOPRIM
+CTMA_1473.tsv.amrfinderplus  floR         CHLORAMPHENICOL/FLORFENICOL
+CTMA_1473.tsv.amrfinderplus  sul2         SULFONAMIDE
+CTMA_1473.tsv.amrfinderplus  varG         CARBAPENEM
+```
+
+All 5 of our samples show evidence of AMR to different antimicrobial drugs. 
+All of them are quite similar, with resistance to a similar set of drugs. 
+
+:::
+:::
+
+:::{.callout-exercise}
+#### AMR with _Pathogenwatch_
+
+Following from the _Pathogenwatch_ exercise in @sec-ex-pathogenwatch, open the "Ambroise 2023" collection that you created and answer the following questions:
+
+- Open the antibiotics summary table.
+- Do all your samples have evidence for antibiotic resistance?
+- If any samples have resistance to much fewer antibiotics compared to the others, do you think this could be related to assembly quality?
+- How do the results from _Pathogenwatch_ compare to those from `nf-core/funcscan`?
+
+How do the results compare with Pathogenwatch?
+
+:::{.callout-answer}
+
+We can open the "Antibiotics" table from the top-left dropdown menu, as shown in the image below. 
+
+![](images/pathogenwatch-ambroise04.png)
+
+We can see that _Pathogenwatch_ identified resistance to several antibiotics. 
+All samples are similar, except "1432" doesn't have resistance to furazolidone and nitrofurantoin. 
+This sample had high completeness, according to _CheckM2_. 
+However, it was also the sample with the lowest sequencing coverage (from our analysis during the genome assembly in @sec-ex-assembly), so this could be a "false negative" result due to the inability to cover some of the genes that confer AMR.
+
+All of the drugs identified by _funcscan_ were also identified by _Pathogenwatch_ (note that sulfamethoxazole and sulfisoxazole identified by _Pathogenwatch_ are both sulfonamide-derived drugs, reported by funcscan). 
+However, _Pathogenwatch_ identified resistance to several other drugs: ampicilin, ceftazidime, cephalosporins, ceftriazone, cefepime, furazolidone and nitrofurantoin. 
+:::
+:::
+
 ## Summary
 
 ::: {.callout-tip}
-## Key Points
+#### Key Points
 
+- AMR poses significant global public health threats by diminishing the effectiveness of antibiotics, making it challenging to treat infectious diseases effectively.
+- AMR software aims to identify specific genes or mutations known to confer resistance to antimicrobial agents. These tools compare input genetic sequences to known resistance genes or patterns in their associated databases.
+- The `nf-core/funcscan` workflow performs AMR analysis using several software tools and producing a summary of their results as a CSV file.
+- _Pathogenwatch_ is a more user-friendly application, which performs AMR using its own curated database. 
+- AMR prediction can result in false results (either false positives or false negatives). One way to overcome this limitation is to compare the results from multiple tools and, whenever possible, complement it with validation assays in the lab.
 :::
\ No newline at end of file
diff --git a/materials/appendices/02-course_software.md b/materials/appendices/02-course_software.md
index 828f50f..09e2ed6 100644
--- a/materials/appendices/02-course_software.md
+++ b/materials/appendices/02-course_software.md
@@ -60,6 +60,10 @@ Sequence Type assignment
 
 summarises and creates visualizations for outputs from `fastQC`, `fastp` and `MultiQC`
 
+### [Nextflow](https://www.nextflow.io/)
+
+### [pairsnp](https://github.com/gtonkinhill/pairsnp)
+
 ### [Panaroo](https://gtonkinhill.github.io/panaroo/)
 
 ### [Quast](https://quast.sourceforge.net/)
@@ -70,10 +74,14 @@ assembly metrics
 
 downsamples fastq files to 100X by default (Optional)
 
+### [remove_blocks_from_aln.py](https://github.com/sanger-pathogens/remove_blocks_from_aln)
+
 ### [SAMtools](https://sourceforge.net/projects/samtools/files/samtools/)
 
 sorts and indexes alignments 
 
+### [seqtk](https://github.com/lh3/seqtk)
+
 ### [Shovill](https://github.com/tseemann/shovill) 
 
 *de novo* genome assembly