update 151123

cambiotraining · Nov 15, 2023 · 98c47b0 · 98c47b0
1 parent 572d4f3
commit 98c47b0
Showing 11 changed files with 460 additions and 66 deletions.
diff --git a/materials/02-know_your_bug.md b/materials/02-know_your_bug.md
@@ -13,6 +13,12 @@ Before starting your analysis, it's important to understand the characteristics
 
 ![Flowchart showing different approaches to building a bacterial phylogenetic tree](images/know_your_bug.jpeg)
 
+This week, we're going to work with three different bacterial species that each require a different analysis path through the diagram above.
+
+- _Mycobacterium tuberculosis_ - a monoclonal (i.e. relatively little genomic diversity) organism that doesn't recombine so we'll build a phylogenetic tree by mapping to a reference.
+- _Staphylococcus aureus_ - a diverse organism with low levels of recombination. We're going to be working with a dataset consisting of more than a single lineage so we'll take the approach of building a core gene alignment to use to infer our phylogenetic tree.
+- _Streptococcus pneumoniae_ - a diverse organism that frequently recombines. As the dataset consists of isolates from a single lineage/serotype, we're going to map to a reference and identify and remove recombinant regions before building our phylogenetic tree.
+
 ## Summary
 
 ::: {.callout-tip}

diff --git a/materials/03-nextflow.md b/materials/03-nextflow.md
@@ -100,8 +100,6 @@ Nextflow provides a separation between the pipeline’s functional logic and the
 
 Nextflow provides out-of-the-box support for major batch schedulers and cloud platforms such as Sun Grid Engine, SLURM job scheduler, AWS Batch service and Kubernetes. A full list can be found [here](https://www.nextflow.io/docs/latest/executor.html).
 
-## Nextflow Tower/Seqera
-
 ## Snakemake
 
 In this tutorial we've focused on Nextflow but many people in the bioinformatics community use Snakemake.  Similar to Nextflow, the Snakemake workflow management system is a tool for creating reproducible and scalable data analyses. The main difference is that workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.

diff --git a/materials/05-intro_tb.md b/materials/05-intro_tb.md
@@ -21,7 +21,7 @@ Increasingly, *M. tuberculosis* is resistant to many of the frontline antimicrob
 
 ## Course dataset
 
-We will be analysing a dataset of XX Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course and for time, we're only going to analyse 50 genomes from the dataset.
+We will be analysing a dataset of Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course, mainly to save time,  we're only going to analyse 50 genomes from the dataset.
 
 ## MTBC ancestral reference sequence
 

diff --git a/materials/07-bacqc.md b/materials/07-bacqc.md
@@ -304,6 +304,18 @@ With respect to potential species contamination, no samples contained more than
 :::
 :::
 
+:::{.callout-warning}
+
+**The Nextflow work directory**
+
+Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory.  This means that if, for whatever reason, the pipeline doesn't finish successfully you can resume it.  However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it:
+
+```bash
+rm -rf work
+```
+
+:::
+
 ## Summary
 
 ::: {.callout-tip}

diff --git a/materials/09-bactmap.md b/materials/09-bactmap.md
@@ -161,29 +161,6 @@ This section of the report shows the software run as part of nf-core/bactmap and
 
 ![nf-core/bactmap MultiQC software versions](images/bactmap_software_versions.png)
 
-### The pseudogenomes directory
-
-This directory contains the files that are most useful for our downstream analyses.  Change to the directory and list the files:
-
-```bash
-cd bactmap_results/pseudogenomes
-
-ls
-```
-
-You will see the pseudogenome fasta files (a version of the reference file where the sample variants have been inserted and low-quality or missing data has been masked) for each sample, an alignment of all the samples  and the reference sequence (`aligned_pseudogenomes.fas`), and a tsv file of genomes removed from the complete alignment due to poor mapping (`low_quality_pseudogenomes.tsv`):
-
-```bash
-aligned_pseudogenomes.fas
-low_quality_pseudogenomes.tsv
-```
-
-Let's check to see if any of our samples were removed from our alignment:
-
-```bash
-cat low_quality_pseudogenomes.tsv
-```
-This returns an empty file; all of our samples mapped well enough to the reference to be included in our complete alignment.  That means we can proceed to the next step of our analysis: phylogenetic tree inference.
 
 ## Check how much of the reference was mapped
 
@@ -218,14 +195,6 @@ remove_blocks_from_aln.py -a aligned_pseudogenomes.fas -t ../../../resources/mas
 
 bash scripts/04-mask_pseudogenome.sh
 ```
-### The work directory
-
-Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory.  This means that if, for whatever reason, the pipeline doesn't finish successfully you can resume it.  However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it:
-
-```bash
-rm -rf work
-```
-
 ## Summary
 
 ::: {.callout-tip}

diff --git a/materials/17-intro_staph.md b/materials/17-intro_staph.md
@@ -9,6 +9,10 @@ title: "Introduction to *Staphylococcus aureus*"
 
 :::
 
+## Course dataset
+
+van Tonder 2023
+
 ## Summary
 
 ::: {.callout-tip}

diff --git a/materials/18-de_novo_assembly.md b/materials/18-de_novo_assembly.md
@@ -6,7 +6,6 @@ title: "de novo Assembly"
 ## Learning Objectives
 
 - Understand what *de novo* genome assembly is and how it differs from reference-based assembly.
-- Assemble sequence data with `assembleBAC`
 
 :::
 

diff --git a/materials/21-assembly_qc.md b/materials/21-assembly_qc.md
@@ -37,18 +37,17 @@ Let's now turn to some of the other metrics to help us assess our assemblies' qu
 
 ## Contiguity
 
-One of the outputs from running assembleBAC is a summary file containing the `Quast` outputs for each sample.  This file can be found in `results/assembleBAC/metadata/transposed_report.csv`.
+One of the outputs from running assembleBAC is a summary file containing the `Quast` outputs for each sample.  This file can be found in `results/assembleBAC/metadata/transposed_report.tsv`.
 
-You can open it with a spreadsheet software such as _Excel_ from our file browser <i class="fa-solid fa-folder"></i>. 
-Alternatively you can print it directly on the terminal in a nice format using the `column` command, as shown here for the samples we've been using as an example ((for brevity, we only show some columns of most interest):): 
-
-```bash
-column -t -s "," results/assembleBAC/metadata/transposed_report.csv
-```
+You can open it with a spreadsheet software such as _Excel_ from our file browser <i class="fa-solid fa-folder"></i> (for brevity, we only show the columns of most interest): 
 
 ```
 Assembly	# contigs (>= 0 bp)	# contigs (>= 1000 bp)	Total length (>= 0 bp)	Largest contig	N50
-CTMA_1432	2	2	4103466	3057934	3057934
+ERX3876932_ERR3864879_T1_contigs	77	40	2848635	484893	109559
+ERX3876949_ERR3864896_T1_contigs	70	18	2683262	493468	251833
+ERX3876930_ERR3864877_T1_contigs	84	20	2729135	557153	251805
+ERX3876908_ERR3864855_T1_contigs	45	13	2717933	792936	707553
+ERX3876945_ERR3864892_T1_contigs	30	9	2670961	1351816	1351816
 ```
 
 The columns are: 
@@ -61,38 +60,40 @@ The columns are:
 - **N50** - a metric indicating the length of the shortest fragment, from the group of fragments that together represent at least 50% of the total genome. A higher N50 value suggests better contig lengths.
 
 To interpret these statistics, it helps to compare them with other well-assembled _Staphylococcus aureus_ genomes. 
-For example, let's take the first MRSA genome that was sequenced, [XXXX]('NCBI link'), as our reference for comparison. 
-This genome is X Mb long and is composed of a single chromosome.
+For example, let's take the first MRSA genome that was sequenced, [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/), as our reference for comparison. 
+This genome is 2.8 Mb long and is composed of a single chromosome.
 
-We can see that all of our assemblies reached a total length of around XX to XX Mb, which matches the expected length from our reference genome.
+We can see that all of our assemblies reached a total length of around 2.7 to 2.9 Mb, which matches the expected length from our reference genome.
 This indicates that we managed to assemble most of the expected genome. 
 However, we can see that there is a variation in the number of fragments in the final assemblies (i.e. their contiguity). 
 For instance, isolates X and X were assembled to a small number of fragments each, suggesting good assemblies. For several other isolates our assemblies were more fragmented, in particular X and X, which had hundreds of fragments. 
 This indicates less contiguous sequences.
 
 ## Completeness
 
-We now turn to assessing **genome completeness**, i.e. whether we managed to recover most of the known _Staphylococcus aureus_ genome, or whether we have large fractions missing. We can assess this by using [`CheckM2`](https://github.com/chklovski/CheckM2) which was run as part of the assembleBAC pipeline. This tool assesses the completeness of the assembled genomes based on other similar organisms in public databases, in addition to contamination scores. The output file from assembleBAC summarising the `CheckM2` results is a tab-delimited file called `checkm2_summary.tsv`. This file can be found in `results/assembleBAC/metadata/` and can be opened in a spreadsheet software such as _Excel_. Here is an example result (for brevity, we only show some columns of most interest):
+We now turn to assessing **genome completeness**, i.e. whether we managed to recover most of the known _Staphylococcus aureus_ genome, or whether we have large fractions missing. We can assess this by using [`CheckM2`](https://github.com/chklovski/CheckM2) which was run as part of the assembleBAC pipeline. This tool assesses the completeness of the assembled genomes based on other similar organisms in public databases, in addition to contamination scores. The output file from assembleBAC summarising the `CheckM2` results is a tab-delimited file called `checkm2_summary.tsv`. This file can be found in `results/assembleBAC/metadata/` and can be opened in a spreadsheet software such as _Excel_. Here is an example result (for brevity, we only show the columns of most interest):
 
 ```
-Name       Completeness  Contamination  Genome_Size  GC_Content  Total_Coding_Sequences
-isolate01  92.51         5.19           4192535      0.48        5375
+Name	Completeness	Contamination	Genome_Size	GC_Content	Total_Coding_Sequences
+ERX3876905_ERR3864852_T1_contigs	100	0.22	2743298	0.33	2585
+ERX3876907_ERR3864854_T1_contigs	100	0.49	2900162	0.33	2765
+ERX3876908_ERR3864855_T1_contigs	100	0.07	2717933	0.33	2534
+ERX3876909_ERR3864856_T1_contigs	100	0.17	2854371	0.33	2711
+ERX3876929_ERR3864876_T1_contigs	100	0.1	2675834	0.33	2464
 ```
 These columns indicate:
 
 - **Name** - our sample name.
 - **Completeness** - how complete our genome was inferred to be as a percentage; this is based on the machine learning models used and the organisms present in the database.
 - **Contamination** - the percentage of the assembly estimated to be contaminated with other organisms (indicating our assembly isn't "pure"). 
 - **Genome_Size** - how big the genome is estimated to be, based on other similar genomes present in the database. 
-  The [XXXX]('NCBI link') is X Mb in total, so these values make sense. 
+  The [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/) is 2.8 Mb in total, so these values make sense. 
 - **GC_Content** - the percentage of G's and C's in the genome, which is relatively constant within a species. 
-  The _S. aureus_ GC content is X%, so again these values make sense.
+  The _S. aureus_ GC content is approximately 33%, so again these values make sense.
 - **Total_Coding_Sequences** - the total number of coding sequences (genes) that were identified by _CheckM2_. 
-  The [reference genome]('NCBI link') indicates  annotated genes, so the values obtained could be overestimated.
-
-From this analysis, we can assess that our genome assemblies are good quality, with around 90% of the genome assembled. 
-Despite earlier assessing that some of the isolates have more fragmented assemblies (e.g. isolates 5 and 9), they still have >85% completeness according to _CheckM2_. 
+  The [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/) indicates  annotated genes, so the values obtained could be overestimated.
 
+From this analysis, we can assess that our genome assemblies are good quality, with 100% of the genome assembled for all our isolates. 
 It is worth noting that the assessment from _CheckM2_ is an approximation based on other genomes in the database. 
 In diverse species such as _S. aureus_ the completeness may be underestimated, because each individual strain will only carry part of the _pangenome_ for that species.
 
@@ -123,14 +124,16 @@ We opened this table in _Excel_ and this is what we obtained:
 
 ```
 Assembly	# contigs (>= 0 bp)	# contigs (>= 1000 bp)	# contigs (>= 5000 bp)	# contigs (>= 10000 bp)	# contigs (>= 25000 bp)	# contigs (>= 50000 bp)	Total length (>= 0 bp)	Total length (>= 1000 bp)	Total length (>= 5000 bp)	Total length (>= 10000 bp)	Total length (>= 25000 bp)	Total length (>= 50000 bp)	# contigs	Largest contig	Total length	GC (%)	N50	N90	auN	L50	L90	# N's per 100 kbp
-CTMA_1432	2	2	2	2	2	2	4103466	4103466	4103466	4103466	4103466	4103466	2	3057934	4103466	47.50	3057934	1045532	2545189.2	1	2	0.00
+ERX3876932_ERR3864879_T1_contigs	77	40	36	29	25	20	2848635	2836870	2827630	2778943	2698417	2521114	47	484893	2841571	32.73	109559	41891	191514.9	6	21	0
+ERX3876949_ERR3864896_T1_contigs	70	18	16	15	14	11	2683262	2671100	2666432	2659228	2636490	2507149	21	493468	2673387	32.69	251833	78361	283091.3	4	10	0
+ERX3876930_ERR3864877_T1_contigs	84	20	20	18	15	11	2729135	2714469	2714469	2701567	2644913	2501830	25	557153	2717749	32.7	251805	52910	319559.1	4	10	0
 ```
 
-The assembly lengths obtained were all around 2 Mb, which is expected for this species. 
+The assembly lengths obtained were all around 2.7 to 2.8 Mb, which is expected for this species. 
 
-The contiguity of our assemblies seemed excellent, with only up to XX fragments. Some samples even had X fragments assembled, suggesting we managed to mostly reassemble our genomes. 
+The contiguity of our assemblies seemed excellent, with only up to 205 fragments. Some samples even had fewer than 50 fragments assembled, suggesting we managed to mostly reassemble our genomes. 
 
-Isolates XX and XX have more than X contigs. We may want to consider removing these isolates as they may affect our pan-genome calculation in the next section.
+Isolate ERX3876939_ERR3864886_T1_contigs has more than 200 contigs. A this is higher than the number obtained for the rest of our assembled genomeWe may want to consider removing this isolate as it may affect our pan-genome calculation in the next section. However, as the rest of the metrics for this genome look alright, let's leave it in for now.
 
 :::
 :::
@@ -151,14 +154,18 @@ We opened the file `results/assembleBAC/metadata/checkm2_summary.tsv` in _Excel_
 
 ```
 Name	Completeness	Contamination	Completeness_Model_Used	Translation_Table_Used	Coding_Density	Contig_N50	Average_Gene_Length	Genome_Size	GC_Content	Total_Coding_Sequences	Additional_Notes
-CTMA_1402	99.99	1.16	Neural Network (Specific Model)	11	0.865	3058794	310.65996343692865	4118479	0.47	3829	None
+ERX3876905_ERR3864852_T1_contigs	100	0.22	Neural Network (Specific Model)	11	0.841	188199	297.9439072	2743298	0.33	2585	None
+ERX3876907_ERR3864854_T1_contigs	100	0.49	Neural Network (Specific Model)	11	0.839	174940	293.8538879	2900162	0.33	2765	None
+ERX3876908_ERR3864855_T1_contigs	100	0.07	Neural Network (Specific Model)	11	0.841	707553	301.2190213	2717933	0.33	2534	None
+ERX3876909_ERR3864856_T1_contigs	100	0.17	Neural Network (Specific Model)	11	0.84	181628	295.3832534	2854371	0.33	2711	None
+ERX3876929_ERR3864876_T1_contigs	100	0.1	Neural Network (Specific Model)	11	0.841	605766	304.9176136	2675834	0.33	2464	None
 ```
 
 We can see from this that: 
 
-- We achieved ~90% completeness (according to _CheckM2_'s database) for all our samples. 
+- We achieved 100% completeness (according to _CheckM2_'s database) for all our samples. 
 - There was no evidence of strong contamination affecting our assemblies (all ~5% or below).
-- The GC content was XX%, which is what is expected for _Staphylococcus aureus_. 
+- The GC content was 33%, which is what is expected for _Staphylococcus aureus_. 
 :::
 :::
 

diff --git a/materials/29-pathogenwatch.md b/materials/29-pathogenwatch.md
@@ -9,7 +9,17 @@ title: "Pathogenwatch 2"
 
 :::
 
-Assemblies provided in `results/assemblebac/assemblies`
+:::{.callout-exercise}
+#### Analysing Pneumococcal genomes with Pathogenwatch
+
+Whilst you can upload FASTQ files to Pathogenwatch, it's quicker if we work with already assembled genomes.  We've provided pre-processed results for the _S. pneumoniae_ data generated with AssembleBAC in the `preprocessed/` directory which you can use to upload to Pathogenwatch in this exercise.
+
+:::{.callout-answer}
+
+
+:::
+
+
 
 ## Summary