update 281123

cambiotraining · Nov 28, 2023 · b400940 · b400940
1 parent 109fe45
commit b400940

Showing 54 changed files with 922 additions and 388 deletions.
diff --git a/materials/.DS_Store b/materials/.DS_Store
diff --git a/materials/07-bacqc.md b/materials/07-bacqc.md
@@ -110,7 +110,13 @@ Now, you can open this file in `Excel` and edit the path to the data (it's good
 
 ## Running bacQC
 
-Now that we have the samplesheet, we can run the `bacQC` pipeline.  There are [many options](https://github.com/avantonder/bacQC/blob/main/docs/parameters.md) that can be used to customise the pipeline but a typical command is shown below:
+Now that we have the samplesheet, we can run the `bacQC` pipeline.  First, let's activate the `nextflow` software environment:
+
+```bash
+mamba activate nextflow
+```
+
+There are [many options](https://github.com/avantonder/bacQC/blob/main/docs/parameters.md) that can be used to customise the pipeline but a typical command is shown below:
 
 ```bash
 nextflow run avantonder/bacQC \

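Both pipeline chapters touched by this commit rely on the `nextflow` environment being active. A minimal sketch of one way to confirm this before launching a run is shown below. The helper name `require_tool` is hypothetical and not part of the course scripts; it only checks that a command is on `PATH` and prints a hint otherwise.

```bash
# Hypothetical helper (not part of the course scripts):
# succeed if the command named in $1 is on PATH, otherwise print a hint
# suggesting the mamba environment named in $2.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "'$1' not found; try: mamba activate $2" >&2
  return 1
}

# Example use before starting a pipeline run:
#   require_tool nextflow nextflow || exit 1
```

Because the function only inspects `PATH`, it cannot tell which environment supplied the command; it simply catches the common case of forgetting to activate one.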
diff --git a/materials/09-bactmap.md b/materials/09-bactmap.md
@@ -42,7 +42,7 @@ Along with the outputs produced by the above tools, the pipeline produces the fo
 
 ## Running nf-core/bactmap
 
-The bactmap pipeline requires a samplesheet CSV file in the same format as the one we used for bacQC so we can re-use that samplesheet CSV file. If you decided to remove any samples because they didn't pass the QC, then edit the samplesheet CSV file accordingly. There are [many options](https://github.com/nf-core/bactmap/blob/1.0.0/docs/usage.md) that can be used to customise the pipeline but a typical command is shown below:
+The bactmap pipeline requires a samplesheet CSV file in the same format as the one we used for bacQC so we can re-use that samplesheet CSV file. If you decided to remove any samples because they didn't pass the QC, then edit the samplesheet CSV file accordingly. There are [many options](https://github.com/nf-core/bactmap/blob/1.0.0/docs/usage.md) that can be used to customise the pipeline but a typical command is shown below (check that your nextflow environment is still active):
 
 ```bash
 nextflow run nf-core/bactmap \
@@ -320,7 +320,7 @@ Original alignment length: 4435783	New alignment length:4435783
 Done.
 ```
 
-The masked final alignment will saved to the `results/bactmap/masked_alignment/` directory.  
+The masked final alignment will be saved to the `results/bactmap/masked_alignment/` directory.  
 
 Alternatively we've provided a script, `04-mask_pseudogenome.sh` in the `scripts` directory which could be used instead with `bash`:
 

diff --git a/materials/10-phylogenetics.md b/materials/10-phylogenetics.md
@@ -184,7 +184,15 @@ We run `IQ-TREE` on the output from `snp-sites`, i.e. using the variable sites e
 mkdir results/iqtree
 
 # run iqtree2
-iqtree -s results/snp-sites/core_gene_alignment_snps.aln -fconst 692240,1310839,1306835,691662 --prefix results/iqtree/Nam_TB -nt AUTO -ntmax 8 -mem 8G -m GTR+F+I -bb 1000
+iqtree \
+  -s results/snp-sites/core_gene_alignment_snps.aln \
+  -fconst 692240,1310839,1306835,691662 \
+  --prefix results/iqtree/Nam_TB \
+  -nt AUTO \
+  -ntmax 8 \
+  -mem 8G \
+  -m GTR+F+I \
+  -bb 1000
 ```
 
 The options used are: 

diff --git a/materials/12-tree_visualization.md b/materials/12-tree_visualization.md
@@ -9,6 +9,78 @@ title: "Visualising phylogenies"
 
 :::
 
+There are many programs that can be used to visualise phylogenetic trees.  Some of the popular programs include `FigTree`, `iTOL` and the R library `ggtree`.  For this course, we're going to use the web-based tool `Microreact` as it allows users to interactively manipulate the tree, add metadata and generate other plots including maps and histograms of metadata variables in a single interface.
+
+![The Microreact landing page](images/microreact_landing.png)
+
+In order to use this platform you will first need to [**create an account**](https://microreact.org/api/auth/signin) (or sign in with an existing Google, Facebook or Twitter account).
+
+## Uploading tree files and metadata
+
+Once you've logged into Microreact, you can upload the Namibian tree file (`Nam_TB.treefile`) and combined metadata TSV file (`TB_metadata.tsv`) we created earlier. 
+
+1. First, copy `Nam_TB.treefile` to the analysis directory so that both files are in the same location:
+
+```bash
+cp results/iqtree/Nam_TB.treefile .
+```
+
+2. Click on the **UPLOAD** link in the top-right corner of the page:
+
+![](images/microreact_upload1.png)
+
+3. Click the **+** button on the bottom-right corner then **Browse Files** to upload the files:
+
+![](images/microreact_upload2.png)
+
+4. This will open a file browser, where you can select the tree file and metadata from your local machine. Go to the `M_tuberculosis` folder where you have the results we've generated so far this week. Click and select the `Nam_TB.treefile` and `TB_metadata.tsv` files while holding the <kbd>Ctrl</kbd> key. Click Open on the dialogue window after you have selected both files.
+
+![](images/microreact_upload3.png)
+
+5. A new dialogue box will open listing the files you've uploaded along with the File kind, which Microreact detects automatically.  As the File kind for both files is correct, go ahead and click **CONTINUE**:
+
+![](images/microreact_upload4.png)
+
+6. Microreact will load the data and process it.  The final step before we can have a look at the tree and annotate it is to confirm to Microreact which column in the metadata corresponds to the tip labels in the tree so it can match them.  By default Microreact will use the first column, in this case `sample` which is correct so click **CONTINUE**:
+
+![](images/microreact_upload5.png)
+
+7. You should now see three windows in front of you.  The top-left has a map with the locations of your isolates based on the longitude and latitude values included in `TB_metadata.tsv`.  The top-right has the phylogenetic tree with a separate colour for each tip (by default Microreact will colour the tips by `BorstelID`).  Across the bottom you have the metadata from `TB_metadata.tsv`.
+
+![](images/microreact_loaded.png)
+
+8. The first thing we're going to do is change the colour of the tip nodes to **Region**.  Click on the **Eye** icon in the top-right hand corner and change **Colour Column** to **Region**: 
+
+![](images/microreact_region1.png)
+
+9. This will change the colour of the tip nodes as well as the pie charts on the map - each region has its own colour as you'd expect:
+
+![](images/microreact_region2.png)
+
+10. At this point, before we proceed any further, let's save the project to your account. Click on the **Save** button in the top-right corner, change the project name to **Namibia TB** and add a short description so you know what the dataset is.  Then click **Save as a New Project** (another dialogue box will appear asking if you want to share your project; for now, close this box):
+
+![](images/microreact_save.png)
+
+11. Now, let's make our phylogenetic tree a bit more informative.  First, let's add the tip labels to the display by clicking on the left-hand of the two buttons in the phylogeny window then the drop down arrow next to **Nodes & Labels**. Now click the slider next to **Leaf Labels** and the slider next to **Align Leaf Labels**. We'll also make the text a little smaller by moving the slider to _12px_:
+
+![](images/microreact_tips1.png)
+
+12. The tip labels are still the European Nucleotide Archive accessions we used to download the FASTQ files. Let's change the tip labels to the `BorstelID` which is what's used in the paper.  Click on the **Eye** icon again and change **Labels Column** to `BorstelID`:
+
+![](images/microreact_tips2.png)
+
+13. It's often useful to root a phylogenetic tree as it will more accurately reflect the relationships between our samples.  As the ancestral reference sequence `MTBC0`, which we used to build the tree, is included in the tree, we can use it as our root. To root the tree, hover over the node that joins `MTBC0` to three other samples and right click when a circle appears. Then click **Set as Root (Re-root)**:
+
+![](images/microreact_root.png)
+
+14. From the rooted tree, we can see we have two distinct clades within the tree.  These are the two major lineages we identified in our dataset (Lineage 1 and Lineage 2). To make this clearer, change the colour of the tip nodes to `main_lineage` and click on **Legend** on the far right-hand side of the plot. Now we have a tree and map annotated with the two lineages in our dataset:
+
+![](images/microreact_lineage.png)
+
+15. The last thing we're going to do is add a histogram to our Microreact window showing the frequency of lineages across the different regions. Click on the **Pencil** icon on the top-right corner and click **Create New Chart**, then move your mouse into the right-hand side of the metadata box at the bottom of the window and click when you see the blue box appear. A blank chart should appear. Click **Chart Type** and select **Bar chart** and change the **X Axis Column** to `Region`.  The plot should auto-populate with the region on the X-axis and the number of entries on the Y-axis.  The bars are coloured according to `main_lineage` which is what we're currently using to colour our plots:
+
+![](images/microreact_hist.png)
+
 ## Summary
 
 ::: {.callout-tip}

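Step 6 of the Microreact walkthrough depends on the first metadata column matching the tip labels in the tree. A rough sketch of how that match could be checked on the command line before uploading is shown below. It uses toy files in place of `Nam_TB.treefile` and `TB_metadata.tsv`; the sample names and `Region` values are invented for illustration, and the label extraction assumes unquoted tip names in a plain Newick file.

```bash
# Toy inputs standing in for Nam_TB.treefile and TB_metadata.tsv (invented values).
workdir=$(mktemp -d)
printf '((ERR1:0.1,ERR2:0.2)95:0.05,MTBC0:0.0);\n' > "$workdir/toy.treefile"
printf 'sample\tRegion\nERR1\tNorth\nERR2\tSouth\n' > "$workdir/toy.tsv"

# Tip labels directly follow "(" or ","; labels that follow ")" belong to
# internal nodes (e.g. support values) and are skipped.
grep -oE '[(,][^(),:;]+' "$workdir/toy.treefile" | tr -d '(,' | sort > "$workdir/tips.txt"

# First metadata column, with the header row skipped.
tail -n +2 "$workdir/toy.tsv" | cut -f1 | sort > "$workdir/ids.txt"

# Tip labels that have no matching metadata row.
missing=$(comm -23 "$workdir/tips.txt" "$workdir/ids.txt")
echo "Tips without metadata: ${missing:-none}"
```

In the toy data the tip `MTBC0` has no metadata row, so it is the only label reported; swapping the two arguments to `comm` would list metadata rows that have no tip instead.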
diff --git a/materials/13-group_exercise_1.md b/materials/13-group_exercise_1.md
@@ -9,6 +9,17 @@ title: "Group Exercise 1"
 
 :::
 
+Presenting your data to different audiences is an important part of being a scientist and you should be able to tailor your research outputs accordingly. For the first group exercise, we're going to use Microreact to design an infographic of the data we've been working with, tailored to one of the following audiences:
+
+- Field epidemiologist
+- Head of a public health lab
+- Minister of Health
+- Concerned citizen
+
+We're going to gather you into groups of approximately the same size based on where you're sitting in the room and assign each group one of the audiences to tailor a Microreact display for.  We suggest that one person in each group is responsible for manipulating the Microreact project whilst the other members of the group provide useful input. In 30 minutes we'll ask you to nominate a member of your group to give a two-minute presentation of the data, remembering that you also need to tailor your presentation to your audience, avoiding jargon where appropriate.
+
+Good luck!
+
 ## Summary
 
 ::: {.callout-tip}

diff --git a/materials/14-transmission.md b/materials/14-transmission.md
@@ -12,23 +12,63 @@ title: "Building transmission networks"
 
 :::
 
-:::{.callout-exercise}
-#### Calculate pairwise SNP distances
+## Transmission networks in bacteria
+
+### Identifying transmission networks in TB
 
+## Generating a pairwise SNP distance matrix
+The first step in building putative transmission networks is to calculate the pairwise SNP distances between all the samples in our dataset. We can do this by running a tool called `pairsnp` on the SNP alignment we used to build our phylogenetic tree.
 
+We'll start by activating the `pairsnp` software environment:
+
+```bash
 mamba activate pairsnp
+```
 
-bash 07-run_pairsnp.sh
+To run `pairsnp` on `aligned_pseudogenomes_masked_snps.fas`, the following commands can be used:
 
-:::
+```bash
+# create output directory
+mkdir -p results/transmission/
+
+# run pairsnp
+pairsnp results/snp-sites/aligned_pseudogenomes_masked_snps.fas -c > results/transmission/aligned_pseudogenomes_masked_snps.csv
+```
+The option we used is:
+
+- `-c` - saves the `pairsnp` output in CSV format.
+
+The pairwise SNP matrix will be saved to the `results/transmission/` directory.  
+
+Alternatively we've provided a script, `07-run_pairsnp.sh` in the `scripts` directory which could be used instead with `bash`:
+
+```bash
+bash scripts/07-run_pairsnp.sh
+```
+
+## Calculating and plotting transmission networks in R
+
+Now that we've generated a pairwise SNP distance matrix, we can use **R** to calculate and plot our transmission network using a pre-determined threshold of **5** SNPs to identify putative transmission events. Open RStudio then open the script `08-transmission.R` in the `scripts` directory. Run the code in the script, going line by line (remember in RStudio you can run code from the script panel using <kbd>Ctrl</kbd> + <kbd>Enter</kbd>). As you run the code check the tables that are created (in your "Environment" panel on the top-right) and see if the SNP matrix was correctly imported.  Once you reach the end of the script, you should have created a plot showing the putative transmission networks identified in the data with the nodes coloured by Sex and the pairwise SNP distances shown along the edges:
+
+![Putative transmission networks generated using a 5 SNP threshold](images/5_snp_network.png)
 
 :::{.callout-exercise}
-#### Calculating and plotting transmission networks
+#### Adjust the pairwise SNP threshold
+As discussed in the introduction above, various SNP thresholds are used when inferring putative transmission networks in TB.  We used the most conservative threshold of 5 SNPs.  For this exercise:
+
+- Change the SNP threshold to 12 SNPs and recalculate the transmission networks
+- Change the colour of the nodes to show Region instead of Sex 
+- How many additional networks did we infer compared to using a threshold of 5 SNPs?
 
+:::{.callout-answer}
+- We changed the variable `threshold` to `12` then re-ran the subsequent code to generate new networks.  
+- In the command to generate the final plot, we changed `geom_node_point(aes(colour = Sex), size = 6)` to `geom_node_point(aes(colour = Region), size = 6)` and changed `labs(colour = "Sex")` to `labs(colour = "Region")`.
+- We generated one additional network but identified a much more complex network comprising 13 isolates when using the higher SNP threshold of 12.
 
-08-transmission.R
+![Putative transmission networks generated using a 12 SNP threshold](images/12_snp_network.png)
 
 :::
+:::
 
 
 ## Summary

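The thresholding idea behind the transmission chapter (link two samples when their pairwise SNP distance is within the chosen cut-off) can be sketched in a few lines of shell. This is not the course's `08-transmission.R` script. The matrix layout (a header row plus sample names in the first column), the toy values, the function name `list_links` and the use of "less than or equal to" are all assumptions made for illustration, and real `pairsnp` output may be laid out differently.

```bash
# Toy distance matrix; layout and values are assumptions, not real pairsnp output.
workdir=$(mktemp -d)
cat > "$workdir/toy_snps.csv" <<'EOF'
sample,A,B,C,D
A,0,3,14,40
B,3,0,12,38
C,14,12,0,41
D,40,38,41,0
EOF

# Hypothetical helper: print "sample1,sample2,distance" for each unordered pair
# whose distance is at or below the threshold given in $1.
list_links() {
  awk -F',' -v t="$1" '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
    # Only walk the upper triangle so each pair is reported once.
    { for (i = NR + 1; i <= NF; i++) if ($i + 0 <= t) print $1 "," name[i] "," $i }
  ' "$2"
}

links_5=$(list_links 5 "$workdir/toy_snps.csv")
links_12=$(list_links 12 "$workdir/toy_snps.csv")
printf '5 SNPs:\n%s\n12 SNPs:\n%s\n' "$links_5" "$links_12"
```

With the toy matrix, only the pair A and B is linked at 5 SNPs, while raising the threshold to 12 also links B and C, which joins A, B and C into one larger putative network. This mirrors the effect explored in the exercise.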
diff --git a/materials/18-de_novo_assembly.md → materials/18-assembly_annotation.md b/materials/18-de_novo_assembly.md → materials/18-assembly_annotation.md
@@ -1,5 +1,5 @@
 ---
-title: "de novo Assembly"
+title: "de novo Assembly and Annotation"
 ---
 
 ::: {.callout-tip}
@@ -19,6 +19,10 @@ There are two approaches for genome assembly: reference-based (or comparative)
 
 Several tools are available for *de novo* genome assembly depending on whether you're trying to assemble short-read sequence data, long reads or else a combination of both.  Two of the most commonly used assemblers for short-read Illumina data are `Velvet` and `SPAdes`.  SPAdes has become the *de facto* standard de novo genome assembler for Illumina whole genome sequencing data of bacteria and is a major improvement over previous assemblers like Velvet. However, some of its components can be slow and it traditionally did not handle overlapping paired-end reads well.  `Shovill` is a pipeline which uses `SPAdes` at its core, but alters the steps before and after the primary assembly step to get similar results in less time. Shovill also supports other assemblers like `SKESA`, `Velvet` and `Megahit`.
 
+## Genome annotation
+
+Genome annotation is a multi-level process that includes prediction of protein-coding genes (CDSs), as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.  The most commonly used tools for annotating bacterial genomes are `Prokka` and, more recently, `Bakta`.  Both use a tool called `prodigal` to predict the protein-coding regions along with other tools for predicting other genomic features such as `Aragorn` for tRNA.  Once the genomic regions have been predicted, the tools use a database of existing bacterial genome annotations, normally generated from a large collection of genomes such as UniRef, to add this information to your genome.
+
 ## Summary
 
 ::: {.callout-tip}

diff --git a/materials/19-annotation.md b/materials/19-annotation.md
diff --git a/materials/22-pan_genomes.md b/materials/22-pan_genomes.md
@@ -13,6 +13,8 @@ title: "Introduction to Pan-genomes"
 
 When you have a very diverse dataset where no single reference is going to accurately reflect the population structure withn your dataset, then a reference independent approach such as constructing a core gene alignment as part of a pan-genome analysis is the best way to build a multiple sequence alignmnet for phylogenetic inference.  There are several tools available to do this including `roary`, `panaroo` and `panX`.  It's important to note that the alignments produced using these tools only contain the genes found in all or nearly all of the samples meaning that the amount of potentially phylogenetically informative information is reduced.  For this reason, core gene based phylogenies are useful for looking at a whole species but it's generally preferable to perform clustering and create new sub-trees using reference mapping if you're interested in examining the relationship between more closely related genomes.
 
+## Core gene alignment
+
 ## Summary
 
 ::: {.callout-tip}