Skip to content

Commit

Permalink
update 151123
Browse files Browse the repository at this point in the history
  • Loading branch information
avantonder committed Nov 15, 2023
1 parent 572d4f3 commit 98c47b0
Showing 11 changed files with 460 additions and 66 deletions.
6 changes: 6 additions & 0 deletions materials/02-know_your_bug.md
Original file line number Diff line number Diff line change
@@ -13,6 +13,12 @@ Before starting your analysis, it's important to understand the characteristics

![Flowchart showing different approaches to building a bacterial phylogenetic tree](images/know_your_bug.jpeg)

This week, we're going to work with three different bacterial species that each require a different analysis path through the diagram above.

- _Mycobacterium tuberculosis_ - a monoclonal (i.e. relatively little genomic diversity) organism that doesn't recombine so we'll build a phylogenetic tree by mapping to a reference.
- _Staphylococcus aureus_ - a diverse organism with low levels of recombination. We're going to be working with a dataset consisting of more than a single lineage so we'll take the approach of building a core gene alignment to use to infer our phylogenetic tree.
- _Streptococcus pneumoniae_ - a diverse organism that frequently recombines. As the dataset consists of isolates from a single lineage/serotype, we're going to map to a reference and identify and remove recombinant regions before building our phylogenetic tree.

## Summary

::: {.callout-tip}
2 changes: 0 additions & 2 deletions materials/03-nextflow.md
Original file line number Diff line number Diff line change
@@ -100,8 +100,6 @@ Nextflow provides a separation between the pipeline’s functional logic and the

Nextflow provides out-of-the-box support for major batch schedulers and cloud platforms such as Sun Grid Engine, SLURM job scheduler, AWS Batch service and Kubernetes. A full list can be found [here](https://www.nextflow.io/docs/latest/executor.html).

## Nextflow Tower/Seqera

## Snakemake

In this tutorial we've focused on Nextflow but many people in the bioinformatics community use Snakemake. Similar to Nextflow, the Snakemake workflow management system is a tool for creating reproducible and scalable data analyses. The main difference is that workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.
2 changes: 1 addition & 1 deletion materials/05-intro_tb.md
Original file line number Diff line number Diff line change
@@ -21,7 +21,7 @@ Increasingly, *M. tuberculosis* is resistant to many of the frontline antimicrob

## Course dataset

We will be analysing a dataset of XX Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course and for time, we're only going to analyse 50 genomes from the dataset.
We will be analysing a dataset of Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course, mainly to save time, we're only going to analyse 50 genomes from the dataset.

## MTBC ancestral reference sequence

12 changes: 12 additions & 0 deletions materials/07-bacqc.md
Original file line number Diff line number Diff line change
@@ -304,6 +304,18 @@ With respect to potential species contamination, no samples contained more than
:::
:::

:::{.callout-warning}

**The Nextflow work directory**

Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory. This means that if, for whatever reason, the pipeline doesn't finish successfully you can resume it. However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it:

```bash
rm -rf work
```

:::

## Summary

::: {.callout-tip}
31 changes: 0 additions & 31 deletions materials/09-bactmap.md
Original file line number Diff line number Diff line change
@@ -161,29 +161,6 @@ This section of the report shows the software run as part of nf-core/bactmap and

![nf-core/bactmap MultiQC software versions](images/bactmap_software_versions.png)

### The pseudogenomes directory

This directory contains the files that are most useful for our downstream analyses. Change to the directory and list the files:

```bash
cd bactmap_results/pseudogenomes

ls
```

You will see the pseudogenome fasta files (a version of the reference file where the sample variants have been inserted and low-quality or missing data has been masked) for each sample, an alignment of all the samples and the reference sequence (`aligned_pseudogenomes.fas`), and a tsv file of genomes removed from the complete alignment due to poor mapping (`low_quality_pseudogenomes.tsv`):

```bash
aligned_pseudogenomes.fas
low_quality_pseudogenomes.tsv
```

Let's check to see if any of our samples were removed from our alignment:

```bash
cat low_quality_pseudogenomes.tsv
```
This returns an empty file; all of our samples mapped well enough to the reference to be included in our complete alignment. That means we can proceed to the next step of our analysis: phylogenetic tree inference.

## Check how much of the reference was mapped

@@ -218,14 +195,6 @@ remove_blocks_from_aln.py -a aligned_pseudogenomes.fas -t ../../../resources/mas

bash scripts/04-mask_pseudogenome.sh
```
### The work directory

Each step of the pipeline produces one or more files that are not saved to the results directory but are kept in the work directory. This means that if, for whatever reason, the pipeline doesn't finish successfully you can resume it. However, once the pipeline has completed successfully, you no longer need this directory (it can take up a lot of space) so you can delete it:

```bash
rm -rf work
```

## Summary

::: {.callout-tip}
4 changes: 4 additions & 0 deletions materials/17-intro_staph.md
Original file line number Diff line number Diff line change
@@ -9,6 +9,10 @@ title: "Introduction to *Staphylococcus aureus*"

:::

## Course dataset

van Tonder 2023

## Summary

::: {.callout-tip}
1 change: 0 additions & 1 deletion materials/18-de_novo_assembly.md
Original file line number Diff line number Diff line change
@@ -6,7 +6,6 @@ title: "de novo Assembly"
## Learning Objectives

- Understand what *de novo* genome assembly is and how it differs from reference-based assembly.
- Assemble sequence data with `assembleBAC`

:::

61 changes: 34 additions & 27 deletions materials/21-assembly_qc.md
Original file line number Diff line number Diff line change
@@ -37,18 +37,17 @@ Let's now turn to some of the other metrics to help us assess our assemblies' qu

## Contiguity

One of the outputs from running assembleBAC is a summary file containing the `Quast` outputs for each sample. This file can be found in `results/assembleBAC/metadata/transposed_report.csv`.
One of the outputs from running assembleBAC is a summary file containing the `Quast` outputs for each sample. This file can be found in `results/assembleBAC/metadata/transposed_report.tsv`.

You can open it with a spreadsheet software such as _Excel_ from our file browser <i class="fa-solid fa-folder"></i>.
Alternatively you can print it directly on the terminal in a nice format using the `column` command, as shown here for the samples we've been using as an example ((for brevity, we only show some columns of most interest):):

```bash
column -t -s "," results/assembleBAC/metadata/transposed_report.csv
```
You can open it with a spreadsheet software such as _Excel_ from our file browser <i class="fa-solid fa-folder"></i> (for brevity, we only show the columns of most interest):

```
Assembly # contigs (>= 0 bp) # contigs (>= 1000 bp) Total length (>= 0 bp) Largest contig N50
CTMA_1432 2 2 4103466 3057934 3057934
ERX3876932_ERR3864879_T1_contigs 77 40 2848635 484893 109559
ERX3876949_ERR3864896_T1_contigs 70 18 2683262 493468 251833
ERX3876930_ERR3864877_T1_contigs 84 20 2729135 557153 251805
ERX3876908_ERR3864855_T1_contigs 45 13 2717933 792936 707553
ERX3876945_ERR3864892_T1_contigs 30 9 2670961 1351816 1351816
```

The columns are:
@@ -61,38 +60,40 @@ The columns are:
- **N50** - a metric indicating the length of the shortest fragment, from the group of fragments that together represent at least 50% of the total genome. A higher N50 value suggests better contig lengths.

To interpret these statistics, it helps to compare them with other well-assembled _Staphylococcus aureus_ genomes.
For example, let's take the first MRSA genome that was sequenced, [XXXX]('NCBI link'), as our reference for comparison.
This genome is X Mb long and is composed of a single chromosome.
For example, let's take the first MRSA genome that was sequenced, [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/), as our reference for comparison.
This genome is 2.8 Mb long and is composed of a single chromosome.

We can see that all of our assemblies reached a total length of around XX to XX Mb, which matches the expected length from our reference genome.
We can see that all of our assemblies reached a total length of around 2.7 to 2.9 Mb, which matches the expected length from our reference genome.
This indicates that we managed to assemble most of the expected genome.
However, we can see that there is a variation in the number of fragments in the final assemblies (i.e. their contiguity).
For instance, isolates X and X were assembled to a small number of fragments each, suggesting good assemblies. For several other isolates our assemblies were more fragmented, in particular X and X, which had hundreds of fragments.
This indicates less contiguous sequences.

## Completeness

We now turn to assessing **genome completeness**, i.e. whether we managed to recover most of the known _Staphylococcus aureus_ genome, or whether we have large fractions missing. We can assess this by using [`CheckM2`](https://github.com/chklovski/CheckM2) which was run as part of the assembleBAC pipeline. This tool assesses the completeness of the assembled genomes based on other similar organisms in public databases, in addition to contamination scores. The output file from assembleBAC summarising the `CheckM2` results is a tab-delimited file called `checkm2_summary.tsv`. This file can be found in `results/assembleBAC/metadata/` and can be opened in a spreadsheet software such as _Excel_. Here is an example result (for brevity, we only show some columns of most interest):
We now turn to assessing **genome completeness**, i.e. whether we managed to recover most of the known _Staphylococcus aureus_ genome, or whether we have large fractions missing. We can assess this by using [`CheckM2`](https://github.com/chklovski/CheckM2) which was run as part of the assembleBAC pipeline. This tool assesses the completeness of the assembled genomes based on other similar organisms in public databases, in addition to contamination scores. The output file from assembleBAC summarising the `CheckM2` results is a tab-delimited file called `checkm2_summary.tsv`. This file can be found in `results/assembleBAC/metadata/` and can be opened in a spreadsheet software such as _Excel_. Here is an example result (for brevity, we only show the columns of most interest):

```
Name Completeness Contamination Genome_Size GC_Content Total_Coding_Sequences
isolate01 92.51 5.19 4192535 0.48 5375
Name Completeness Contamination Genome_Size GC_Content Total_Coding_Sequences
ERX3876905_ERR3864852_T1_contigs 100 0.22 2743298 0.33 2585
ERX3876907_ERR3864854_T1_contigs 100 0.49 2900162 0.33 2765
ERX3876908_ERR3864855_T1_contigs 100 0.07 2717933 0.33 2534
ERX3876909_ERR3864856_T1_contigs 100 0.17 2854371 0.33 2711
ERX3876929_ERR3864876_T1_contigs 100 0.1 2675834 0.33 2464
```
These columns indicate:

- **Name** - our sample name.
- **Completeness** - how complete our genome was inferred to be as a percentage; this is based on the machine learning models used and the organisms present in the database.
- **Contamination** - the percentage of the assembly estimated to be contaminated with other organisms (indicating our assembly isn't "pure").
- **Genome_Size** - how big the genome is estimated to be, based on other similar genomes present in the database.
The [XXXX]('NCBI link') is X Mb in total, so these values make sense.
The [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/) is 2.8 Mb in total, so these values make sense.
- **GC_Content** - the percentage of G's and C's in the genome, which is relatively constant within a species.
The _S. aureus_ GC content is X%, so again these values make sense.
The _S. aureus_ GC content is approximately 33%, so again these values make sense.
- **Total_Coding_Sequences** - the total number of coding sequences (genes) that were identified by _CheckM2_.
The [reference genome]('NCBI link') indicates annotated genes, so the values obtained could be overestimated.

From this analysis, we can assess that our genome assemblies are good quality, with around 90% of the genome assembled.
Despite earlier assessing that some of the isolates have more fragmented assemblies (e.g. isolates 5 and 9), they still have >85% completeness according to _CheckM2_.
The [N315](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000009645.1/) indicates annotated genes, so the values obtained could be overestimated.

From this analysis, we can assess that our genome assemblies are good quality, with 100% of the genome assembled for all our isolates.
It is worth noting that the assessment from _CheckM2_ is an approximation based on other genomes in the database.
In diverse species such as _S. aureus_ the completeness may be underestimated, because each individual strain will only carry part of the _pangenome_ for that species.

@@ -123,14 +124,16 @@ We opened this table in _Excel_ and this is what we obtained:

```
Assembly # contigs (>= 0 bp) # contigs (>= 1000 bp) # contigs (>= 5000 bp) # contigs (>= 10000 bp) # contigs (>= 25000 bp) # contigs (>= 50000 bp) Total length (>= 0 bp) Total length (>= 1000 bp) Total length (>= 5000 bp) Total length (>= 10000 bp) Total length (>= 25000 bp) Total length (>= 50000 bp) # contigs Largest contig Total length GC (%) N50 N90 auN L50 L90 # N's per 100 kbp
CTMA_1432 2 2 2 2 2 2 4103466 4103466 4103466 4103466 4103466 4103466 2 3057934 4103466 47.50 3057934 1045532 2545189.2 1 2 0.00
ERX3876932_ERR3864879_T1_contigs 77 40 36 29 25 20 2848635 2836870 2827630 2778943 2698417 2521114 47 484893 2841571 32.73 109559 41891 191514.9 6 21 0
ERX3876949_ERR3864896_T1_contigs 70 18 16 15 14 11 2683262 2671100 2666432 2659228 2636490 2507149 21 493468 2673387 32.69 251833 78361 283091.3 4 10 0
ERX3876930_ERR3864877_T1_contigs 84 20 20 18 15 11 2729135 2714469 2714469 2701567 2644913 2501830 25 557153 2717749 32.7 251805 52910 319559.1 4 10 0
```

The assembly lengths obtained were all around 2 Mb, which is expected for this species.
The assembly lengths obtained were all around 2.7 to 2.8 Mb, which is expected for this species.

The contiguity of our assemblies seemed excellent, with only up to XX fragments. Some samples even had X fragments assembled, suggesting we managed to mostly reassemble our genomes.
The contiguity of our assemblies seemed excellent, with only up to 205 fragments. Some samples even had fewer than 50 fragments assembled, suggesting we managed to mostly reassemble our genomes.

Isolates XX and XX have more than X contigs. We may want to consider removing these isolates as they may affect our pan-genome calculation in the next section.
Isolate ERX3876939_ERR3864886_T1_contigs has more than 200 contigs. A this is higher than the number obtained for the rest of our assembled genomeWe may want to consider removing this isolate as it may affect our pan-genome calculation in the next section. However, as the rest of the metrics for this genome look alright, let's leave it in for now.

:::
:::
@@ -151,14 +154,18 @@ We opened the file `results/assembleBAC/metadata/checkm2_summary.tsv` in _Excel_

```
Name Completeness Contamination Completeness_Model_Used Translation_Table_Used Coding_Density Contig_N50 Average_Gene_Length Genome_Size GC_Content Total_Coding_Sequences Additional_Notes
CTMA_1402 99.99 1.16 Neural Network (Specific Model) 11 0.865 3058794 310.65996343692865 4118479 0.47 3829 None
ERX3876905_ERR3864852_T1_contigs 100 0.22 Neural Network (Specific Model) 11 0.841 188199 297.9439072 2743298 0.33 2585 None
ERX3876907_ERR3864854_T1_contigs 100 0.49 Neural Network (Specific Model) 11 0.839 174940 293.8538879 2900162 0.33 2765 None
ERX3876908_ERR3864855_T1_contigs 100 0.07 Neural Network (Specific Model) 11 0.841 707553 301.2190213 2717933 0.33 2534 None
ERX3876909_ERR3864856_T1_contigs 100 0.17 Neural Network (Specific Model) 11 0.84 181628 295.3832534 2854371 0.33 2711 None
ERX3876929_ERR3864876_T1_contigs 100 0.1 Neural Network (Specific Model) 11 0.841 605766 304.9176136 2675834 0.33 2464 None
```

We can see from this that:

- We achieved ~90% completeness (according to _CheckM2_'s database) for all our samples.
- We achieved 100% completeness (according to _CheckM2_'s database) for all our samples.
- There was no evidence of strong contamination affecting our assemblies (all ~5% or below).
- The GC content was XX%, which is what is expected for _Staphylococcus aureus_.
- The GC content was 33%, which is what is expected for _Staphylococcus aureus_.
:::
:::

12 changes: 11 additions & 1 deletion materials/29-pathogenwatch.md
Original file line number Diff line number Diff line change
@@ -9,7 +9,17 @@ title: "Pathogenwatch 2"

:::

Assemblies provided in `results/assemblebac/assemblies`
:::{.callout-exercise}
#### Analysing Pneumococcal genomes with Pathogenwatch

Whilst you can upload FASTQ files to Pathogenwatch, it's quicker if we work with already assembled genomes. We've provided pre-processed results for the _S. pneumoniae_ data generated with AssembleBAC in the `preprocessed/` directory which you can use to upload to Pathogenwatch in this exercise.

:::{.callout-answer}


:::



## Summary

Loading

0 comments on commit 98c47b0

Please sign in to comment.