update 21123
avantonder committed Nov 21, 2023
1 parent 98c47b0 commit 109fe45
Showing 49 changed files with 663 additions and 127 deletions.
2 changes: 2 additions & 0 deletions _quarto.yml
@@ -54,4 +54,6 @@ book:
href: index.md
- text: "Data & Setup"
href: setup.md
sidebar:
collapse-level: 1

Binary file modified materials/.DS_Store
Binary file not shown.
10 changes: 5 additions & 5 deletions materials/03-nextflow.md
@@ -27,11 +27,11 @@ WfMS contain multiple features that simplify the development, monitoring, execut

Key features include:

* **Run time management**: Management of program execution on the operating system and splitting tasks and data to run at the same time in a process called parallelisation.
* **Software management**: Use of technology like containers, such as [Docker](https://www.docker.com) or [Singularity](https://sylabs.io/singularity), that packages up code and all its dependencies so the application runs reliably from one computing environment to another.
* **Portability & Interoperability**: Workflows written on one system can be run on another computing infrastructure e.g., local computer, compute cluster, or cloud infrastructure.
* **Reproducibility**: The use of software management systems and a pipeline specification means that the workflow will produce the same results when re-run, including on different computing platforms.
* **Reentrancy**: Continuous checkpoints allow workflows to resume
- **Run time management**: Management of program execution on the operating system and splitting tasks and data to run at the same time in a process called parallelisation.
- **Software management**: Use of technology like containers, such as [Docker](https://www.docker.com) or [Singularity](https://sylabs.io/singularity), that packages up code and all its dependencies so the application runs reliably from one computing environment to another.
- **Portability & Interoperability**: Workflows written on one system can be run on another computing infrastructure e.g., local computer, compute cluster, or cloud infrastructure.
- **Reproducibility**: The use of software management systems and a pipeline specification means that the workflow will produce the same results when re-run, including on different computing platforms.
- **Reentrancy**: Continuous checkpoints allow workflows to resume
from the last successfully executed steps.

## Nextflow basic concepts
10 changes: 1 addition & 9 deletions materials/04-nf_core.md
@@ -7,8 +7,7 @@ title: "The nf-core project"

- Understand what nf-core is and how it relates to Nextflow.
- Use the nf-core helper tool to find nf-core pipelines.
- Understand how to configuration nf-core pipelines.
- Run a small nf-core pipeline using a test dataset.
- Understand how to configure nf-core pipelines.

:::

@@ -114,13 +113,6 @@ nextflow run nf-core/rnaseq -r 3.0 -profile <institutional_config_profile>, test
### Multiple Nextflow configuration locations
Be clever with multiple Nextflow configuration locations. For example, use `-profile` for your cluster configuration, the file `$HOME/.nextflow/config` for your personal config such as `params.email`, and a `nextflow.config` file in the working directory for reproducible run-specific configuration.

:::{.callout-exercise}
**Note: As we have already prepared the config file, you can skip this exercise.**

Add the `params.email` to a file called `nfcore-custom.config`

:::

### Running pipelines with test data

The nf-core config profile `test` is a special profile that defines a minimal data set and configuration, so that the workflow runs quickly and is tested from beginning to end. Since the data is minimal, the output is often nonsense. Real-world example outputs are instead linked on the nf-core pipeline web page, where the workflow has been run with a full-size data set:
18 changes: 9 additions & 9 deletions materials/05-intro_tb.md
@@ -1,5 +1,5 @@
---
title: "Introduction to *Mycobacterium tuberculosis*"
title: "Introduction to _Mycobacterium tuberculosis_"
---

::: {.callout-tip}
@@ -11,21 +11,21 @@ title: "Introduction to *Mycobacterium tuberculosis*"

## *Mycobacterium tuberculosis*

*Mycobacterium tuberculosis*, the bacterium that causes tuberculosis (TB) in humans, is a significant pathogen with a considerable global impact. In 2020, the World Health Organization estimated that TB was responsible for 10.6 million active cases and 1.6 million deaths across the globe. This means that *M .tuberculosis* causes greater mortality than any other single pathogen. *M. tuberculosis* is a small, aerobic, nonmotile bacillus. The high lipid content of its cell wall makes the cell impervious to Gram staining, so it is classified as an acid-fast bacillus. The bacterium is able to survive and multiply within macrophages, which are cells that usually kill bacteria. This ability to evade the immune system contributes to its virulence. *M. tuberculosis* is transmitted from person to person via droplets from the throat and lungs of people with active respiratory disease. In healthy individuals, the immune system can often wall off the bacteria and prevent them from spreading within the body. However, in immunocompromised individuals, such as those with HIV, the bacteria can spread and cause active disease.
*Mycobacterium tuberculosis*, the bacterium that causes tuberculosis (TB) in humans, is a significant pathogen with a considerable global impact. In 2020, the World Health Organization estimated that TB was responsible for 10.6 million active cases and 1.6 million deaths across the globe. This means that _M. tuberculosis_ causes greater mortality than any other single pathogen. _M. tuberculosis_ is a small, aerobic, nonmotile bacillus. The high lipid content of its cell wall makes the cell impervious to Gram staining, so it is classified as an acid-fast bacillus. The bacterium is able to survive and multiply within macrophages, which are cells that usually kill bacteria. This ability to evade the immune system contributes to its virulence. _M. tuberculosis_ is transmitted from person to person via droplets from the throat and lungs of people with active respiratory disease. In healthy individuals, the immune system can often wall off the bacteria and prevent them from spreading within the body. However, in immunocompromised individuals, such as those with HIV, the bacteria can spread and cause active disease.

In 1998, the first complete genome sequence of a *M. tuberculosis* strain, the virulent laboratory reference strain H37Rv, was published (Cole 1998).The genome of *M. tuberculosis* is a single circular chromosome that is approximately 4.4 million base pairs in size and contains around 4000 genes. *M. tuberculosis* is a member of the Mycobacterium tuberculosis complex (MTBC), which includes different lineages, some referred to as *M. tuberculosis sensu stricto* (lineage 1 to lineage 4 and lineage 7), others as *M. africanum* (lineage 5 and lineage 6), two recently discovered lineages (lineage 8 and lineage 9), and several animal-associated ecotypes such as *M. bovis* and *M. caprae*. Some lineages are geographically widespread, while others like L5 and L6 (mainly found in West Africa), are more restricted. A simplified phylogeny showing the relationship of the various MTBC lineages is shown below.
In 1998, the first complete genome sequence of a _M. tuberculosis_ strain, the virulent laboratory reference strain H37Rv, was published (Cole 1998). The genome of _M. tuberculosis_ is a single circular chromosome that is approximately 4.4 million base pairs in size and contains around 4000 genes. _M. tuberculosis_ is a member of the Mycobacterium tuberculosis complex (MTBC), which includes different lineages, some referred to as _M. tuberculosis sensu stricto_ (lineage 1 to lineage 4 and lineage 7), others as _M. africanum_ (lineage 5 and lineage 6), two recently discovered lineages (lineage 8 and lineage 9), and several animal-associated ecotypes such as _M. bovis_ and _M. caprae_. Some lineages are geographically widespread, while others, like L5 and L6 (mainly found in West Africa), are more restricted. A simplified phylogeny showing the relationship of the various MTBC lineages is shown below.

![Phylogeny of M. tuberculosis lineage strains. Simplified maximum likelihood phylogeny of the 9 lineages of M. tuberculosis, as well as the related M. bovis strain and the M. canettii outgroup strain used as a root. (Coscolla 2021; Koleske 2023)](images/mtbc.jpg)

Increasingly, *M. tuberculosis* is resistant to many of the frontline antimicrobials used to treat TB such as isoniazid and rifampicin which is an enormous clinical, financial, and public health challenge across the world. Traditionally, susceptibility of TB isolates to different antimicrobials was conducted in the laboratory but in recent years, antimicrobial profiling using genomic sequencing has been shown to be nearly as accurate as lab methods, especially for the most commonly used drugs. Catalogues of genetic variants that are known to confer resistance to particular antimicrobials are used to type TB genomes potentially saving time and money.
Increasingly, _M. tuberculosis_ is resistant to many of the frontline antimicrobials used to treat TB, such as isoniazid and rifampicin, which is an enormous clinical, financial, and public health challenge across the world. Traditionally, susceptibility testing of TB isolates against different antimicrobials was conducted in the laboratory, but in recent years antimicrobial profiling using genomic sequencing has been shown to be nearly as accurate as lab methods, especially for the most commonly used drugs. Catalogues of genetic variants that are known to confer resistance to particular antimicrobials are used to type TB genomes, potentially saving time and money.

## Course dataset

We will be analysing a dataset of Namibian *M. tuberculosis* genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients between 2016-2018 across Namibia. For the purposes of this course, mainly to save time, we're only going to analyse 50 genomes from the dataset.
We will be analysing a dataset of Namibian _M. tuberculosis_ genomes that was recently published (Claasens 2022). The original dataset consisted of 136 drug-resistant TB isolates collected from patients across Namibia between 2016 and 2018. For the purposes of this course, mainly to save time, we're only going to analyse 50 genomes from the dataset.

## MTBC ancestral reference sequence

The most widely used reference genomes when doing reference-based alignment of MTBC short reads are the H37Rv type strain originally sequenced in 1998 and the putative MTBC ancestral sequence that was inferred by Comas *et al.* in 2013. As both of these sequences were based on lineage 4 sequences, they do not capture the complete structural variation likely to be found in the MTBC. To improve this ancestral sequence, Harrison *et al.* compared closed (i.e. complete with no gaps) genomes from across the MTBC and inferred a new MTBC ancestral sequence, MTBC<sub>0</sub> (Harrison 2023). This is the reference sequence we'll use this week.
The most widely used reference genomes when doing reference-based alignment of MTBC short reads are the H37Rv type strain originally sequenced in 1998 and the putative MTBC ancestral sequence that was inferred by Comas _et al._ in 2013. As both of these sequences were based on lineage 4 sequences, they do not capture the complete structural variation likely to be found in the MTBC. To improve this ancestral sequence, Harrison _et al._ compared closed (i.e. complete with no gaps) genomes from across the MTBC and inferred a new MTBC ancestral sequence, MTBC<sub>0</sub> (Harrison 2023). This is the reference sequence we'll use this week.

## Summary

@@ -36,11 +36,11 @@ The most widely used reference genomes when doing reference-based alignment of M

#### References

Claasens M, *et al.* Whole-Genome Sequencing for Resistance Prediction and Transmission Analysis of *Mycobacterium tuberculosis* Complex Strains from Namibia. *Microbiology Spectrum*. 2022. [DOI](https://doi.org/10.1128/spectrum.01586-22)
Claasens M, _et al._ Whole-Genome Sequencing for Resistance Prediction and Transmission Analysis of _Mycobacterium tuberculosis_ Complex Strains from Namibia. _Microbiology Spectrum_. 2022. [DOI](https://doi.org/10.1128/spectrum.01586-22)

Cole ST, *et al.* Deciphering the biology of *Mycobacterium tuberculosis* from the complete genome sequence. *Nature*. 1998. [DOI](https://doi.org/10.1038/31159)
Cole ST, _et al._ Deciphering the biology of _Mycobacterium tuberculosis_ from the complete genome sequence. _Nature_. 1998. [DOI](https://doi.org/10.1038/31159)

Harrison L, *et al.* An imputed ancestral reference genome for the *Mycobacterium tuberculosis* complex better captures structural genomic diversity for reference-based alignment workflows. *bioRxiv*. 2023. [DOI](https://doi.org/10.1101/2023.09.07.556366)
Harrison L, _et al._ An imputed ancestral reference genome for the _Mycobacterium tuberculosis_ complex better captures structural genomic diversity for reference-based alignment workflows. _bioRxiv_. 2023. [DOI](https://doi.org/10.1101/2023.09.07.556366)

World Health Organization. Global Tuberculosis Report 2021. Geneva: World Health Organization; 2021. [Link](https://www.who.int/publications/i/item/9789240037021)

28 changes: 10 additions & 18 deletions materials/06-intro_qc.md
@@ -15,19 +15,19 @@ title: "Introduction to QC"

Before we delve into our own genomic data, let's take time to explore what to look out for when performing **Q**uality **C**ontrol **(QC)** checks on our sequence data.
For this course, we will largely focus on next generation sequences obtained from Illumina sequencers.
As you may already know from [Introduction to NGS](01-intro_ngs.md), the main output files expected from our Illumina sequencer are `.fastq` files.
As you may already know from [Introduction to NGS](01-intro_ngs.md), the main output files expected from our Illumina sequencer are FASTQ files.

## QC assessment of NGS data

As you may already know, **QC** is an important part of any analysis. In this section we are going to look at some of the metrics and graphs that can be used to assess the QC of NGS data.
**QC** is an important part of any analysis and, in this section, we're going to look at some of the metrics and graphs that can be used to assess the QC of NGS data.

### Base quality

[Illumina sequencing](https://en.wikipedia.org/wiki/Illumina_dye_sequencing) technology relies on sequencing by synthesis. One of the most common problems with this is __dephasing__. For each sequencing cycle, there is a possibility that the replication machinery slips and either incorporates more than one nucleotide or fails to incorporate one at all. The more cycles that are run (i.e. the longer the read length), the more these errors accumulate. This leads to a heterogeneous population in the cluster and decreased signal purity, which in turn reduces the precision of the base calling. The figure below shows an example of this.

![Base Quality](images/base_qual.png)

Because of dephasing, it is possible to have high-quality data at the beginning of the read but really low-quality data towards the end of the read. In those cases you can decide to trim off the low-quality reads. In this course, we'll do this using the tool [`fastp`](https://www.ncbi.nlm.nih.gov/pubmed/30423086/). In addition to trimming and removing low quality reads, `fastp` will also be used to trim off Illumina adapter/primer sequences.
Because of dephasing, it is possible to have high-quality data at the beginning of the read but really low-quality data towards the end of the read. In those cases you can decide to trim off the low-quality bases at the end of the reads. In this course, we'll do this using the tool [fastp](https://www.ncbi.nlm.nih.gov/pubmed/30423086/). In addition to trimming and removing low-quality reads, `fastp` will also be used to trim Illumina adapter/primer sequences.
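
To make the idea of quality-based trimming more concrete, here is a minimal Python sketch. It assumes the standard Phred+33 encoding of FASTQ quality strings, and the function names and the threshold of Q20 are purely illustrative. This is not how `fastp` is implemented; it only shows the general principle of removing low-quality bases from the 3' end of a read.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


def trim_three_prime(sequence, quality_string, min_quality=20):
    """Remove bases from the 3' end until one reaches min_quality.

    Illustrative only: real trimmers typically use more robust rules.
    """
    scores = phred_scores(quality_string)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] < min_quality:
        keep -= 1
    return sequence[:keep], quality_string[:keep]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
seq, qual = trim_three_prime("ACGTACGTAC", "IIIIIIII##")
print(seq, qual)  # prints: ACGTACGT IIIIIIII
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, which is why it is a commonly chosen cut-off in examples like this one.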

The figures below show an example of high-quality read data (left) and poor quality read data (right).

@@ -39,12 +39,12 @@ The figures below show an example of high-quality read data (left) and poor qual

:::

In addition to __Phasing noise__ and __signal decay__ resulting from dephasing issues described above, there are several different reasons for a base to be called incorrectly. You can lookup these later by clicking [here](10.1093/bib/bbq077).
In addition to __phasing noise__ and __signal decay__ resulting from the dephasing issues described above, there are several other reasons for a base to be called incorrectly. You can look these up later by clicking [here](https://doi.org/10.1093/bib/bbq077).


### Mismatches per cycle

Aligning reads to a high-quality reference genome can provide insight to the quality of a sequencing run by showing you the mismatches to the reference sequence. This can help you detect cycle-specific errors. Mismatches can occur due to two main causes, sequencing errors and differences between your sample and the reference genome, which is important to bear in mind when interpreting mismatch graphs. The figures below show an example of a good run (top) and a bad one (bottom). In the first figure, the distribution of the number of mismatches is even between the cycles, which is what we would expect from a good run. However, in the second figure, two cycles stand out with a lot of mismatches compared to the other cycles.
Aligning reads to a high-quality reference genome can provide insights into the quality of a sequencing run by showing you the mismatches to the reference sequence. In particular, this can help you detect cycle-specific errors. Mismatches can occur due to two main causes: sequencing errors and differences between your sample and the reference genome; this is important to bear in mind when interpreting mismatch graphs. The figures below show an example of a good run (top) and a bad one (bottom). In the first figure, the distribution of the number of mismatches is even between the cycles, which is what we would expect from a good run. However, in the second figure, two cycles stand out with a lot of mismatches compared to the other cycles.
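
As a rough illustration of what a mismatches-per-cycle plot summarises, the following Python sketch counts mismatches at each read position. It makes strong simplifying assumptions that are not part of the course material: reads are given as hypothetical `(start, sequence)` pairs, aligned without gaps on the forward strand, and the reference is a plain string. Real tools work from BAM files and handle indels, soft clipping and reverse-strand reads.

```python
from collections import Counter


def mismatches_per_cycle(alignments, reference):
    """Count mismatches to the reference at each cycle (1-based read position).

    alignments: iterable of (start, read) pairs, where start is the 0-based
    reference position of the first base of an ungapped, forward-strand read.
    """
    counts = Counter()
    for start, read in alignments:
        for cycle, base in enumerate(read, start=1):
            if base != reference[start + cycle - 1]:
                counts[cycle] += 1
    return dict(counts)


reference = "ACGTACGTACGT"
alignments = [
    (0, "ACGTAC"),  # matches the reference at every cycle
    (2, "GTTCGT"),  # mismatch at cycle 3
    (4, "ACCTAC"),  # mismatch at cycle 3
]
print(mismatches_per_cycle(alignments, reference))  # prints: {3: 2}
```

In this toy example every mismatch falls on cycle 3, which is the kind of cycle-specific signal that points to a sequencing problem rather than true variation between the sample and the reference.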

![Good run](images/mismatch_per_cycle_pass.png)

@@ -64,7 +64,7 @@ It is a good idea to compare the GC content of the reads against the expected di
:::

### GC content by cycle
Looking at the GC content per cycle can help detect if the adapter sequence was trimmed. For a random library, it is expected to be little to no difference between the different bases of a sequence run, so the lines in this plot should be parallel with each other like in the first of the two figures below. In the second of the figures, the initial spikes are likely due to adapter sequences that have not been removed.
Looking at the GC content per cycle can help detect if the adapter sequence was trimmed. For a random library, there is expected to be little to no difference between the different bases of a sequence run, so the lines in this plot should be parallel with each other like in the first of the two figures below. In the second of the figures, the initial spikes are likely due to adapter sequences that have not been removed.
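
The per-cycle base composition behind this kind of plot can be sketched in a few lines of Python. This is only an illustration under simple assumptions: reads are plain strings, the function name is hypothetical, and any base other than A, C, G or T (for example N) counts towards the total but is not reported.

```python
from collections import Counter


def base_composition_per_cycle(reads):
    """Return, for each cycle, the fraction of reads showing A, C, G and T."""
    n_cycles = max(len(read) for read in reads)
    composition = []
    for i in range(n_cycles):
        # Only reads long enough to reach this cycle contribute to it.
        column = Counter(read[i] for read in reads if len(read) > i)
        total = sum(column.values())
        composition.append({base: column[base] / total for base in "ACGT"})
    return composition


reads = ["ACGT", "CGTA", "GTAC", "TACG"]
for cycle, fractions in enumerate(base_composition_per_cycle(reads), start=1):
    gc = fractions["G"] + fractions["C"]
    print(cycle, fractions, "GC =", gc)
```

For these four toy reads each base makes up a quarter of every cycle, so the lines in the plot would be flat and parallel. Leftover adapter sequence would instead show up as one base dominating particular cycles.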

![Good run](images/acgt_per_cycle_pass.png)

@@ -80,13 +80,15 @@ For paired-end sequencing the size of DNA fragments also matters. In the first o


### Insertions/Deletions per cycle
Sometimes, air bubbles occur in the flow cell, which can manifest as false indels. The spike in the second image provides an example of how this can look.
Sometimes, air bubbles occur in the flow cell, and this can manifest as false indels. The spike in the second image provides an example of how this can look.

![Good run](images/indels-per-cycle.pass.png)

![Poor run](images/indels-per-cycle.fail.png)


## Assessment of species composition

## Summary

::: {.callout-tip}
@@ -99,14 +101,4 @@ Information on this page has been adapted and modified from the following source

- https://github.com/sanger-pathogens/QC-training

- https://github.com/rpetit3/fastq-scan

- https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

- https://github.com/OpenGene/fastp

- https://github.com/DerrickWood/kraken2

- https://github.com/jenniferlu717/Bracken

- https://github.com/ewels/MultiQC
- https://www.bioinformatics.babraham.ac.uk/projects/fastqc/