restructuring

cambiotraining · Oct 28, 2024 · ab8b450 · ab8b450
1 parent 5afa990
commit ab8b450
Show file tree

Hide file tree

Showing 10 changed files with 249 additions and 116 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/materials/21-pres.md b/materials/21-pres.md
@@ -1,5 +1,5 @@
 ---
-title: Presentations
+title: QC and alignment based metagenomics (theory)
 ---
 
 ## Data formats, QC and data management

diff --git a/materials/22-pract.md b/materials/22-pract.md
@@ -1,5 +1,5 @@
 ---
-title: Practical
+title: QC and targeted alignment methods
 ---
 
 ## Finding a known genome in mixed microbial community
@@ -242,7 +242,7 @@ BAM files can be used for various other purposes:
 - Manipulating data in R by using the [Rsamtools](https://bioconductor.org/packages/release/bioc/html/Rsamtools.html) and [GenomicAlignments](https://bioconductor.org/packages/release/bioc/html/GenomicAlignments.html) packages
 :::
 
-## Compositional mapping of shutgun metagenomics data
+## Compositional mapping of shotgun metagenomics data
 
 The previously used whole-genome alignment based mapping method works very well if we know what we are looking for, e.g., we try to detect a known pathogen or follow the dynamic changes in a synthetic community. When we try to describe the composition of more complex communities the potential reference database would be so big using whole genomes (Bacteria + Archea + Fungi + Virus), so we apply different reduction strategies. The two main approaches are:(i) Finding biologically meaningful genetic signatures (e.g., genes that are only present in certain taxonomic group) and search for those in the raw data; (ii) extract numerous random short fragments from genomes and try to find exact matchings for those in the raw data.
 

diff --git a/materials/23-pract.md b/materials/23-pract.md
@@ -0,0 +1,112 @@
+---
+title: Finding an unknown genome
+---
+
+## Discover an unknown genome in low complexity mixed community
+
+In this practical we are simulating a virus infection that is caused by a completely unknown virus. In this simulated dataset a novel genome is hidden in the raw data with a huge background of human genome sequence. This simulates a real scenario when a pathogen is in the blood or other (otherwise sterile) body fluid. To make the bioinformatics step faster we generated the human "background" from chromosome 22, so the database will be relatively small.
+
+:::{.callout-important}
+#### Activate your software environment
+
+For this practical we need to activate the software environment called `assembly`:
+
+```bash
+mamba activate assembly
+```
+:::
+
+
+### QC and Pre-processing
+
+The raw date quality control and pre-processing is going the same way as we did with the mixed community data, for the details on these steps, please refer to [the appropriate practical material](22-pract.html#standard-quality-control-and-pre-processing-of-shotgun-metagenomics-raw-data).
+
+```bash
+cd sg_raw_data/
+fastqc unknown_pathogen_R1.fastq unknown_pathogen_R2.fastq
+
+cutadapt -a CTGTCTCTTATACACATCT -A ATGTGTATAAGAGACA \
+-o unknown_pathogen_noadapt_R1.fastq -p unknown_pathogen_noadapt_R2.fastq \
+unknown_pathogen_R1.fastq unknown_pathogen_R2.fastq
+
+trimmomatic PE -phred33 \
+unknown_pathogen_noadapt_R1.fastq unknown_pathogen_noadapt_R2.fastq \
+unknown_forward_paired.fq.gz unknown_forward_unpaired.fq.gz \
+unknown_reverse_paired.fq.gz unknown_reverse_unpaired.fq.gz \
+LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
+```
+
+### Aligning reads to human chromosome 22
+
+First let's try to align our raw data to the known genome content (in this case the human chromosome 22 sequence). This is again going similarly to our previous practical, first we create a Bowtie2 database, then perform the alignment.
+
+```bash
+cd sg_reference/
+bowtie2-build -q Homo_sapiens.GRCh38.dna.chromosome.22.fa human_chr22
+
+cd ..
+
+bowtie2 --qc-filter -p 8 --local -x sg_reference/human_chr22 \
+-1 sg_raw_data/unknown_forward_paired.fq.gz \
+-2 sg_raw_data/unknown_reverse_paired.fq.gz \
+-S results/unknown_pathogen.sam
+
+# Inspect the result .sam file
+less results/unknown_pathogen.sam
+```
+
+If we look at the alignment result, we can see the aligning short reads to the human chromosome 22 reference sequence. Reads that are not aligning to the reference genome are also in the SAM file, but in this way not really in a usable format.
+
+::: {.callout-exercise}
+#### Extract reads that are not aligning to the human chromosome 22 for further processing
+{{< level 3 >}}
+Refer to the command line help (`bowtie2 -h`) to read about the numerous options the application provides. Re-run the bowtie2 alignment step with the following modifications:
+- use all output fasta files from the `trimmomatic` output as unpaired reads
+- Find and use the option that prints out short reads that are **not aligning** to the reference genome
+- As we don't need the alignment SAM file, find a way to "throw out" the SAM file during the alignment
+- name the output FASTQ file as `unknown_enriched.fastq`
+
+::: {.callout-answer}
+```bash
+bowtie2 --qc-filter -p 8 --local \
+-x sg_reference/human_chr22 \
+-U sg_raw_data/unknown_forward_paired.fq.gz,sg_raw_data/unknown_reverse_paired.fq.gz,sg_raw_data/unknown_forward_unpaired.fq.gz,sg_raw_data/unknown_reverse_unpaired.fq.gz \
+--un sg_raw_data/unknown_enriched.fastq > /dev/null
+```
+
+Alternatively, you can concatenate all input FASTQ files into one input file to avoid the long comma separated list.
+
+```bash
+cat sg_raw_data/unknown_forward_paired.fq.gz \
+sg_raw_data/unknown_reverse_paired.fq.gz \
+sg_raw_data/unknown_forward_unpaired.fq.gz \
+sg_raw_data/unknown_reverse_unpaired.fq.gz > unknown_unpaired_library.fastq
+
+bowtie2 --qc-filter -p 8 --local \
+-x sg_reference/human_chr22 \
+-U unknown_unpaired_library.fastq \
+--un sg_raw_data/unknown_enriched.fastq > /dev/null
+```
+:::
+:::
+
+### Assemble the unknown genome
+
+After eliminating the significant amount of human genome background from our raw data, we still only have the short next-generation sequencing reads for our unknown genome. In the next step we will use one of the most popular genome assembler [SPAdes](https://github.com/ablab/spades), the application that can be used for both single genome and multiple genomes (metagenome) assembly. After looking at the command line help, we will execute the `spades.py` script with very basic settings (input reads and output folder).
+
+```bash
+spades.py
+
+# Run the assembly using all the default settings, only giving the input raw data file and the output folder
+spades.py -s sg_raw_data/unknown_enriched.fastq -o results/unknown_genome
+
+# The assembly results will be in the results/unknown_genome/ folder, check the output files, logs, warnings
+cd results/unknown_genome/
+
+```
+
+:::{.callout-exercise}
+#### Investigate the output of the assembly step
+
+Go through the output files of the `spades.py` script together with the trainers. Look into the log files, discuss the files that are worth to keep long term. Investigate the level of fragmentation of the assembled novel genome.
+:::
diff --git a/materials/24-pract.md b/materials/24-pract.md
@@ -0,0 +1,128 @@
+---
+title: Alignment based compositional profiling
+---
+
+## Compositional mapping of shotgun metagenomics data
+
+The previously used whole-genome alignment based mapping method works very well if we know what we are looking for, e.g., we try to detect a known pathogen or follow the dynamic changes in a synthetic community. When we try to describe the composition of more complex communities the potential reference database would be so big using whole genomes (Bacteria + Archea + Fungi + Virus), so we apply different reduction strategies. The two main approaches are:(i) Finding biologically meaningful genetic signatures (e.g., genes that are only present in certain taxonomic group) and search for those in the raw data; (ii) extract numerous random short fragments from genomes and try to find exact matchings for those in the raw data.
+
+### MetaPhlAn, a marker gene based profiling method
+
+[MetaPhlAn](https://github.com/biobakery/MetaPhlAn/wiki/MetaPhlAn-4) is using a large set of marker genes (~5.1 million) to distinguish between different taxonomic groups (clades). The method can reliably identify genomes down to the species level, further resolution (down to strain level) is possible with the developer's other algorithm [StrainPhlAn](https://github.com/biobakery/MetaPhlAn/wiki/StrainPhlAn-4.1). The application uses its own database of selected genes, and due to the large amount of information it uses, it requires a relatively powerful computer to run (at least 6-8 CPU cores and 32GB RAM is highly recommended).The installation and usage is very well documented on the GitHub site (link above).
+
+We will do the basic profiling on our pre-processed raw data from the synthetic community and will inspect the output results.
+
+```bash
+metaphlan sg_raw_data/mixedcomm_forward_paired.fq.gz,sg_raw_data/mixedcomm_reverse_paired.fq.gz \
+--bowtie2out results/metaphlan.bowtie2.bz2 --nproc 5 --input_type fastq -o results/profiled_metagenome.txt
+
+less results/profiled_metagenome.txt
+
+# let's save the species level taxonomic and abundance information for future comparison
+grep "s__" results/profiled_metagenome.txt | grep -v "t__" > results/profiled_metagenome_sp_only.txt
+
+```
+
+### MASH, a random k-mer based profiling method
+
+There are several bioinformatics applications based on the extraction of multiple k-mers from genomes and then use those to detect the presence of these genomes in raw sequencing or assembled data from mixed microbial communities. The two most popular methods are [MASH](https://mash.readthedocs.io/en/latest/) and [Kraken2](https://ccb.jhu.edu/software/kraken2/). K-mers are short (typically 14-30 nucleotides) DNA fragment that are randomly cut out from genomes. The number of k-mer per genome is between a few hundred and a few thousand depending on the required final sensitivity and specificity. If we take an average setting, where we extract 1000 random 21-mer fragments from an average bacterial genome (~5 million basepairs in length), we only cover less than 5% of the genome. While this is a significant reduction in term of information content, it can still be enough to identify species with good sensitivity and specificity. In the following practical, we will demonstrate the usage of `mash`, we will create our own databases of random k-mers (with different length and number of k-mers) and will also use some generic databases (originally created by using the NCBI RefSeq database).
+
+First, let's check the different functions of `mash`, look at their usage and inspect the RefSeq databases.
+
+```bash
+# Exit from the metagenomics environment and go back to the base
+conda deactivate
+
+mash
+
+mash info sg_reference/RefSeqSketches.msh | head -n 20
+mash info sg_reference/RefSeqSketchesDefaults.msh | head -n 20
+
+mash sketch -h
+```
+
+We will create a new sketch file from the genomes we used to build up our synthetic community. While this sketch database will have very limited usage, we use this small set of genomes to demonstrate the process of sketching (random k-mer extraction). Doing the same process on meaningful (and potentially very large) databases would take much more time. First we examine the input data, then create a small and 'light-weight' database. 
+
+```bash
+
+cd sg_reference/
+
+less mixed_bacterial_community_ncbi_genomes.fasta
+grep ">" mixed_bacterial_community_ncbi_genomes.fasta
+grep ">" mixed_bacterial_community_ncbi_genomes.fasta | wc -l
+
+mash sketch -i -s 500 -k 16 -o mixed_community mixed_bacterial_community_ncbi_genomes.fasta
+
+# Compare the original fasta file size with the sketch size
+ls -lh
+
+mash info mixed_community.msh
+```
+
+::: {.callout-exercise}
+#### Create a high resolution sketch database from the same input data
+{{< level 2 >}}
+Modifying the options of `mash sketch` by extracting __5000 k-mers__ with the __size of 21__ nucleotides, and save it in an output file named `mixed_community_hr` (referring to high resolution). Compare the size of the two different sketch databases and the original FASTA file, print out the basic information about the new Sketch database. Refer to `mash sketch -h` if you need information on the different options.
+
+::: {.callout-answer}
+```bash
+mash sketch -i -s 5000 -k 21 -o mixed_community_hr mixed_bacterial_community_ncbi_genomes.fasta
+
+ls -lh
+
+mash info mixed_community_hr.msh
+```
+:::
+:::
+
+In the next part, you will learn how to profile a mixed community using `mash screen`. After looking at the general help for the command, you will test a scenario when you have either a full genome or just fragments of it. The second reference sequence (`NZ_CP038419_1.fasta`) you will screen with `mash` is not the full genome of the bacteria but only the collection of its genes. While bacterial genomes have low amount of non-coding DNA, the genome is still highly fragmented if each gene is separated to individual fasta records. Not exactly the same way but ending up with highly fragmented genomes in metagenomics studies is very common (especially in de novo assembly). It can happen, that the randomly extracted k-mer in the original genome was in a region that is split into two in the fragmenteg genome. In this case the search algorithm will not find a good match for that k-mer.
+
+:::{.callout-note}
+The output columns for `mash screen` are the following:
+
+- identity: level of similarity between the query and the database reference sequence
+- shared-hashes: number of matching hashes (k-mers) between the query and the database reference sequence
+- median-multiplicity
+- p-value of false detection
+- query-ID of the database entry
+- query-comment of the database entry
+:::
+
+```bash
+mash screen -h
+
+less NZ_CP034931.fa
+grep ">" NZ_CP034931.fa
+
+less NZ_CP038419_1.fasta
+grep -c ">" _NZ_CP038419_1.fasta
+
+mash screen mixed_community.msh NZ_CP034931.fa
+mash screen mixed_community.msh NZ_CP038419_1.fasta
+```
+
+Notice, that in both cases not only the query genome came up with high number of shared hashes. This is caused by high similarity between certain genomes but can be avoided by using the `-w` option. Please read more about this option in the application's help `mash screen -h` or on the [application's website](https://mash.readthedocs.io/en/latest/tutorials.html#screening-a-read-set-for-containment-of-refseq-genomes). Let's try running the same commands with the `-w` option and compare the results
+
+```bash
+mash screen -w mixed_community.msh NZ_CP034931.fa
+mash screen mixed_community.msh NZ_CP038419_1.fasta
+
+mash screen -w mixed_community.msh NZ_CP038419_1.fasta
+mash screen mixed_community_hr.msh NZ_CP038419_1.fasta
+```
+In our previous examples, we used a database that was generated from the genomes we mixed together in the synthetic community against a single genome that we knew was in the synthetic community. This is rarely the case in real life settings, let's try to screen our raw sequencing data with both our own database and the general RefSeq databases. For the bigger databases we can also try to speed up the process by using multiple CPU cores. Please note that we only print out the 50 best (based on identity level) hist from the big database screens, otherwise the terminal would be flooded with the results.
+
+```bash
+mash screen -w mixed_community_hr.msh \
+../sg_raw_data/mixedcomm_forward_paired.fq.gz \
+../sg_raw_data/mixedcomm_reverse_paired.fq.gz
+
+mash screen -w RefSeqSketchesDefaults.msh \
+../sg_raw_data/mixedcomm_forward_paired.fq.gz \
+../sg_raw_data/mixedcomm_reverse_paired.fq.gz | sort -gr -k 1 | head -n 50
+
+mash screen -w RefSeqSketches.msh \
+../sg_raw_data/mixedcomm_forward_paired.fq.gz \
+../sg_raw_data/mixedcomm_reverse_paired.fq.gz | sort -gr -k 1 | head -n 50
+
+```
diff --git a/materials/31-pres.md b/materials/31-pres.md
@@ -1,5 +1,5 @@
 ---
-title: Presentations
+title: De novo and Hi-C metagenomics (theory)
 ---
 ## De novo metagenomics and Hi-C metagenomics