Skip to content

Commit

Permalink
Fix minor issues in mapping tutorial (galaxyproject#1638)
Browse files Browse the repository at this point in the history
Fix minor issues in mapping tutorial
  • Loading branch information
bebatut authored Oct 24, 2019
2 parents b4e8a67 + 79b0c7a commit 072f729
Show file tree
Hide file tree
Showing 4 changed files with 24 additions and 24 deletions.
10 changes: 5 additions & 5 deletions topics/sequence-analysis/tutorials/mapping/bam_explanation.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,23 +6,23 @@ A BAM ([Binary Alignment Map](https://en.wikipedia.org/wiki/SAM_(file_format)))
>
{: .hands_on}

A BAM file (or a SAM file, the non compressed version) consists of:
A BAM file (or a SAM file, the non-compressed version) consists of:

- A header section (the lines starting with `@`) containing metadata, in particular the chromosome names and lengths (lines starting with the `@SQ` symbol)
- A header section (the lines starting with `@`) containing metadata particularly the chromosome names and lengths (lines starting with the `@SQ` symbol)
- An alignment section consisting of a table with 11 mandatory fields, as well as a variable number of optional fields:

Col | Field | Type | Brief Description
--- | --- | --- | ---
1 | QNAME | String | Query template NAME
2 | FLAG | Integer | bitwise FLAG
2 | FLAG | Integer | Bitwise FLAG
3 | RNAME | String | References sequence NAME
4 | POS | Integer | 1- based leftmost mapping POSition
5 | MAPQ | Integer | MAPping Quality
6 | CIGAR | String | CIGAR String
7 | RNEXT | String | Ref. name of the mate/next read
8 | PNEXT | Integer | Position of the mate/next read
9 | TLEN | Integer | observed Template LENgth
10 | SEQ | String | segment SEQuence
9 | TLEN | Integer | Observed Template LENgth
10 | SEQ | String | Segment SEQuence
11 | QUAL | String | ASCII of Phred-scaled base QUALity+33

> ### {% icon question %} Questions
Expand Down
4 changes: 2 additions & 2 deletions topics/sequence-analysis/tutorials/mapping/igv.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,12 +21,12 @@ The reads have a direction: they are mapped to the forward or reverse strand, re
>
> > ### {% icon solution %} Solution
> > 1. If a nucleotide differs from the reference sequence in more than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base.
> > 2. They have a mapping quality equal to zero. Interpretation of this mapping quality depends on the mapping aligner as some commonly used aligners use this convention to mark a read with multiple alignments. In such a case, the read also maps to another location with equally good placement. It is also possible the read could not be uniquely placed but the other placements do not necessarily give equally good quality hits.
> > 2. They have a mapping quality equal to zero. Interpretation of this mapping quality depends on the mapping aligner as some commonly used aligners use this convention to mark a read with multiple alignments. In such a case, the read also maps to another location with equally good placement. It is also possible that the read could not be uniquely placed but the other placements do not necessarily give equally good quality hits.
> {: .solution }
{: .question}

> ### {% icon comment %} Tips for IGV
> 1. Because the number of reads over a region can be quite large, the IGV browser by default only allows to see the reads that fall into a small window. This behaviour can be changed in the IGV from `view > Preferences > Alignments`.
> 1. Because the number of reads over a region can be quite large, the IGV browser by default only displays the reads that fall into a small window. This behaviour can be changed in the IGV from `view > Preferences > Alignments`.
> 2. If the genome of your interest is not there check if it is available via **More...**. If this is not the case, you can add it manually via the menu **Genomes** -> **Load Genome from...**
>
> ![Select genome in IGV](../../images/igv_select_genome.png "Select genome in IGV")
Expand Down
4 changes: 2 additions & 2 deletions topics/sequence-analysis/tutorials/mapping/jbrowse.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,11 +22,11 @@ The reads have a direction: they are mapped to the forward or reverse strand, re

> ### {% icon question %} Questions
>
> 1. What does the teardrop shape and line mean in the autogenerated SNP track?
> 1. What do the teardrop shape and line mean in the autogenerated SNP track?
> 2. What do differently coloured reads mean?
>
> > ### {% icon solution %} Solution
> > 1. If enough reads have a different value, then it is marked with a teardrop icon. The coverage plot is marked in height with percentage of reads with a different call at that position
> > 1. If enough reads have a different value, then it is marked with a teardrop icon. The coverage plot is marked in height with the percentage of reads with a different call at that position
> > 2. Colour Codes:
> >
> > Colour | Meaning
Expand Down
30 changes: 15 additions & 15 deletions topics/sequence-analysis/tutorials/mapping/tutorial.md
Original file line number Diff line number Diff line change
Expand Up @@ -45,9 +45,9 @@ Sequencing produces a collection of sequences without genomic context. We do not

The short reads do not come with position information, so we do not know what part of the genome they came from. We need to use the sequence of the read itself to find the corresponding region in the reference sequence. But the reference sequence can be quite long (~3 billion bases for human), making it a daunting task to find a matching region. Since our reads are short, there may be several, equally likely places in the reference sequence from which they could have been read. This is especially true for repetitive regions.

In principle, we could do a BLAST analysis to figure out where the sequenced pieces fit best in the known genome. We would need to do that for each of the millions of reads in our sequencing data. Aligning millions of short sequences this way may, however, take a couple of weeks. And we do not really care about the exact base to base correspondence (alignment). What we are really interested in is "where these reads came from". This approach is called **mapping**.
In principle, we could do a BLAST analysis to figure out where the sequenced pieces fit best in the known genome. We would need to do that for each of the millions of reads in our sequencing data. Aligning millions of short sequences this way may, however, take a couple of weeks. And we do not care about the exact base to base correspondence (alignment). What we are interested in is "where these reads came from". This approach is called **mapping**.

In the following we will process a dataset with the mapper **Bowtie2** and we will visualize the data with the program **IGV**.
In the following, we will process a dataset with the mapper **Bowtie2** and we will visualize the data with the program **IGV**.

> ### Agenda
>
Expand Down Expand Up @@ -95,11 +95,11 @@ There is a dedicated tutorial for [quality control]({{ site.baseurl }}{% link to
Read mapping is the process to align the reads on a reference genomes. A mapper takes as input a reference genome and a set of reads. Its aim is to align each read in the set of reads on the reference genome, allowing mismatches, indels and clipping of some short fragments on the two ends of the reads:
![Explanation of mapping](../../images/mapping/mapping.png "Illustration of the mapping process. The input consists of a set of reads and a reference genome. It the middle, it gives the results of mapping: the locations of the reads on the reference genome. The first read is aligned at position 100 and the alignment has two mismatches. The second read is aligned at position 114. It is a local alignment with clippings on the left and on the right. The third read is aligned at position 123. It consists of a 2-base insertion and a 1-base deletion.")
![Explanation of mapping](../../images/mapping/mapping.png "Illustration of the mapping process. The input consists of a set of reads and a reference genome. In the middle, it gives the results of mapping: the locations of the reads on the reference genome. The first read is aligned at position 100 and the alignment has two mismatches. The second read is aligned at position 114. It is a local alignment with clippings on the left and right. The third read is aligned at position 123. It consists of a 2-base insertion and a 1-base deletion.")
We need a reference genome to map the reads on.
{% include topics/sequence-analysis/tutorials/mapping/ref_genome_explanation.md answer_3="This data comes from ChIP-seq of mice, so we will use mm10 (*Mus musculus*)."%}
{% include topics/sequence-analysis/tutorials/mapping/ref_genome_explanation.md answer_3="This data comes from the ChIP-seq of mice, so we will use mm10 (*Mus musculus*)."%}
Currently, there are over 60 different mappers, and their number is growing. In this tutorial, we will use [Bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/), a fast and memory-efficient open-source tool particularly good at aligning sequencing reads of about 50 up to 1,000s of bases to relatively long genomes.
Expand Down Expand Up @@ -132,11 +132,11 @@ Currently, there are over 60 different mappers, and their number is growing. In
> > > ### {% icon solution %} Solution
> > > 1. The information given here is a quantity one. We can see how many sequences are aligned. It does not tell us something about the quality.
> > > 2. ~90% reads have been aligned exactly 1 time
> > > 3. ~7% reads have been aligned concordantly >1 times. These are called multi-mapped reads. It can happen because of repetitions in the reference genome (multiple copies of a gene for example), specially when the reads are small. It is difficult to decide where these sequences come from so most pipelines ignores them. Always check the statistics there to be sure of not removing to much information by discarding them in any downstream analyses.
> > > 3. ~7% reads have been aligned concordantly >1 times. These are called multi-mapped reads. It can happen because of repetitions in the reference genome (multiple copies of a gene for example), particularly when the reads are small. It is difficult to decide where these sequences come from and therefore most of the pipelines ignore them. Always check the statistics there to be sure of not discarding too much information in any downstream analyses.
> > > 4. ~3% pair of reads have not been mapped because
> > > - both reads in the pair aligned but their positions do not concord with pair of reads (`aligned discordantly 1 time`)
> > > - reads of these pairs are multi-mapped (`aligned >1 times` in `pairs aligned 0 times concordantly or discordantly`)
> > > - one read of these paires are mapped but not the paired read (`aligned exactly 1 time` in `pairs aligned 0 times concordantly or discordantly`)
> > > - one read of these pairs are mapped but not the paired read (`aligned exactly 1 time` in `pairs aligned 0 times concordantly or discordantly`)
> > > - the rest are not mapped at all
> > {: .solution }
> {: .question}
Expand All @@ -145,8 +145,8 @@ Currently, there are over 60 different mappers, and their number is growing. In
Checking the mapping statistics is an important step to do before continuing any analyses. There are several potential sources for errors in mapping, including (but not limited to):
- **Polymerase Chain Reaction (PCR) artifacts**: Many HTS methods involve one or multiple PCR steps. PCR errors will show as mismatches in the alignment, and especially errors in early PCR rounds will show up in multiple reads, falsely suggesting genetic variation in the sample. A related error would be PCR duplicates, where the same read pair occurs multiple times, skewing coverage calculations in the alignment.
- **Sequencing errors**: The sequencing machine can make an erroneous call, either for physical reasons (e.g. oil on an Illumina slide), or due to properties of the sequenced DNA (e.g., homopolymers). As sequencing errors are often random, they can be filtered out as singleton reads during variant calling.
- **Polymerase Chain Reaction (PCR) artifacts**: Many high-throughput sequencing (HTS) methods involve one or multiple PCR steps. PCR errors will show as mismatches in the alignment, and especially errors in early PCR rounds will show up in multiple reads, falsely suggesting genetic variation in the sample. A related error would be PCR duplicates, where the same read pair occurs multiple times, skewing coverage calculations in the alignment.
- **Sequencing errors**: The sequencing machine can make an erroneous call either for physical reasons (e.g. oil on an Illumina slide) or due to properties of the sequenced DNA (e.g., homopolymers). As sequencing errors are often random, they can be filtered out as singleton reads during variant calling.
- **Mapping errors**: The mapping algorithm can map a read to the wrong location in the reference. This often happens around repeats or other low-complexity regions.
So if the mapping statistics are not good, you should investigate the cause of these errors before going further in your analyses.
Expand All @@ -157,7 +157,7 @@ After that, you should have a look at the reads and inspect the BAM file where t
{% include topics/sequence-analysis/tutorials/mapping/bam_explanation.md mapper="Bowtie2" %}
So the BAM file integrates many information for each read, in particular the quality of mapping.
The BAM file includes a lot of information about each read, particularly the quality of mapping.
> ### {% icon hands_on %} Hands-on: Summary of mapping quality
> 1. **Samtools stats - generate statistics for BAM dataset** {% icon tool %} with the following parameters
Expand All @@ -169,29 +169,29 @@ So the BAM file integrates many information for each read, in particular the qua
>
> > ### {% icon question %} Questions
> >
> > 1. Which proportion of mismatches are in the mapped reads when aligned to the reference genome?
> > 1. What is the proportion of mismatches in the mapped reads when aligned to the reference genome?
> > 2. What does the error rate represent?
> > 3. What is the average quality? How is it represented?
> > 4. What is the insert size average?
> > 5. How many reads have a mapping quality score below 20?
> >
> > > ### {% icon solution %} Solution
> > > 1. There are ~21,900 mismatches for ~4,753,900 bases mapped so ~$$5 x 10^{-3}$$ mismatches per mapped bases
> > > 2. The error rate is the proportion of mismatches per mapped bases, so the ratio computed right before
> > > 3. The average quality is the mean quality score of the mapping. It is a Phred score, as the one used in the FASTQ file for each nucleotide. But here the score is not per nucleotide, but per read. And it represents the probability of mapping quality
> > > 1. There are ~21,900 mismatches for ~4,753,900 bases mapped which on average produces ~0.005 mismatches per mapped bases.
> > > 2. The error rate is the proportion of mismatches per mapped bases, so the ratio computed right before.
> > > 3. The average quality is the mean quality score of the mapping. It is a Phred score like the one used in the FASTQ file for each nucleotide. But here the score is not per nucleotide, but per read and it represents the probability of mapping quality.
> > > 4. The insert size is the distance between the two reads in the pairs.
> > > 5. To get the info:
> > > 1. **Filter BAM datasets on a variety of attributes** {% icon tool %} with a filter to keep only the reads with a mapping quality >= 20
> > > 2. **Stats generate statistics for BAM dataset** {% icon tool %} on the output of **Filter**
> > >
> > > Before filtering: 95,412 reads - After filtering: 89,664 reads
> > > Before filtering: 95,412 reads and after filtering: 89,664 reads.
> > {: .solution }
> {: .question}
{: .hands_on}
# Visualization using a Genome Browser (IGV)
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. In the following we will use it to visualize the mapped reads.
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. In the following, we will use it to visualize the mapped reads.
{% include topics/sequence-analysis/tutorials/mapping/igv.md tool="Bowtie2" region_to_zoom="chr2:98,666,236-98,667,473" %}
Expand Down

0 comments on commit 072f729

Please sign in to comment.