Merge pull request #99 from ARTbio/visu-counting

Finish visualisation exercises and go to read counting
ARTbio · Jan 23, 2024 · 02083a0 · 02083a0
2 parents f03eebf + 2ee8e32
commit 02083a0
Show file tree

Hide file tree

Showing 16 changed files with 374 additions and 137 deletions.
diff --git a/docs/bulk_RNAseq-IOC/06_read_filtering.md b/docs/bulk_RNAseq-IOC/06_read_filtering.md
diff --git a/docs/bulk_RNAseq-IOC/06_sequence_quality_read_filtering.md b/docs/bulk_RNAseq-IOC/06_sequence_quality_read_filtering.md
@@ -0,0 +1,246 @@
+## Fastq files and Fastq quality
+
+Next generation sequencing report sequences in the FASTQ format.
+
+In this FASQTQ format, both the sequence letter and quality score are each encoded with a
+single ASCII character for brevity and for alignment between a reported nucleotide and
+the quality of its calling.
+
+### FASTQ format
+A FASTQ file has four line-separated fields per sequence:
+
+- Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
+- Line 2 is the raw sequence letters.
+- Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
+- Line 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence.
+
+!!! info "A FASTQ file containing a single sequence might look like this:"
+    ```
+    @A00680:51:HF2TKDRXX:2:2133:6596:12289/1
+    AGAGTAAGTCTTTGTATTTTATGCTACTGTACCTCTGGGATTAATTGCTC
+    +
+    BFCHBBEGAHDDDHABCDDDBCHFDCFDHABGEDGDHGGCCDBECEHFDG
+                  123
+    ```
+
+    | Q letter | Q score | Prob(Err) |
+    | -------- | ------- | --------- |
+    | A (Pos 1)| 32      | 0.00063   |
+    | B (Pos 2)| 33      | 0.0005    |
+    | C (Pos 3)| 34      | 0.00039   |
+
+The quality score (or Q-score) expresses an error probability. In particular, it serves as
+a convenient and compact way to communicate very small error probabilities.
+
+Given an assertion, A, the quality score, Q(A), expresses the probability that A is not true,
+P(~A), according to the relationship:
+
+`Q(A) =-10 log10(P(~A))`
+
+The relationship between the quality score and error probability is demonstrated with the
+following table:
+
+| Quality score, Q(A) | Error probability, P(~A) |
+| ------------------- | ------------------------ |
+| 10                  | 0.1                      |
+| 20                  | 0.01                     |
+| 30                  | 0.001                    |
+
+### Quality Score Encoding
+In FASTQ files, quality scores are encoded into a compact form, which uses only 1 byte per
+quality value. In this encoding, the quality score is represented as the character with an
+ASCII code equal to its value + 33 (offset of +33). The following table demonstrates the
+relationship between the encoding character, its ASCII code, and the quality score represented.
+
+??? info "relationship between the encoding character, its ASCII code, and the quality score represented"
+
+    | Symbol | ASCII Code | Q-Score |
+    | ------ | ---------- | ------- |
+    | !      | 33         | 0       |
+    | "      | 34         | 1       |
+    | #      | 35         | 2       |
+    | $      | 36         | 3       |
+    | %      | 37         | 4       |
+    | &      | 38         | 5       |
+    | '      | 39         | 6       |
+    | (      | 40         | 7       |
+    | )      | 41         | 8       |
+    | \*     | 42         | 9       |
+    | +      | 43         | 10      |
+    | ,      | 44         | 11      |
+    | \-     | 45         | 12      |
+    | .      | 46         | 13      |
+    | /      | 47         | 14      |
+    | 0      | 48         | 15      |
+    | 1      | 49         | 16      |
+    | 2      | 50         | 17      |
+    | 3      | 51         | 18      |
+    | 4      | 52         | 19      |
+    | 5      | 53         | 20      |
+    | 6      | 54         | 21      |
+    | 7      | 55         | 22      |
+    | 8      | 56         | 23      |
+    | 9      | 57         | 24      |
+    | :      | 58         | 25      |
+    | ;      | 59         | 26      |
+    | <      | 60         | 27      |
+    | \=     | 61         | 28      |
+    | \>     | 62         | 29      |
+    | ?      | 63         | 30      |
+    | @      | 64         | 31      |
+    | A      | 65         | 32      |
+    | B      | 66         | 33      |
+    | C      | 67         | 34      |
+    | D      | 68         | 35      |
+    | E      | 69         | 36      |
+    | F      | 70         | 37      |
+    | G      | 71         | 38      |
+    | H      | 72         | 39      |
+    | I      | 73         | 40      |
+
+### Assessing quality with FastQC
+
+The [FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) tool provides a
+simple way to examine quality metrics for raw sequence data coming from high throughput
+sequencing platforms. It provides a modular set of analyses which you can use to give a
+quick impression of whether your data has any problems of which you should be aware before
+doing any further analysis.
+
+The main functions of FastQC are:
+
+- [x] Import of data from BAM, SAM or FastQ files (any variant)
+- [x] Providing a quick overview to tell you in which areas there may be problems
+- [x] Summary graphs and tables to quickly assess your data
+- [x] Export of results to an HTML based permanent report
+
+### Quick discussion of quality metrics assessed by FastQC
+
+??? info "Basic Statistics... are basic statitics"
+    | Measure                           | Value                   |
+    | --------------------------------- | ----------------------- |
+    | Filename                          | GCB_Mg_S12_forward.gz   |
+    | File type                         | Conventional base calls |
+    | Encoding                          | Sanger / Illumina 1.9   |
+    | Total Sequences                   | 7998924                 |
+    | Total Bases                       | 1.2 Gbp                 |
+    | Sequences flagged as poor quality | 0                       |
+    | Sequence length                   | 151                     |
+    | %GC                               | 48                      |
+
+??? info "Per base sequence quality"
+    is expected to be at minimum of ~30 (.001 probability of wrong base calling)
+
+    ![](images/fastQCscreenshots/per_base_sequence_quality.png){width="800"}
+
+??? info "Per tile sequence quality"
+    is the representation of the mean Quality in function of the position on the
+    Affymetrix chip. Non homogeneous quality suggest issues during the sequencing
+    procedure, of low quality chip.
+
+    ![](images/fastQCscreenshots/Per_tile_sequence_quality.png){width="800"}
+
+??? info "Per sequence quality scores"
+    Shows the distribution of the quality of all sequences. Generally a high pic of
+    sequences with high quality is observed, whereas a long tail of small fractions of
+    total number of sequences have low quality.
+
+    ![](images/fastQCscreenshots/Per_sequence_quality_scores.png){width="800"}
+
+??? info ":warning: Per base sequence content"
+    Shows the % of each of the four nucleotides in function of their position in the read.
+    Divergent fraction, often at the beginning of the reads may be due to one or several
+    combined factors:
+
+    - Illumina devices used to use the 5 first sequencing cycles to calibrate their
+      fluorescence detector. Adjustment of the sensitivity of the four wave length chanels
+      during the first cycles may cause divergent Nucleotides contents. This factor is
+      likely minor with the recent Illumina devices.
+    - If a small number of nucleotides belongs to an adapter or sequence primer, A strong
+      bias in nucleotide contents is expected.
+    - Nowadays, preparation of library using a transposase to generate short fragments
+      (tagmentation) has become a popular procedure. This transposase does not cut completly
+      randomly the DNA. Even a slight bias in tagmentation sites may be responsible for
+      divergent fractions of nucleotides at the beginning of the reads.
+
+    ![](images/fastQCscreenshots/Per_base_sequence_content.png){width="800"}
+
+
+??? info "Per sequence GC content"
+    The distribution of the GC content of sequencing read is in general remarkably stable
+    for a given organism.
+    Divergence observed from the theoritical distribution may indicate an overrepresentation
+    of a few sequences such as ribosomal RNAs or tRNAs (as is the case in the exemple below).
+
+    Contamination of samples by another
+    organism coming either from the experiment or from the experiment of another researcher
+    may also be the cause of divergence from the theoritical distribution.
+
+    ![](images/fastQCscreenshots/Per_sequence_GC_content.png){width="800"}
+
+??? info "Per base N content"
+    is expected to be low, otherwise there must have been issues during the base calling
+    by the illumina device.
+
+??? info "Sequence Length Distribution"
+    is generally an unimodal curve with a pic at the lenght of the reads provided by the
+    platform.
+
+??? info ":warning: Sequence Duplication Levels"
+    is an important metrics since some analysts are advising to remove PCR duplicates from
+    sequencing datasets before perfoming read counting. Althought we must keep in mind that
+    identical molecules (same sequence and same length) may still be biological duplicates
+    (Think about tRNAs or miRNAs for instance), the advice seems holding for mRNAs.
+
+    If you plan to remove "PCR duplicates", this plot allows you to anticipate how usable
+    reads will be available for RNA profiling after de-duplications.
+
+    High duplication levels may be due to the presence of highly expressed RNAs as mentionned
+    above, or to over PCR-amplification of the library.
+
+    ![](images/fastQCscreenshots/Sequence_Duplication_Levels.png){width="800"}
+
+??? info "Overrepresented sequences"
+    Are indicative of the presence of contaminating adapters, rRNA, tRNA, etc.
+
+??? info "Adapter Content"
+    Provide the detection of know adapter sequences, generally (but not always) next to
+    the ends of reads.
+
+    ![](images/fastQCscreenshots/Adapter_Content.png){width="800"}
+
+
+## Read filtering
+It is tempting to **filter** the data to get “good reads” and **discard** "bad reads"
+including: 
+
+- Reads with low quality alignments
+- Reads suspected to be PCR duplicates
+
+:warning: **HOWEVER**
+
+  ==Discarding reads is a radical decision because it changes the counts==
+
+- [ ] Why low quality reads should be skipped if they were aligned ?
+
+      In the case of RNA profiling, we definitely do not advise to remove reads with low
+      quality, *unless* this low quality impacts significantly the number of read effectively
+      aligned to the reference genome. Generally, low quality reads are not randomly distributed,
+      but, for instance, associated with genes with high GC content. In this case, removing
+      low quality reads may skew your counts in favor of AT rich transcripts.
+
+
+- [ ] When we remove PCR duplicates (exact same sequence and exact same location), we are
+      never 100% sure that we are removing real PCR duplicates, unless a specific bar coding
+      system was used (which is done in single-cell RNAseq protocols).
+
+      Of course, over amplification of libraries must be corrected by removing duplicated
+      sequences. But this can only be done after careful examination of the metrics to
+      support this diagnosis.
+
+      On another line, ultra deep sequencing of a small genome (or its RNAs) will inevitably cause
+      the apparition of duplicated sequences. We can view this trend as a "saturation process",
+      well known of geneticist. It may be difficult to deconvolve this expected trend
+      from bias due to over amplification of libraries.
+
+
+
diff --git a/docs/bulk_RNAseq-IOC/16_strandness.md b/docs/bulk_RNAseq-IOC/16_strandness.md
@@ -86,7 +86,7 @@ for you to use it once.
 
 ## ![](images/tool_small.png){width="30" align="absbottom"} Use of `Infer Experiment` tool
 
-Unfortunalty, the `Infer Experiment` tool does not with annotations in GTF format, but rather
+Unfortunalty, the `Infer Experiment` tool does not work with annotations in GTF format, but rather
 in another format: the BED12 format. No worries, there is a converter tool for this.
 
 !!! info "![](images/tool_small.png){width="25" align="absbottom"} `Convert GTF to BED12` settings"

diff --git a/docs/bulk_RNAseq-IOC/18-1_UCSC_visualisation.md b/docs/bulk_RNAseq-IOC/18-1_UCSC_visualisation.md
@@ -101,14 +101,30 @@ If you deploy the collection and its first dataset, you should notice that
 
 ## IGV  ![](images/oeil.png)
 
-To use IGV with galaxy you need to have this tool on your computer. (If not you can download IGV from [their main site](http://software.broadinstitute.org/software/igv/).)
-
-1. Open locally IGV
-2. click the IGV local ling as indicated by the red arrow.
-
-![](images/igv.png)
-
-Zoom to chr4:540,000-560,000 (Chromosome 4 between 540 kb to 560 kb)
+To use IGV with galaxy you need to have this tool on your computer. (If not, you can download
+IGV from [their main site](https://igv.org/doc/desktop/#DownloadPage/).)
+
+- [x] Open locally IGV
+- [x] Click again on the "histogram icon"
+
+  <center>![](images/collection_for_ucsc_visu.png){width="200"}</center>
+
+- [x] This time, instead of "display at UCSC (main , test )", click the
+    `local` link in the line
+
+    `2. display with IGV ( local , Mouse mm10 )`
+
+    ![](images/igv.png){width="300"}
+
+- [x] :warning: Depending on the size of the bam file you visualise, it can takes several
+  minutes before data are effectively displayed in the IGV browser.
+- [x] As a last note: IGV has become a powerful local genome browser with
+  integration of remote Galaxy datasets. However the learning curve of IGV is flat at the
+  beginning since it requires a general understanding of network as well as Input/Output
+  mechanisms in Information Technologie. Then, when these issues are mastered, the slope
+  of the IGV LC increase significantly since the Graphical Interface of IGV is in contrast
+  rather intuitive.
+
+    :nerd_face: Hang in there, it's definitely worth it ! 
 
 ----
-![](images/gene_counting.png)
diff --git a/docs/bulk_RNAseq-IOC/19_exercices_week_02_review.md b/docs/bulk_RNAseq-IOC/19_exercices_week_02_review.md
@@ -1,30 +1,9 @@
-## Issues with Slack ?
+## Issue with Library strandness ?
 
-## Issues with GitHub ?
-- [x] Does everyone have a GitHub ID ? 
-- [x] Was everyone able to create a readme file and make a pull request to the repository
-      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
-- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
-      generated during the first online meeting, with an extension .ga) and to add it in
-      the repository
-      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
+## Issue with HiSAT2 ?
 
-## Data upload in PSILO, then in Galaxy from Psilo
-- [x] Did everyone upload the necessary data in its
-      [PSILO account](https://psilo.sorbonne-universite.fr) ?
-- [x] Did everyone succeed to create direct download links ? 
-- [x] Did everyone succeed to transfer its PSILO data into a Galaxy story `Input dataset`
-      in its Galaxy account ?
+## Issue with STAR ?
 
-## Issues following the Galaxy training ?
+## Issue with UCSC visualisation ?
 
-[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)
-
-- Check whether `Relabel identifiers` tool is understood
-
-- Check whether `Extract element identifiers` tool is understood. Is the output dataset
-  from this tool uploaded in the appropriate GitHub folder ?
-
-## Check input datasets histories of the participants
-
-... and their ability to create appropriate collection for the analysis
+## Issue with IGV visualisation ?
diff --git a/docs/bulk_RNAseq-IOC/20_intro_counting.md b/docs/bulk_RNAseq-IOC/20_intro_counting.md
@@ -28,11 +28,19 @@ From the image above, we can compute:
 Two main tools could be used for that: HTSeq-count
 ([Anders et al, Bioinformatics, 2015](https://academic.oup.com/bioinformatics/article/31/2/166/2366196))
 or featureCounts ([Liao et al, Bioinformatics, 2014](https://academic.oup.com/bioinformatics/article/31/2/166/2366196)).
-FeatureCounts is considerably faster and requires far less computational resources,
-so we will use it here.
 
-In principle, the counting of reads overlapping with genomic features is a fairly
-simple task. But there are some details that need to be given to featureCounts:
-for example the strandness.
-
-![](images/sequencing_strategies.png)
+FeatureCounts is considerably faster and requires far less computational resources.
+
+HTSeq-counts was originally developed by Simon Anders (the developer of DESeq2 and DEXSeq).
+Thus, there is certainly a guaranty of quality. However, it is less easy to use and requires
+several additional steps, in particular to carefully control the GTF/GFF files taken as
+an input by HTSeq-counts.
+
+!!! note "Three important points"
+    - [x] Here again (in the counting task), the genome annotation file, either a GTF or a GFF3, is
+    central, regardless of the chosen counting software.
+    - [x] The library strandness is absolutly required for an accurate gene read counting.
+    - [x] If you are working with paired-end sequencing datasets, you will not count reads.
+      Instead, you will count ==fragments==. If you count reads in this situation, you will
+      overestimate gene expressions !
+---