Merge pull request #101 from ARTbio/aftercounting

refresh review exercises week 3
ARTbio · Feb 5, 2024 · a063047 · a063047
2 parents de8bcf7 + 9d80dd3
commit a063047
Show file tree

Hide file tree

Showing 7 changed files with 395 additions and 128 deletions.
diff --git a/docs/bulk_RNAseq-IOC/21_FeatureCounts.md b/docs/bulk_RNAseq-IOC/21_FeatureCounts.md
@@ -47,7 +47,7 @@ and
     - Advanced options
 
         --> Leave folded, no advanced options !
-    - `Execute`
+    - `Run Tool`
 
 ### Repeat the exact same operation twice for the collections Mo and Oc HISAT2 alignments
 

diff --git a/docs/bulk_RNAseq-IOC/24_exercices_week_03_review.md b/docs/bulk_RNAseq-IOC/24_exercices_week_03_review.md
@@ -1,30 +1,7 @@
-## Issues with Slack ?
+## Issue with FeatureCounts ?
 
-## Issues with GitHub ?
-- [x] Does everyone have a GitHub ID ? 
-- [x] Was everyone able to create a readme file and make a pull request to the repository
-      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
-- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
-      generated during the first online meeting, with an extension .ga) and to add it in
-      the repository
-      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
+## Issue with htseq-count ?
 
-## Data upload in PSILO, then in Galaxy from Psilo
-- [x] Did everyone upload the necessary data in its
-      [PSILO account](https://psilo.sorbonne-universite.fr) ?
-- [x] Did everyone succeed to create direct download links ? 
-- [x] Did everyone succeed to transfer its PSILO data into a Galaxy story `Input dataset`
-      in its Galaxy account ?
+## Issue with STAR counts ?
 
-## Issues following the Galaxy training ?
-
-[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)
-
-- Check whether `Relabel identifiers` tool is understood
-
-- Check whether `Extract element identifiers` tool is understood. Is the output dataset
-  from this tool uploaded in the appropriate GitHub folder ?
-
-## Check input datasets histories of the participants
-
-... and their ability to create appropriate collection for the analysis
+## Issue STAR count formating ?
diff --git a/docs/bulk_RNAseq-IOC/25_DE_intro.md b/docs/bulk_RNAseq-IOC/25_DE_intro.md
@@ -1,36 +1,107 @@
-# Analysis of the differential gene expression using `DESeq2`
+# ![](images/lamp.png){align="absbottom} Statistical analysis of Differential Gene Expression
 
-![](images/lamp.png)
+## The basic ideas of Differential expression analysis
 
-----
+### Gene Counts are Observations of Variables
+We consider that each gene is a ==variable==. Thus DE analysis is dealing with
+==multiple== variables (10 to several tens of thousands).
 
-DESeq2 is a great tool for Differential Gene Expression (DGE) analysis.
-It takes read counts and combines them into a table (with genes in the rows and samples in the columns).
-Importantly, it applies size factor normalization by:
+Each read count is an ==observation== of these variables. Thus, for instance, if your
+experiment is based on biological triplicates in two conditions, you have three
+observation of your gene variables under 2 different conditions (6 observations in total).
 
-- Computing for each gene the geometric mean of read counts across all samples
-- Dividing every gene count by the geometric mean accross samples
-- Using the median of these ratios as a sample’s size factor for normalization
+### Testing
+When we say that we are testing for differential expression, we mean that we are
+performing multiple statistical tests, ==one for each== gene variable. These tests are
+well established mathematical treatments such as the Student Test (t-test), the
+Mann-Whitney U test, the Wilcoxon test or the exact Fisher test, for the most used tests.
+However, note that not all these tests are suitable for discrete count variables.
 
-Multiple factors with several levels can then be incorporated in the analysis.
-After normalization we can compare the response of the expression of any gene to
-the presence of different levels of a factor in a statistically reliable way.
+### Conditions for testing
+As you probably learned during your university studies each of these tests have underlying
+assumptions. The parametric tests require the a-priori knowledge of parameters (mean,
+variance, etc.), or that the distribution of the tested variable follows a specific law
+(normality, continuity, etc.). For instance the parametric student test requires that the
+means of observation is normally distributed, which is the case if the number of
+observation is > 30 (Central Limit Theorem) or if the variable is normally distributed
+(which is not the case for a read count variable !)
 
-In our example, we have samples with two varying factors that can contribute to
-differences in gene expression:
+### Read counts from NGS sequencing are not normally distributed
 
-- Treatment (either treated or untreated)
-- Sequencing type (paired-end or single-end)
+In contrast to the intensity of a probe in a microarray, a read count variable is does not
+follow and gaussian (normal) distribution ! Statistician showed that read counts
+variables, which are discrete variable, follow a generalized Poisson distribution, which
+can be approximated by an inverse binomial distribution (also referred to as negative
+binomial, NB) when the number of observations is low.
 
-Here, treatment is the primary factor that we are interested in.
+From this, it comes that the main tools for NGS DE analysis model read count variables with
+a Poisson (Limma) or a NB law, and use specific tests for differential expression.
 
-The sequencing type is further information we know about the data that might affect
-the analysis. Multi-factor analysis allows us to assess the effect of the treatment,
-while taking the sequencing type into account too.
+Note that these tests are ==parametric test==, which implies that the Mean and the Variance
+must be approximated before the tests. In the case of the Negative Binomial distribution,
+Mean == Variance.
 
-```
-We recommend that you add as many factors as you think may affect gene expression in
-your experiment. It can be the sequencing type like here, but it can also be the
-manipulation (if different persons are involved in the library preparation),
-other batch effects, etc…
-```
+### Shared information between read count variables
+
+Although differential Expression analysis is based on the assumption that gene expression
+variables are **independent**, it happens that these variables share information which can be
+used for better modeling of test parameters *for each test*.
+
+Thus, the main benefit from using Limma, DESeq or edgeR packages is this modeling operation
+which improves the accuracy of the statistical tests for differential expression.
+
+### Multiple Testing
+
+Each test for DE gives rise to a p-value, which is the probability of wrongly rejecting
+hypothesis H0 which is, remember, that there is no difference in gene expression in view
+of the observations of the number of reads.
+
+For instance, when you read p = 0.05, this implies that the gene is differentially
+expressed, with the probability that this conclusion is false being < 0.05. This is the
+type I error.
+
+However, transcriptome analysis implies thousands of tests. It happens that these tests
+also follow a statistical law ! Even if a given test returns a p-value < 0.05, there is,
+in addition a probability that this test was wrong !
+
+Thus, when you perform thousands of test, you know that a proportion of these tests will
+return wrong p-values.
+
+The adjustment of the p-values seeds from this situation: you **must** correct your
+p-values for multiple testing, because in the context of dozen of thousands tests,
+p-values are poor indicators and does not allow to control the False Discovery Rate (type
+I error)
+
+Several methods exist for this correction. The Bonferroni correction is popular and
+relatively conservative, whereas the Benjamini and Hochberg correction, which controls a
+priori the False Discovery Rate is considered as less stringent. We can also cite other
+methods that are not widely used in Biology such as the  Bonferroni Step-down (Holm)
+correction or the Westfall and Young Permutation.
+
+### Normalization
+
+Last, but certainly not the least, to test a read count variable for differential
+expression, a Normalization operation must be performed, since different sequencing depth
+lead to different estimation of gene expression !
+
+This Normalisation is performed differently by the Limma, DESeq or edgeR packages, which
+is responsible a significant part of the differences between the packages.
+
+### R packages used in this companionship
+
+We are going to use most popular R packages DESeq2 and edgeR and it will be interesting to
+compare the results returned by both packages. We will also try to give a shot to Limma
+which was a very popular packages for analysing microarray results. Interestingly, Limma
+has evolved and incorpores several methods to adapt to the more recent NGS RNAseq results.
+
+### References
+
+- **DESeq2**: Anders and Huber, Genome Biology 2010, 11:R106
+  [DOI](https://doi.org/10.1186/gb-2010-11-10-r106)
+- **edgeR**: Robinson, McCarthy and Smyth,
+  Bioinformatics 2010, 26 p 139 [DOI](https://doi.org/10.1093/bioinformatics/btp616)
+- **Limma**: Ritchie, Phipson, Wu, Hu, Law, Shi, et al., Nucleic Acids Res. 2015;43: e47.
+  [DOI](https://doi.org/10.1093/nar/gkv007)
+
+
+---
diff --git a/docs/bulk_RNAseq-IOC/26_deseq2.md b/docs/bulk_RNAseq-IOC/26_deseq2.md
@@ -1,36 +1,104 @@
-![](images/galaxylogo.png)
-
-# `DESeq2`
-
-----
-![](images/tool_small.png)
-
-  1. Let's create a clean fresh history (`wheel` --> `Create New`) and name it DESeq2 ![](images/wheel.png)
-  2. Copy the `.Counts`datasets from your `STAR`/ `HISAT2` history to this new history
-  (`wheel` --> `Copy datasets`)
-  3. Select the `DESeq2` tool with the following parameters:
-      1. `how`: Select group tags corresponding to levels
-      2. In `Factor`:
-          1. In `1: Factor`
-              - `Specify a factor name`: Treatment
-              - In `Factor level`:
-                  - In `1: Factor level`:
-                      - `Specify a factor level`: treated
-                      - `Counts file(s)`: the 3 gene count files with `treat` in their name
-                  - In `2: Factor level`:
-                      - `Specify a factor level`: untreated
-                      - `Counts file(s)`: the 4 gene count files with `untreat` in their name
-          2. Click on `Insert Factor` (not on `Insert Factor level`)
-          3. In `2: Factor`
-              - `Specify a factor name` to Sequencing
-              - In `Factor level`:
-                  - In `1: Factor level`:
-                      - `Specify a factor level`: Paired
-                      - `Counts file(s)`: the 4 gene count files with `paired` in their name
-                  - In `2: Factor level`:
-                      - `Specify a factor level`: Single
-                      - `Counts file(s)`: the 3 gene count files with `single` in their name
-      3. `Files have header?`: Yes
-      4. `Output normalized counts table`: Yes
-      5. `Execute`
-
+# Analysis of differential gene expression in PRJNA630433 using `DESeq2`
+
+## DESeq2 Analysis
+
+To begin, navigate to the history `PRJNA630433 FeatureCounts Counting on HISAT2 bam
+alignments` and copy the three dataset collections of counts generated by FeatureCounts:
+`Dc FeatureCounts counts`, `Mo FeatureCounts counts` and `Oc FeatureCounts counts` into a
+new history that you will name `PRJNA630433 DESeq2 analysis`
+
+Then, search for `DESeq2` in the tool search bar
+
+!!! info "![](images/tool_small.png){width="25" align="absbottom"} `DESeq2` settings"
+    - how
+
+        --> Select datasets per levels
+    - 1: Factor
+
+        --> Tissue
+    - 1: Factor level
+
+        Note that there will be three factor levels in this analysis: Dc, Mo and Oc.
+
+        --> Oc
+
+    - Counts file(s)
+
+        --> select the data collection icon, then `15: Oc FeatureCounts counts`
+    - 2: Factor level
+
+        --> Mo
+
+    - Counts file(s)
+
+        --> select the data collection icon, then `10: Mo FeatureCounts counts`
+    - 3: Factor level (you must click on :heavy_plus_sign: `Insert Factor level`)
+
+        --> Dc
+
+    - Counts file(s)
+
+        --> select the data collection icon, then `5: Mo FeatureCounts counts`
+    - (Optional) provide a tabular file with additional batch factors to include in the model.
+
+        --> Leave to `Nothing selected`
+    - Files have header?
+
+        --> Yes
+    - Choice of Input data
+
+        --> Count data
+    - Advanced options
+
+        --> No, leave folded
+    - Output options
+
+        --> Unfold and check `Output all levels vs all levels of primary factor (use when
+        you have >2 levels for primary factor)` in addition to the already checked
+        `Generate plots for visualizing the analysis results`
+
+        --> Leave `Alpha value for MA-plot` to 0,1: note that this option is used for
+        plots and does not impact DESeq2 results
+    - `Run Tool`
+
+??? warning "Note on the order of Factors levels in the DESeq2 html form"
+    As specified in the help section of the DESeq2 html form, the order of the Factors
+    levels matters ! See why in that section.
+
+    In a nutshell, the Factor level you put as last in the form, will be taken as the
+    reference Factor level.
+
+    Thus in our use case, the condition `Mo` will serve as reference condition for
+    differential gene expression in the DESeq2 analysis.
+
+## Inspect DESeq2 plots
+
+There is a lot of information here which we will discuss online or in live
+
+## Add a missing header to DESeq2 tabular outputs
+
+If you have a look to three datasets in the collection `DESeq2 result files on data 4,
+data 3, and others`, you'll see that a header indicating what is the content of the 7
+columns is missing. This lack of header is inconfortable when you are not very familiar
+with DE analyses.
+
+Indeed, this header should be
+```
+GeneID	Base_mean	log2FC	StdErr	Wald-Stats	P-value	P-adj
+```
+
+Fortunately, there is a nice tool in Galaxy to quickly add this header.
+
+!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Add Header` settings"
+    - List of Column headers (comma delimited, e.g. C1,C2,...) 
+
+        --> `GeneID,Base_mean,log2FC,StdErr,Wald-Stats,P-value,P-adj`
+    - Data File (tab-delimted)
+
+        --> Select the data collection icon, then `DESeq2 result files on data 4, data 3,
+        and others`
+    - `Run Tool`
+
+:warning: Rename the new collection `DESeq2 Results Tables`
+
+---