-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #101 from ARTbio/aftercounting
refresh review exercises week 3
- Loading branch information
Showing
7 changed files
with
395 additions
and
128 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +1,7 @@ | ||
## Issues with Slack ? | ||
## Issue with FeatureCounts ? | ||
|
||
## Issues with GitHub ? | ||
- [x] Does everyone have a GitHub ID ? | ||
- [x] Was everyone able to create a readme file and make a pull request to the repository | ||
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ? | ||
- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have | ||
generated during the first online meeting, with an extension .ga) and to add it in | ||
the repository | ||
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ? | ||
## Issue with htseq-count ? | ||
|
||
## Data upload in PSILO, then in Galaxy from Psilo | ||
- [x] Did everyone upload the necessary data in its | ||
[PSILO account](https://psilo.sorbonne-universite.fr) ? | ||
- [x] Did everyone succeed to create direct download links ? | ||
- [x] Did everyone succeed to transfer its PSILO data into a Galaxy story `Input dataset` | ||
in its Galaxy account ? | ||
## Issue with STAR counts ? | ||
|
||
## Issues following the Galaxy training ? | ||
|
||
[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html) | ||
|
||
- Check whether `Relabel identifiers` tool is understood | ||
|
||
- Check whether `Extract element identifiers` tool is understood. Is the output dataset | ||
from this tool uploaded in the appropriate GitHub folder ? | ||
|
||
## Check input datasets histories of the participants | ||
|
||
... and their ability to create appropriate collection for the analysis | ||
## Issue STAR count formating ? |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,36 +1,107 @@ | ||
# Analysis of the differential gene expression using `DESeq2` | ||
# ![](images/lamp.png){align="absbottom} Statistical analysis of Differential Gene Expression | ||
|
||
![](images/lamp.png) | ||
## The basic ideas of Differential expression analysis | ||
|
||
---- | ||
### Gene Counts are Observations of Variables | ||
We consider that each gene is a ==variable==. Thus DE analysis is dealing with | ||
==multiple== variables (10 to several tens of thousands). | ||
|
||
DESeq2 is a great tool for Differential Gene Expression (DGE) analysis. | ||
It takes read counts and combines them into a table (with genes in the rows and samples in the columns). | ||
Importantly, it applies size factor normalization by: | ||
Each read count is an ==observation== of these variables. Thus, for instance, if your | ||
experiment is based on biological triplicates in two conditions, you have three | ||
observation of your gene variables under 2 different conditions (6 observations in total). | ||
|
||
- Computing for each gene the geometric mean of read counts across all samples | ||
- Dividing every gene count by the geometric mean accross samples | ||
- Using the median of these ratios as a sample’s size factor for normalization | ||
### Testing | ||
When we say that we are testing for differential expression, we mean that we are | ||
performing multiple statistical tests, ==one for each== gene variable. These tests are | ||
well established mathematical treatments such as the Student Test (t-test), the | ||
Mann-Whitney U test, the Wilcoxon test or the exact Fisher test, for the most used tests. | ||
However, note that not all these tests are suitable for discrete count variables. | ||
|
||
Multiple factors with several levels can then be incorporated in the analysis. | ||
After normalization we can compare the response of the expression of any gene to | ||
the presence of different levels of a factor in a statistically reliable way. | ||
### Conditions for testing | ||
As you probably learned during your university studies each of these tests have underlying | ||
assumptions. The parametric tests require the a-priori knowledge of parameters (mean, | ||
variance, etc.), or that the distribution of the tested variable follows a specific law | ||
(normality, continuity, etc.). For instance the parametric student test requires that the | ||
means of observation is normally distributed, which is the case if the number of | ||
observation is > 30 (Central Limit Theorem) or if the variable is normally distributed | ||
(which is not the case for a read count variable !) | ||
|
||
In our example, we have samples with two varying factors that can contribute to | ||
differences in gene expression: | ||
### Read counts from NGS sequencing are not normally distributed | ||
|
||
- Treatment (either treated or untreated) | ||
- Sequencing type (paired-end or single-end) | ||
In contrast to the intensity of a probe in a microarray, a read count variable is does not | ||
follow and gaussian (normal) distribution ! Statistician showed that read counts | ||
variables, which are discrete variable, follow a generalized Poisson distribution, which | ||
can be approximated by an inverse binomial distribution (also referred to as negative | ||
binomial, NB) when the number of observations is low. | ||
|
||
Here, treatment is the primary factor that we are interested in. | ||
From this, it comes that the main tools for NGS DE analysis model read count variables with | ||
a Poisson (Limma) or a NB law, and use specific tests for differential expression. | ||
|
||
The sequencing type is further information we know about the data that might affect | ||
the analysis. Multi-factor analysis allows us to assess the effect of the treatment, | ||
while taking the sequencing type into account too. | ||
Note that these tests are ==parametric test==, which implies that the Mean and the Variance | ||
must be approximated before the tests. In the case of the Negative Binomial distribution, | ||
Mean == Variance. | ||
|
||
``` | ||
We recommend that you add as many factors as you think may affect gene expression in | ||
your experiment. It can be the sequencing type like here, but it can also be the | ||
manipulation (if different persons are involved in the library preparation), | ||
other batch effects, etc… | ||
``` | ||
### Shared information between read count variables | ||
|
||
Although differential Expression analysis is based on the assumption that gene expression | ||
variables are **independent**, it happens that these variables share information which can be | ||
used for better modeling of test parameters *for each test*. | ||
|
||
Thus, the main benefit from using Limma, DESeq or edgeR packages is this modeling operation | ||
which improves the accuracy of the statistical tests for differential expression. | ||
|
||
### Multiple Testing | ||
|
||
Each test for DE gives rise to a p-value, which is the probability of wrongly rejecting | ||
hypothesis H0 which is, remember, that there is no difference in gene expression in view | ||
of the observations of the number of reads. | ||
|
||
For instance, when you read p = 0.05, this implies that the gene is differentially | ||
expressed, with the probability that this conclusion is false being < 0.05. This is the | ||
type I error. | ||
|
||
However, transcriptome analysis implies thousands of tests. It happens that these tests | ||
also follow a statistical law ! Even if a given test returns a p-value < 0.05, there is, | ||
in addition a probability that this test was wrong ! | ||
|
||
Thus, when you perform thousands of test, you know that a proportion of these tests will | ||
return wrong p-values. | ||
|
||
The adjustment of the p-values seeds from this situation: you **must** correct your | ||
p-values for multiple testing, because in the context of dozen of thousands tests, | ||
p-values are poor indicators and does not allow to control the False Discovery Rate (type | ||
I error) | ||
|
||
Several methods exist for this correction. The Bonferroni correction is popular and | ||
relatively conservative, whereas the Benjamini and Hochberg correction, which controls a | ||
priori the False Discovery Rate is considered as less stringent. We can also cite other | ||
methods that are not widely used in Biology such as the Bonferroni Step-down (Holm) | ||
correction or the Westfall and Young Permutation. | ||
|
||
### Normalization | ||
|
||
Last, but certainly not the least, to test a read count variable for differential | ||
expression, a Normalization operation must be performed, since different sequencing depth | ||
lead to different estimation of gene expression ! | ||
|
||
This Normalisation is performed differently by the Limma, DESeq or edgeR packages, which | ||
is responsible a significant part of the differences between the packages. | ||
|
||
### R packages used in this companionship | ||
|
||
We are going to use most popular R packages DESeq2 and edgeR and it will be interesting to | ||
compare the results returned by both packages. We will also try to give a shot to Limma | ||
which was a very popular packages for analysing microarray results. Interestingly, Limma | ||
has evolved and incorpores several methods to adapt to the more recent NGS RNAseq results. | ||
|
||
### References | ||
|
||
- **DESeq2**: Anders and Huber, Genome Biology 2010, 11:R106 | ||
[DOI](https://doi.org/10.1186/gb-2010-11-10-r106) | ||
- **edgeR**: Robinson, McCarthy and Smyth, | ||
Bioinformatics 2010, 26 p 139 [DOI](https://doi.org/10.1093/bioinformatics/btp616) | ||
- **Limma**: Ritchie, Phipson, Wu, Hu, Law, Shi, et al., Nucleic Acids Res. 2015;43: e47. | ||
[DOI](https://doi.org/10.1093/nar/gkv007) | ||
|
||
|
||
--- |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,36 +1,104 @@ | ||
![](images/galaxylogo.png) | ||
|
||
# `DESeq2` | ||
|
||
---- | ||
![](images/tool_small.png) | ||
|
||
1. Let's create a clean fresh history (`wheel` --> `Create New`) and name it DESeq2 ![](images/wheel.png) | ||
2. Copy the `.Counts`datasets from your `STAR`/ `HISAT2` history to this new history | ||
(`wheel` --> `Copy datasets`) | ||
3. Select the `DESeq2` tool with the following parameters: | ||
1. `how`: Select group tags corresponding to levels | ||
2. In `Factor`: | ||
1. In `1: Factor` | ||
- `Specify a factor name`: Treatment | ||
- In `Factor level`: | ||
- In `1: Factor level`: | ||
- `Specify a factor level`: treated | ||
- `Counts file(s)`: the 3 gene count files with `treat` in their name | ||
- In `2: Factor level`: | ||
- `Specify a factor level`: untreated | ||
- `Counts file(s)`: the 4 gene count files with `untreat` in their name | ||
2. Click on `Insert Factor` (not on `Insert Factor level`) | ||
3. In `2: Factor` | ||
- `Specify a factor name` to Sequencing | ||
- In `Factor level`: | ||
- In `1: Factor level`: | ||
- `Specify a factor level`: Paired | ||
- `Counts file(s)`: the 4 gene count files with `paired` in their name | ||
- In `2: Factor level`: | ||
- `Specify a factor level`: Single | ||
- `Counts file(s)`: the 3 gene count files with `single` in their name | ||
3. `Files have header?`: Yes | ||
4. `Output normalized counts table`: Yes | ||
5. `Execute` | ||
|
||
# Analysis of differential gene expression in PRJNA630433 using `DESeq2` | ||
|
||
## DESeq2 Analysis | ||
|
||
To begin, navigate to the history `PRJNA630433 FeatureCounts Counting on HISAT2 bam | ||
alignments` and copy the three dataset collections of counts generated by FeatureCounts: | ||
`Dc FeatureCounts counts`, `Mo FeatureCounts counts` and `Oc FeatureCounts counts` into a | ||
new history that you will name `PRJNA630433 DESeq2 analysis` | ||
|
||
Then, search for `DESeq2` in the tool search bar | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `DESeq2` settings" | ||
- how | ||
|
||
--> Select datasets per levels | ||
- 1: Factor | ||
|
||
--> Tissue | ||
- 1: Factor level | ||
|
||
Note that there will be three factor levels in this analysis: Dc, Mo and Oc. | ||
|
||
--> Oc | ||
|
||
- Counts file(s) | ||
|
||
--> select the data collection icon, then `15: Oc FeatureCounts counts` | ||
- 2: Factor level | ||
|
||
--> Mo | ||
|
||
- Counts file(s) | ||
|
||
--> select the data collection icon, then `10: Mo FeatureCounts counts` | ||
- 3: Factor level (you must click on :heavy_plus_sign: `Insert Factor level`) | ||
|
||
--> Dc | ||
|
||
- Counts file(s) | ||
|
||
--> select the data collection icon, then `5: Mo FeatureCounts counts` | ||
- (Optional) provide a tabular file with additional batch factors to include in the model. | ||
|
||
--> Leave to `Nothing selected` | ||
- Files have header? | ||
|
||
--> Yes | ||
- Choice of Input data | ||
|
||
--> Count data | ||
- Advanced options | ||
|
||
--> No, leave folded | ||
- Output options | ||
|
||
--> Unfold and check `Output all levels vs all levels of primary factor (use when | ||
you have >2 levels for primary factor)` in addition to the already checked | ||
`Generate plots for visualizing the analysis results` | ||
|
||
--> Leave `Alpha value for MA-plot` to 0,1: note that this option is used for | ||
plots and does not impact DESeq2 results | ||
- `Run Tool` | ||
|
||
??? warning "Note on the order of Factors levels in the DESeq2 html form" | ||
As specified in the help section of the DESeq2 html form, the order of the Factors | ||
levels matters ! See why in that section. | ||
|
||
In a nutshell, the Factor level you put as last in the form, will be taken as the | ||
reference Factor level. | ||
|
||
Thus in our use case, the condition `Mo` will serve as reference condition for | ||
differential gene expression in the DESeq2 analysis. | ||
|
||
## Inspect DESeq2 plots | ||
|
||
There is a lot of information here which we will discuss online or in live | ||
|
||
## Add a missing header to DESeq2 tabular outputs | ||
|
||
If you have a look to three datasets in the collection `DESeq2 result files on data 4, | ||
data 3, and others`, you'll see that a header indicating what is the content of the 7 | ||
columns is missing. This lack of header is inconfortable when you are not very familiar | ||
with DE analyses. | ||
|
||
Indeed, this header should be | ||
``` | ||
GeneID Base_mean log2FC StdErr Wald-Stats P-value P-adj | ||
``` | ||
|
||
Fortunately, there is a nice tool in Galaxy to quickly add this header. | ||
|
||
!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Add Header` settings" | ||
- List of Column headers (comma delimited, e.g. C1,C2,...) | ||
|
||
--> `GeneID,Base_mean,log2FC,StdErr,Wald-Stats,P-value,P-adj` | ||
- Data File (tab-delimted) | ||
|
||
--> Select the data collection icon, then `DESeq2 result files on data 4, data 3, | ||
and others` | ||
- `Run Tool` | ||
|
||
:warning: Rename the new collection `DESeq2 Results Tables` | ||
|
||
--- |
Oops, something went wrong.