Skip to content

Commit

Permalink
Merge pull request #101 from ARTbio/aftercounting
Browse files Browse the repository at this point in the history
refresh review exercises week 3
  • Loading branch information
drosofff authored Feb 5, 2024
2 parents de8bcf7 + 9d80dd3 commit a063047
Show file tree
Hide file tree
Showing 7 changed files with 395 additions and 128 deletions.
2 changes: 1 addition & 1 deletion docs/bulk_RNAseq-IOC/21_FeatureCounts.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,7 @@ and
- Advanced options

--> Leave folded, no advanced options !
- `Execute`
- `Run Tool`

### Repeat the exact same operation twice for the collections Mo and Oc HISAT2 alignments

Expand Down
31 changes: 4 additions & 27 deletions docs/bulk_RNAseq-IOC/24_exercices_week_03_review.md
Original file line number Diff line number Diff line change
@@ -1,30 +1,7 @@
## Issues with Slack ?
## Issue with FeatureCounts ?

## Issues with GitHub ?
- [x] Does everyone have a GitHub ID ?
- [x] Was everyone able to create a readme file and make a pull request to the repository
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
generated during the first online meeting, with an extension .ga) and to add it in
the repository
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
## Issue with htseq-count ?

## Data upload in PSILO, then in Galaxy from Psilo
- [x] Did everyone upload the necessary data in its
[PSILO account](https://psilo.sorbonne-universite.fr) ?
- [x] Did everyone succeed to create direct download links ?
- [x] Did everyone succeed to transfer its PSILO data into a Galaxy story `Input dataset`
in its Galaxy account ?
## Issue with STAR counts ?

## Issues following the Galaxy training ?

[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)

- Check whether `Relabel identifiers` tool is understood

- Check whether `Extract element identifiers` tool is understood. Is the output dataset
from this tool uploaded in the appropriate GitHub folder ?

## Check input datasets histories of the participants

... and their ability to create appropriate collection for the analysis
## Issue STAR count formating ?
123 changes: 97 additions & 26 deletions docs/bulk_RNAseq-IOC/25_DE_intro.md
Original file line number Diff line number Diff line change
@@ -1,36 +1,107 @@
# Analysis of the differential gene expression using `DESeq2`
# ![](images/lamp.png){align="absbottom} Statistical analysis of Differential Gene Expression

![](images/lamp.png)
## The basic ideas of Differential expression analysis

----
### Gene Counts are Observations of Variables
We consider that each gene is a ==variable==. Thus DE analysis is dealing with
==multiple== variables (10 to several tens of thousands).

DESeq2 is a great tool for Differential Gene Expression (DGE) analysis.
It takes read counts and combines them into a table (with genes in the rows and samples in the columns).
Importantly, it applies size factor normalization by:
Each read count is an ==observation== of these variables. Thus, for instance, if your
experiment is based on biological triplicates in two conditions, you have three
observation of your gene variables under 2 different conditions (6 observations in total).

- Computing for each gene the geometric mean of read counts across all samples
- Dividing every gene count by the geometric mean accross samples
- Using the median of these ratios as a sample’s size factor for normalization
### Testing
When we say that we are testing for differential expression, we mean that we are
performing multiple statistical tests, ==one for each== gene variable. These tests are
well established mathematical treatments such as the Student Test (t-test), the
Mann-Whitney U test, the Wilcoxon test or the exact Fisher test, for the most used tests.
However, note that not all these tests are suitable for discrete count variables.

Multiple factors with several levels can then be incorporated in the analysis.
After normalization we can compare the response of the expression of any gene to
the presence of different levels of a factor in a statistically reliable way.
### Conditions for testing
As you probably learned during your university studies each of these tests have underlying
assumptions. The parametric tests require the a-priori knowledge of parameters (mean,
variance, etc.), or that the distribution of the tested variable follows a specific law
(normality, continuity, etc.). For instance the parametric student test requires that the
means of observation is normally distributed, which is the case if the number of
observation is > 30 (Central Limit Theorem) or if the variable is normally distributed
(which is not the case for a read count variable !)

In our example, we have samples with two varying factors that can contribute to
differences in gene expression:
### Read counts from NGS sequencing are not normally distributed

- Treatment (either treated or untreated)
- Sequencing type (paired-end or single-end)
In contrast to the intensity of a probe in a microarray, a read count variable is does not
follow and gaussian (normal) distribution ! Statistician showed that read counts
variables, which are discrete variable, follow a generalized Poisson distribution, which
can be approximated by an inverse binomial distribution (also referred to as negative
binomial, NB) when the number of observations is low.

Here, treatment is the primary factor that we are interested in.
From this, it comes that the main tools for NGS DE analysis model read count variables with
a Poisson (Limma) or a NB law, and use specific tests for differential expression.

The sequencing type is further information we know about the data that might affect
the analysis. Multi-factor analysis allows us to assess the effect of the treatment,
while taking the sequencing type into account too.
Note that these tests are ==parametric test==, which implies that the Mean and the Variance
must be approximated before the tests. In the case of the Negative Binomial distribution,
Mean == Variance.

```
We recommend that you add as many factors as you think may affect gene expression in
your experiment. It can be the sequencing type like here, but it can also be the
manipulation (if different persons are involved in the library preparation),
other batch effects, etc…
```
### Shared information between read count variables

Although differential Expression analysis is based on the assumption that gene expression
variables are **independent**, it happens that these variables share information which can be
used for better modeling of test parameters *for each test*.

Thus, the main benefit from using Limma, DESeq or edgeR packages is this modeling operation
which improves the accuracy of the statistical tests for differential expression.

### Multiple Testing

Each test for DE gives rise to a p-value, which is the probability of wrongly rejecting
hypothesis H0 which is, remember, that there is no difference in gene expression in view
of the observations of the number of reads.

For instance, when you read p = 0.05, this implies that the gene is differentially
expressed, with the probability that this conclusion is false being < 0.05. This is the
type I error.

However, transcriptome analysis implies thousands of tests. It happens that these tests
also follow a statistical law ! Even if a given test returns a p-value < 0.05, there is,
in addition a probability that this test was wrong !

Thus, when you perform thousands of test, you know that a proportion of these tests will
return wrong p-values.

The adjustment of the p-values seeds from this situation: you **must** correct your
p-values for multiple testing, because in the context of dozen of thousands tests,
p-values are poor indicators and does not allow to control the False Discovery Rate (type
I error)

Several methods exist for this correction. The Bonferroni correction is popular and
relatively conservative, whereas the Benjamini and Hochberg correction, which controls a
priori the False Discovery Rate is considered as less stringent. We can also cite other
methods that are not widely used in Biology such as the Bonferroni Step-down (Holm)
correction or the Westfall and Young Permutation.

### Normalization

Last, but certainly not the least, to test a read count variable for differential
expression, a Normalization operation must be performed, since different sequencing depth
lead to different estimation of gene expression !

This Normalisation is performed differently by the Limma, DESeq or edgeR packages, which
is responsible a significant part of the differences between the packages.

### R packages used in this companionship

We are going to use most popular R packages DESeq2 and edgeR and it will be interesting to
compare the results returned by both packages. We will also try to give a shot to Limma
which was a very popular packages for analysing microarray results. Interestingly, Limma
has evolved and incorpores several methods to adapt to the more recent NGS RNAseq results.

### References

- **DESeq2**: Anders and Huber, Genome Biology 2010, 11:R106
[DOI](https://doi.org/10.1186/gb-2010-11-10-r106)
- **edgeR**: Robinson, McCarthy and Smyth,
Bioinformatics 2010, 26 p 139 [DOI](https://doi.org/10.1093/bioinformatics/btp616)
- **Limma**: Ritchie, Phipson, Wu, Hu, Law, Shi, et al., Nucleic Acids Res. 2015;43: e47.
[DOI](https://doi.org/10.1093/nar/gkv007)


---
140 changes: 104 additions & 36 deletions docs/bulk_RNAseq-IOC/26_deseq2.md
Original file line number Diff line number Diff line change
@@ -1,36 +1,104 @@
![](images/galaxylogo.png)

# `DESeq2`

----
![](images/tool_small.png)

1. Let's create a clean fresh history (`wheel` --> `Create New`) and name it DESeq2 ![](images/wheel.png)
2. Copy the `.Counts`datasets from your `STAR`/ `HISAT2` history to this new history
(`wheel` --> `Copy datasets`)
3. Select the `DESeq2` tool with the following parameters:
1. `how`: Select group tags corresponding to levels
2. In `Factor`:
1. In `1: Factor`
- `Specify a factor name`: Treatment
- In `Factor level`:
- In `1: Factor level`:
- `Specify a factor level`: treated
- `Counts file(s)`: the 3 gene count files with `treat` in their name
- In `2: Factor level`:
- `Specify a factor level`: untreated
- `Counts file(s)`: the 4 gene count files with `untreat` in their name
2. Click on `Insert Factor` (not on `Insert Factor level`)
3. In `2: Factor`
- `Specify a factor name` to Sequencing
- In `Factor level`:
- In `1: Factor level`:
- `Specify a factor level`: Paired
- `Counts file(s)`: the 4 gene count files with `paired` in their name
- In `2: Factor level`:
- `Specify a factor level`: Single
- `Counts file(s)`: the 3 gene count files with `single` in their name
3. `Files have header?`: Yes
4. `Output normalized counts table`: Yes
5. `Execute`

# Analysis of differential gene expression in PRJNA630433 using `DESeq2`

## DESeq2 Analysis

To begin, navigate to the history `PRJNA630433 FeatureCounts Counting on HISAT2 bam
alignments` and copy the three dataset collections of counts generated by FeatureCounts:
`Dc FeatureCounts counts`, `Mo FeatureCounts counts` and `Oc FeatureCounts counts` into a
new history that you will name `PRJNA630433 DESeq2 analysis`

Then, search for `DESeq2` in the tool search bar

!!! info "![](images/tool_small.png){width="25" align="absbottom"} `DESeq2` settings"
- how

--> Select datasets per levels
- 1: Factor

--> Tissue
- 1: Factor level

Note that there will be three factor levels in this analysis: Dc, Mo and Oc.

--> Oc

- Counts file(s)

--> select the data collection icon, then `15: Oc FeatureCounts counts`
- 2: Factor level

--> Mo

- Counts file(s)

--> select the data collection icon, then `10: Mo FeatureCounts counts`
- 3: Factor level (you must click on :heavy_plus_sign: `Insert Factor level`)

--> Dc

- Counts file(s)

--> select the data collection icon, then `5: Mo FeatureCounts counts`
- (Optional) provide a tabular file with additional batch factors to include in the model.

--> Leave to `Nothing selected`
- Files have header?

--> Yes
- Choice of Input data

--> Count data
- Advanced options

--> No, leave folded
- Output options

--> Unfold and check `Output all levels vs all levels of primary factor (use when
you have >2 levels for primary factor)` in addition to the already checked
`Generate plots for visualizing the analysis results`

--> Leave `Alpha value for MA-plot` to 0,1: note that this option is used for
plots and does not impact DESeq2 results
- `Run Tool`

??? warning "Note on the order of Factors levels in the DESeq2 html form"
As specified in the help section of the DESeq2 html form, the order of the Factors
levels matters ! See why in that section.

In a nutshell, the Factor level you put as last in the form, will be taken as the
reference Factor level.

Thus in our use case, the condition `Mo` will serve as reference condition for
differential gene expression in the DESeq2 analysis.

## Inspect DESeq2 plots

There is a lot of information here which we will discuss online or in live

## Add a missing header to DESeq2 tabular outputs

If you have a look to three datasets in the collection `DESeq2 result files on data 4,
data 3, and others`, you'll see that a header indicating what is the content of the 7
columns is missing. This lack of header is inconfortable when you are not very familiar
with DE analyses.

Indeed, this header should be
```
GeneID Base_mean log2FC StdErr Wald-Stats P-value P-adj
```

Fortunately, there is a nice tool in Galaxy to quickly add this header.

!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Add Header` settings"
- List of Column headers (comma delimited, e.g. C1,C2,...)

--> `GeneID,Base_mean,log2FC,StdErr,Wald-Stats,P-value,P-adj`
- Data File (tab-delimted)

--> Select the data collection icon, then `DESeq2 result files on data 4, data 3,
and others`
- `Run Tool`

:warning: Rename the new collection `DESeq2 Results Tables`

---
Loading

0 comments on commit a063047

Please sign in to comment.