diff --git a/docs/bulk_RNAseq-IOC/21_FeatureCounts.md b/docs/bulk_RNAseq-IOC/21_FeatureCounts.md index 9257afab5..d373c4945 100644 --- a/docs/bulk_RNAseq-IOC/21_FeatureCounts.md +++ b/docs/bulk_RNAseq-IOC/21_FeatureCounts.md @@ -47,7 +47,7 @@ and - Advanced options --> Leave folded, no advanced options ! - - `Execute` + - `Run Tool` ### Repeat the exact same operation twice for the collections Mo and Oc HISAT2 alignments diff --git a/docs/bulk_RNAseq-IOC/24_exercices_week_03_review.md b/docs/bulk_RNAseq-IOC/24_exercices_week_03_review.md index 031c71454..1cd34f79d 100644 --- a/docs/bulk_RNAseq-IOC/24_exercices_week_03_review.md +++ b/docs/bulk_RNAseq-IOC/24_exercices_week_03_review.md @@ -1,30 +1,7 @@ -## Issues with Slack ? +## Issue with FeatureCounts ? -## Issues with GitHub ? -- [x] Does everyone have a GitHub ID ? -- [x] Was everyone able to create a readme file and make a pull request to the repository - [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ? -- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have - generated during the first online meeting, with an extension .ga) and to add it in - the repository - [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ? +## Issue with htseq-count ? -## Data upload in PSILO, then in Galaxy from Psilo -- [x] Did everyone upload the necessary data in its - [PSILO account](https://psilo.sorbonne-universite.fr) ? -- [x] Did everyone succeed to create direct download links ? -- [x] Did everyone succeed to transfer its PSILO data into a Galaxy story `Input dataset` - in its Galaxy account ? +## Issue with STAR counts ? -## Issues following the Galaxy training ? - -[training to collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html) - -- Check whether `Relabel identifiers` tool is understood - -- Check whether `Extract element identifiers` tool is understood. Is the output dataset - from this tool uploaded in the appropriate GitHub folder ? - -## Check input datasets histories of the participants - -... and their ability to create appropriate collection for the analysis +## Issue STAR count formating ? diff --git a/docs/bulk_RNAseq-IOC/25_DE_intro.md b/docs/bulk_RNAseq-IOC/25_DE_intro.md index c943a2ee3..67a10b244 100644 --- a/docs/bulk_RNAseq-IOC/25_DE_intro.md +++ b/docs/bulk_RNAseq-IOC/25_DE_intro.md @@ -1,36 +1,107 @@ -# Analysis of the differential gene expression using `DESeq2` +# ![](images/lamp.png){align="absbottom} Statistical analysis of Differential Gene Expression -![](images/lamp.png) +## The basic ideas of Differential expression analysis ----- +### Gene Counts are Observations of Variables +We consider that each gene is a ==variable==. Thus DE analysis is dealing with +==multiple== variables (10 to several tens of thousands). -DESeq2 is a great tool for Differential Gene Expression (DGE) analysis. -It takes read counts and combines them into a table (with genes in the rows and samples in the columns). -Importantly, it applies size factor normalization by: +Each read count is an ==observation== of these variables. Thus, for instance, if your +experiment is based on biological triplicates in two conditions, you have three +observation of your gene variables under 2 different conditions (6 observations in total). -- Computing for each gene the geometric mean of read counts across all samples -- Dividing every gene count by the geometric mean accross samples -- Using the median of these ratios as a sample’s size factor for normalization +### Testing +When we say that we are testing for differential expression, we mean that we are +performing multiple statistical tests, ==one for each== gene variable. These tests are +well established mathematical treatments such as the Student Test (t-test), the +Mann-Whitney U test, the Wilcoxon test or the exact Fisher test, for the most used tests. +However, note that not all these tests are suitable for discrete count variables. -Multiple factors with several levels can then be incorporated in the analysis. -After normalization we can compare the response of the expression of any gene to -the presence of different levels of a factor in a statistically reliable way. +### Conditions for testing +As you probably learned during your university studies each of these tests have underlying +assumptions. The parametric tests require the a-priori knowledge of parameters (mean, +variance, etc.), or that the distribution of the tested variable follows a specific law +(normality, continuity, etc.). For instance the parametric student test requires that the +means of observation is normally distributed, which is the case if the number of +observation is > 30 (Central Limit Theorem) or if the variable is normally distributed +(which is not the case for a read count variable !) -In our example, we have samples with two varying factors that can contribute to -differences in gene expression: +### Read counts from NGS sequencing are not normally distributed -- Treatment (either treated or untreated) -- Sequencing type (paired-end or single-end) +In contrast to the intensity of a probe in a microarray, a read count variable is does not +follow and gaussian (normal) distribution ! Statistician showed that read counts +variables, which are discrete variable, follow a generalized Poisson distribution, which +can be approximated by an inverse binomial distribution (also referred to as negative +binomial, NB) when the number of observations is low. -Here, treatment is the primary factor that we are interested in. +From this, it comes that the main tools for NGS DE analysis model read count variables with +a Poisson (Limma) or a NB law, and use specific tests for differential expression. -The sequencing type is further information we know about the data that might affect -the analysis. Multi-factor analysis allows us to assess the effect of the treatment, -while taking the sequencing type into account too. +Note that these tests are ==parametric test==, which implies that the Mean and the Variance +must be approximated before the tests. In the case of the Negative Binomial distribution, +Mean == Variance. -``` -We recommend that you add as many factors as you think may affect gene expression in -your experiment. It can be the sequencing type like here, but it can also be the -manipulation (if different persons are involved in the library preparation), -other batch effects, etc… -``` +### Shared information between read count variables + +Although differential Expression analysis is based on the assumption that gene expression +variables are **independent**, it happens that these variables share information which can be +used for better modeling of test parameters *for each test*. + +Thus, the main benefit from using Limma, DESeq or edgeR packages is this modeling operation +which improves the accuracy of the statistical tests for differential expression. + +### Multiple Testing + +Each test for DE gives rise to a p-value, which is the probability of wrongly rejecting +hypothesis H0 which is, remember, that there is no difference in gene expression in view +of the observations of the number of reads. + +For instance, when you read p = 0.05, this implies that the gene is differentially +expressed, with the probability that this conclusion is false being < 0.05. This is the +type I error. + +However, transcriptome analysis implies thousands of tests. It happens that these tests +also follow a statistical law ! Even if a given test returns a p-value < 0.05, there is, +in addition a probability that this test was wrong ! + +Thus, when you perform thousands of test, you know that a proportion of these tests will +return wrong p-values. + +The adjustment of the p-values seeds from this situation: you **must** correct your +p-values for multiple testing, because in the context of dozen of thousands tests, +p-values are poor indicators and does not allow to control the False Discovery Rate (type +I error) + +Several methods exist for this correction. The Bonferroni correction is popular and +relatively conservative, whereas the Benjamini and Hochberg correction, which controls a +priori the False Discovery Rate is considered as less stringent. We can also cite other +methods that are not widely used in Biology such as the Bonferroni Step-down (Holm) +correction or the Westfall and Young Permutation. + +### Normalization + +Last, but certainly not the least, to test a read count variable for differential +expression, a Normalization operation must be performed, since different sequencing depth +lead to different estimation of gene expression ! + +This Normalisation is performed differently by the Limma, DESeq or edgeR packages, which +is responsible a significant part of the differences between the packages. + +### R packages used in this companionship + +We are going to use most popular R packages DESeq2 and edgeR and it will be interesting to +compare the results returned by both packages. We will also try to give a shot to Limma +which was a very popular packages for analysing microarray results. Interestingly, Limma +has evolved and incorpores several methods to adapt to the more recent NGS RNAseq results. + +### References + +- **DESeq2**: Anders and Huber, Genome Biology 2010, 11:R106 + [DOI](https://doi.org/10.1186/gb-2010-11-10-r106) +- **edgeR**: Robinson, McCarthy and Smyth, + Bioinformatics 2010, 26 p 139 [DOI](https://doi.org/10.1093/bioinformatics/btp616) +- **Limma**: Ritchie, Phipson, Wu, Hu, Law, Shi, et al., Nucleic Acids Res. 2015;43: e47. + [DOI](https://doi.org/10.1093/nar/gkv007) + + +--- diff --git a/docs/bulk_RNAseq-IOC/26_deseq2.md b/docs/bulk_RNAseq-IOC/26_deseq2.md index 0f3c52d73..f30873279 100644 --- a/docs/bulk_RNAseq-IOC/26_deseq2.md +++ b/docs/bulk_RNAseq-IOC/26_deseq2.md @@ -1,36 +1,104 @@ -![](images/galaxylogo.png) - -# `DESeq2` - ----- -![](images/tool_small.png) - - 1. Let's create a clean fresh history (`wheel` --> `Create New`) and name it DESeq2 ![](images/wheel.png) - 2. Copy the `.Counts`datasets from your `STAR`/ `HISAT2` history to this new history - (`wheel` --> `Copy datasets`) - 3. Select the `DESeq2` tool with the following parameters: - 1. `how`: Select group tags corresponding to levels - 2. In `Factor`: - 1. In `1: Factor` - - `Specify a factor name`: Treatment - - In `Factor level`: - - In `1: Factor level`: - - `Specify a factor level`: treated - - `Counts file(s)`: the 3 gene count files with `treat` in their name - - In `2: Factor level`: - - `Specify a factor level`: untreated - - `Counts file(s)`: the 4 gene count files with `untreat` in their name - 2. Click on `Insert Factor` (not on `Insert Factor level`) - 3. In `2: Factor` - - `Specify a factor name` to Sequencing - - In `Factor level`: - - In `1: Factor level`: - - `Specify a factor level`: Paired - - `Counts file(s)`: the 4 gene count files with `paired` in their name - - In `2: Factor level`: - - `Specify a factor level`: Single - - `Counts file(s)`: the 3 gene count files with `single` in their name - 3. `Files have header?`: Yes - 4. `Output normalized counts table`: Yes - 5. `Execute` - +# Analysis of differential gene expression in PRJNA630433 using `DESeq2` + +## DESeq2 Analysis + +To begin, navigate to the history `PRJNA630433 FeatureCounts Counting on HISAT2 bam +alignments` and copy the three dataset collections of counts generated by FeatureCounts: +`Dc FeatureCounts counts`, `Mo FeatureCounts counts` and `Oc FeatureCounts counts` into a +new history that you will name `PRJNA630433 DESeq2 analysis` + +Then, search for `DESeq2` in the tool search bar + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `DESeq2` settings" + - how + + --> Select datasets per levels + - 1: Factor + + --> Tissue + - 1: Factor level + + Note that there will be three factor levels in this analysis: Dc, Mo and Oc. + + --> Oc + + - Counts file(s) + + --> select the data collection icon, then `15: Oc FeatureCounts counts` + - 2: Factor level + + --> Mo + + - Counts file(s) + + --> select the data collection icon, then `10: Mo FeatureCounts counts` + - 3: Factor level (you must click on :heavy_plus_sign: `Insert Factor level`) + + --> Dc + + - Counts file(s) + + --> select the data collection icon, then `5: Mo FeatureCounts counts` + - (Optional) provide a tabular file with additional batch factors to include in the model. + + --> Leave to `Nothing selected` + - Files have header? + + --> Yes + - Choice of Input data + + --> Count data + - Advanced options + + --> No, leave folded + - Output options + + --> Unfold and check `Output all levels vs all levels of primary factor (use when + you have >2 levels for primary factor)` in addition to the already checked + `Generate plots for visualizing the analysis results` + + --> Leave `Alpha value for MA-plot` to 0,1: note that this option is used for + plots and does not impact DESeq2 results + - `Run Tool` + +??? warning "Note on the order of Factors levels in the DESeq2 html form" + As specified in the help section of the DESeq2 html form, the order of the Factors + levels matters ! See why in that section. + + In a nutshell, the Factor level you put as last in the form, will be taken as the + reference Factor level. + + Thus in our use case, the condition `Mo` will serve as reference condition for + differential gene expression in the DESeq2 analysis. + +## Inspect DESeq2 plots + +There is a lot of information here which we will discuss online or in live + +## Add a missing header to DESeq2 tabular outputs + +If you have a look to three datasets in the collection `DESeq2 result files on data 4, +data 3, and others`, you'll see that a header indicating what is the content of the 7 +columns is missing. This lack of header is inconfortable when you are not very familiar +with DE analyses. + +Indeed, this header should be +``` +GeneID Base_mean log2FC StdErr Wald-Stats P-value P-adj +``` + +Fortunately, there is a nice tool in Galaxy to quickly add this header. + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Add Header` settings" + - List of Column headers (comma delimited, e.g. C1,C2,...) + + --> `GeneID,Base_mean,log2FC,StdErr,Wald-Stats,P-value,P-adj` + - Data File (tab-delimted) + + --> Select the data collection icon, then `DESeq2 result files on data 4, data 3, + and others` + - `Run Tool` + +:warning: Rename the new collection `DESeq2 Results Tables` + +--- diff --git a/docs/bulk_RNAseq-IOC/27_1_limma.md b/docs/bulk_RNAseq-IOC/27_1_limma.md new file mode 100644 index 000000000..1834976a8 --- /dev/null +++ b/docs/bulk_RNAseq-IOC/27_1_limma.md @@ -0,0 +1,96 @@ +# Analysis of differential gene expression in PRJNA630433 using `limma` + +## limma Analysis + +To begin, navigate to the history `PRJNA630433 FeatureCounts Counting on HISAT2 bam +alignments` and copy the three dataset collections of counts generated by FeatureCounts: +`Dc FeatureCounts counts`, `Mo FeatureCounts counts` and `Oc FeatureCounts counts` into a +new history that you will name `PRJNA630433 limma analysis` + +Then, search for `limma` in the tool search bar + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `limma` settings" + - Differential Expression Method + + --> `limma-voom` + - Apply voom with sample quality weights? + + --> `No` + - Count Files or Matrix? + + --> Separate Count Files + - 1: Factor/Name + + --> Tissue + - 1: Factor/1: Group + + Note that there will be three Groups (ie factor levels) in this analysis: Dc, Mo and Oc. + + --> Oc + + - Counts file(s) + + --> select the data collection icon, then `15: Oc FeatureCounts counts` + - 2: Factor/2: Group + + --> Mo + + - Counts file(s) + + --> select the data collection icon, then `10: Mo FeatureCounts counts` + - 3: Factor level (you must click on :heavy_plus_sign: `Insert Group`) + + --> Dc + + - Counts file(s) + + --> select the data collection icon, then `5: Mo FeatureCounts counts` + - Use Gene Annotations? + + --> `No` + - Input Contrast information from file? + + --> `No` + - 1: Constrast + + --> `Mo-Dc` + - 2: Constrast (click :heavy_plus_sign: `Insert Contrast`) + + --> `Oc-Dc` + - 3: Constrast (click :heavy_plus_sign: `Insert Contrast`) + + --> `Oc-Mo` + - Filter Low Counts + + --> No, leave folded + - Output options + + --> Leave folded + - Advanced options + + --> Put `P-Value Adjusted Threshold` to 0.1 (to be consistent with DESeq settings) + + --> Leave other advanced options unchanged + - `Run Tool` + +??? warning "Note on the order of Factors levels (Groups) in the limma html form" + In contrast to DESeq2, the order of the Factors levels (Groups) does not matter with + the limma approach. + + This is because here you specify manually the comparison formulas. Yet, in these + formula, the order of the levels matters ! + + Thus when we specify `Mo-Dc` this implies specifically that we consider the Dc as the + reference level: we "subtract" the test level `Mo` from the reference level `Dc` + +## Inspect limma plots + +There is a lot of information here which we will discuss online or in live. You should +also compare these plots side by side with the plots generated by edgeR or DESeq, +especially edgeR since the format of plot reporting is very similar between limma and +edgeR. + +## Here, no need for adding header to limma tabular outputs ! +However, note that the loom headers are not exactly the same as the edgeR or DESeq2 headers. + +--- \ No newline at end of file diff --git a/docs/bulk_RNAseq-IOC/27_edgeR.md b/docs/bulk_RNAseq-IOC/27_edgeR.md index 0f3c52d73..9760862aa 100644 --- a/docs/bulk_RNAseq-IOC/27_edgeR.md +++ b/docs/bulk_RNAseq-IOC/27_edgeR.md @@ -1,36 +1,90 @@ -![](images/galaxylogo.png) - -# `DESeq2` - ----- -![](images/tool_small.png) - - 1. Let's create a clean fresh history (`wheel` --> `Create New`) and name it DESeq2 ![](images/wheel.png) - 2. Copy the `.Counts`datasets from your `STAR`/ `HISAT2` history to this new history - (`wheel` --> `Copy datasets`) - 3. Select the `DESeq2` tool with the following parameters: - 1. `how`: Select group tags corresponding to levels - 2. In `Factor`: - 1. In `1: Factor` - - `Specify a factor name`: Treatment - - In `Factor level`: - - In `1: Factor level`: - - `Specify a factor level`: treated - - `Counts file(s)`: the 3 gene count files with `treat` in their name - - In `2: Factor level`: - - `Specify a factor level`: untreated - - `Counts file(s)`: the 4 gene count files with `untreat` in their name - 2. Click on `Insert Factor` (not on `Insert Factor level`) - 3. In `2: Factor` - - `Specify a factor name` to Sequencing - - In `Factor level`: - - In `1: Factor level`: - - `Specify a factor level`: Paired - - `Counts file(s)`: the 4 gene count files with `paired` in their name - - In `2: Factor level`: - - `Specify a factor level`: Single - - `Counts file(s)`: the 3 gene count files with `single` in their name - 3. `Files have header?`: Yes - 4. `Output normalized counts table`: Yes - 5. `Execute` - +# Analysis of differential gene expression in PRJNA630433 using `edgeR` + +## edgeR Analysis + +To begin, navigate to the history `PRJNA630433 FeatureCounts Counting on HISAT2 bam +alignments` and copy the three dataset collections of counts generated by FeatureCounts: +`Dc FeatureCounts counts`, `Mo FeatureCounts counts` and `Oc FeatureCounts counts` into a +new history that you will name `PRJNA630433 edgeR analysis` + +Then, search for `edgeR` in the tool search bar + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `edgeR` settings" + - Count Files or Matrix? + + --> Separate Count Files + - 1: Factor/Name + + --> Tissue + - 1: Factor/1: Group + + Note that there will be three Groups (ie factor levels) in this analysis: Dc, Mo and Oc. + + --> Oc + + - Counts file(s) + + --> select the data collection icon, then `15: Oc FeatureCounts counts` + - 2: Factor/2: Group + + --> Mo + + - Counts file(s) + + --> select the data collection icon, then `10: Mo FeatureCounts counts` + - 3: Factor level (you must click on :heavy_plus_sign: `Insert Group`) + + --> Dc + + - Counts file(s) + + --> select the data collection icon, then `5: Mo FeatureCounts counts` + - Use Gene Annotations? + + --> `No` + - Formula for linear model + + --> Leave empty + - Input contrasts manually or through a file + + --> `manually` + - 1: Constrast + + --> `Mo-Dc` + - 2: Constrast (click :heavy_plus_sign: `Insert Contrast`) + + --> `Oc-Dc` + - 3: Constrast (click :heavy_plus_sign: `Insert Contrast`) + + --> `Oc-Mo` + - Filter Low Counts + + --> No, leave folded + - Output options + + --> Leave folded + - Advanced options + + --> Put `P-Value Adjusted Threshold` to 0.1 (to be consistent with DESeq settings) + + --> Leave other advanced options unchanged + - `Run Tool` + +??? warning "Note on the order of Factors levels (Groups) in the edgeR html form" + In contrast to DESeq2, the order of the Factors levels (Groups) does not matter with + the edgeR approach. + + This is because here you specify manually the comparison formulas. Yet, in these + formula, the order of the levels matters ! + + Thus when we specify `Mo-Dc` this implies specifically that we consider the Dc as the + reference level: we "subtract" the test level `Mo` from the reference level `Dc` + +## Inspect edgeR plots + +There is a lot of information here which we will discuss online or in live. You should +also compare these plots side by side with the plots generated by DESeq2. + +## Here, no need for adding header to edgeR tabular outputs ! + +--- \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 1e8a3d210..dfdc8e1ae 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -78,11 +78,12 @@ nav: - Week 4: - - Review on week-2 work: bulk_RNAseq-IOC/24_exercices_week_03_review.md + - Review on week-3 work: bulk_RNAseq-IOC/24_exercices_week_03_review.md - Differential Gene Expression Analysis: bulk_RNAseq-IOC/25_DE_intro.md - Week 3 exercices: - DESeq2: bulk_RNAseq-IOC/26_deseq2.md - - EdgeR: bulk_RNAseq-IOC/27_edgeR.md + - edgeR: bulk_RNAseq-IOC/27_edgeR.md + - limma: bulk_RNAseq-IOC/27_1_limma.md - Visualisations: bulk_RNAseq-IOC/28_visualisations.md - Volcano plot: bulk_RNAseq-IOC/29_volcano.md