From 3e4fc85661bac056003a6d0a2962fb65ee256f50 Mon Sep 17 00:00:00 2001 From: drosofff Date: Mon, 26 Feb 2024 10:06:47 +0100 Subject: [PATCH] finished with GOseq part --- .../bulk_RNAseq-IOC/31_GO_enrichment_intro.md | 103 ++++++- docs/bulk_RNAseq-IOC/32_GOseq.md | 283 ++++++++++++++---- 2 files changed, 315 insertions(+), 71 deletions(-) diff --git a/docs/bulk_RNAseq-IOC/31_GO_enrichment_intro.md b/docs/bulk_RNAseq-IOC/31_GO_enrichment_intro.md index 932dff39a..1921ecf91 100644 --- a/docs/bulk_RNAseq-IOC/31_GO_enrichment_intro.md +++ b/docs/bulk_RNAseq-IOC/31_GO_enrichment_intro.md @@ -1,17 +1,96 @@ -![](images/lamp.png) +## Analyzing GO Enrichment from DEGs -# Analysis of functional enrichment among the differentially expressed genes +Gene Ontology (GO) enrichment analysis is used to +identify `biological processes`, `cellular components`, and `molecular functions` that are +significantly over-represented (or under-represented) in a set of genes compared to a +background list. This is particularly valuable when analyzing differentially expressed +genes (DEGs) identified from RNA-seq or microarray experiments. -We have extracted genes that are differentially expressed in treated (Pasilla gene-depleted) -samples compared to untreated samples. We would like to know if there are categories of -genes that are enriched among the differentially expressed genes. +### Individual Gene Analysis (IGA) -Gene Ontology (GO) analysis is widely used to reduce complexity and highlight biological -processes in genome-wide expression studies. +- [x] **Concept** + + This approach tests each GO term individually for enrichment within the DEG list. -However, standard methods give biased results on RNA-seq data due to over-detection -of differential expression for long and highly-expressed transcripts. +- [x] **Methods:** + * **Hypergeometric test:** Calculates the probability of observing +the number of DEGs in a specific GO term by chance. + * **Fisher's exact test:** Similar to +the hypergeometric test but suitable for smaller datasets. + +- [x] **Limitations:** + + * Ignores the hierarchical structure of GO, potentially missing related terms. + * Susceptible to multiple testing issues, requiring correction methods like Bonferroni + adjustment. -The goseq tool provides methods for performing GO analysis of RNA-seq data, -taking length bias into account. The methods and software used by goseq are equally -applicable to other category based tests of RNA-seq data, such as KEGG pathway analysis. \ No newline at end of file +- [x] **Advanced Considerations:** + + * **Multiple Testing Correction:** As mentioned, IGA is susceptible to multiple testing + issues. Here are some commonly used correction methods: + * *Bonferroni adjustment:* A conservative approach that controls the family-wise + error rate (FWER) but can be overly stringent. + * *Benjamini-Hochberg (BH) procedure:* Controls the false discovery rate (FDR) and + is less conservative than Bonferroni. + * *False discovery rate (q-value):* Provides a measure of significance adjusted + for multiple testing. + * **Gene Ontology Consortium (GOC) recommendations:** The GOC recommends using a + combination of statistical significance (p-value) and fold change thresholds to + identify relevant enriched terms, acknowledging the limitations of p-values alone. + +### Gene Set Analysis (GSA) + +- [x] **Concept:** + + Considers the entire set of DEGs and their relationships within the GO hierarchy. + +- [x] **Methods:** + + * **Pathway analysis tools:** Tools like Enrichr, clusterProfiler, and GSEA analyze + pre-defined gene sets like KEGG pathways and analyze enrichment within DEGs. + * **GO-based GSA methods:** + * **Rank-based approaches:** Assign a rank to each gene based on its differential + expression and analyze enrichment within ranked gene sets. (e.g., GSEA) + * **Permutation-based approaches:** Randomly shuffle gene labels and recalculate + enrichment scores to assess statistical significance. (e.g., fgsea) + - Tools like GOseq, fgsea, and piano utilize various + statistical models to account for the hierarchical structure of GO and identify + enriched functional categories. + +- [x] **Advantages of using GSA:** + * Incorporates information about gene relationships within the GO hierarchy, leading + to more biologically relevant insights. + * Reduces the burden of multiple testing compared to individual GO term analysis. + +### Advanced Methods for Deeper Exploration + +- [x] **Cluster enrichment analysis:** Tools like CeaGO group related GO terms based on +semantic similarity and analyze enrichment within these clusters. This approach can reveal +broader functional themes beyond individual terms. +- [x] **Network analysis:** Integrating protein-protein interaction data with GO +annotations allows identifying functionally connected subnetworks enriched in DEGs. This +provides a network-based understanding of the underlying biological processes. + +### Choosing the right method + +The choice of method depends on factors like: + +- [x] **Size of the DEG list:** For smaller lists, IGA might be sufficient, while larger lists +benefit from GSA approaches. +- [x] **Research question:** If interested in specific GO terms, +IGA might be suitable. For broader functional insights, GSA is preferred. + +### Additional considerations + +- [x] **Over-detection bias** standard methods give biased results on RNA-seq data due to +over-detection of differential expression for long and highly-expressed transcripts. The +==**goseq**== tool provides methods for performing GO analysis of RNA-seq data, taking length +bias into account. The methods and software used by goseq are equally applicable to other +category based tests of RNA-seq data, such as KEGG pathway analysis. +- [x] **Background gene list:** Choosing a relevant background list representing the genes not +differentially expressed is crucial for accurate enrichment analysis. +- [x] **Multiple testing correction:** Apply appropriate correction methods to account for +testing multiple GO terms simultaneously. +- [x] **Visualization:** Utilize graphical representations like bar +charts or heatmaps to visualize enriched GO terms and their significance levels. +--- \ No newline at end of file diff --git a/docs/bulk_RNAseq-IOC/32_GOseq.md b/docs/bulk_RNAseq-IOC/32_GOseq.md index ee11fe252..5110e3c54 100644 --- a/docs/bulk_RNAseq-IOC/32_GOseq.md +++ b/docs/bulk_RNAseq-IOC/32_GOseq.md @@ -1,59 +1,225 @@ -![](images/galaxylogo.png) - -## Prepare the datasets for GOSeq - ----- -![](images/tool_small.png) - -1. Select `Compute an expression on every row` tool with - - `Add expression`: bool(c7<0.05) - - `as a new column to`: the DESeq2 result file ----- -![](images/tool_small.png) - -1. `Cut` tool with - - `Cut columns`: c1,c8 - - `Delimited by`: Tab - - `From`: the output of the Compute tool ----- -![](images/tool_small.png) - -1. `Change Case` tool with - - `From`: the output of the previous Cut tool - - `Change case of columns`: c1 - - `Delimited by`: Tab - - `To`: Upper case +# Analysis of Gene Ontology enrichments using the Galaxy Tool `goseq` + +As mentioned previously NGS read count data are inherently biased by the length of +transcripts. + +How does this bias translate in DE tables ? + +==Not== in biased Fold Changes ! Fold changes are count ratios for a given gene, so bias +is compensated for. + +In contrast, genes with a high number of counts (whether due to a high level of expression +or a large transcript) tend to be ==over-**detected**== since the p-value returned by the +statistical test depends on the count base mean. This becomes a major issue in GO-based +Gene Set Analyses since gene sets are selected on the basis of their p-values/p-adjusted-values. + +The `goseq` tool provides methods for performing GO analysis of RNA-seq data, taking +length bias into account. + +## `goseq` analysis of the DESeq2 DE tables in the use-case PRJNA630433 + +### Gather needed data in a new history +- [x] Although we are going to focus on the DESeq2 tables, the approach would be the same +with other DE callers: feel free to test it latter on. +- [x] to perform the analysis, we will need a gene length list. This list was generated +[previously by the featureCounts tool](../21_FeatureCounts) and is therefore available in +the history `PRJNA630433 FeatureCounts Counting on HISAT2 bam alignments`. + + --> Copy one of the collection among the `Dc`, `Mo` or `Oc FeatureCounts Feature Length` + collections in this history in a new history which you will name `PRJNA630433 GOseq + analysis`. Note that Feature lenght datasets are all identical, we will just need to + extract one for the goseq analysis. + +- [x] Finally, copy the collection of DEseq2 tables from the history `PRJNA630433 DESeq2 + analysis` in the new history `PRJNA630433 FeatureCounts Counting on HISAT2 bam + alignments`. This collection should be named `DESeq2 Results Tables`. + +### Prepare the Gene Set(s) + +Since we are going to perform a GO-based Gene Set Analysis we first need to define the +Gene Sets ! We will keept the advantage of collections and treat in parallel the three +comparisons (Mo vs Dc, Oc vs Dc and Oc vs Mo). + +#### Clean up tables from `NA`s +We first clean up the Tables by removing all lines that contain `NA` values using the Tool +`Select lines that match an expression` + +This is an important step because it allows to reduce the "gene space". As mentioned in +introduction, choosing a relevant background list representing the genes ==not +differentially expressed== is crucial for accurate enrichment analysis. By removing the +lines containing NAs, we shrink our datasets and retains only genes that are accessible to +tests, ie, that are at least significantly ==expressed== in the considered tissue/experience. + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Select lines that match an expression` settings" + - Select lines from + + --> Click the collection icon and select `DESeq2 Results Tables` + - that + + --> `Not matching` (be careful, ==not== matching) + - the pattern + + --> `\tNA$` (we seach for lines that contain any tabulation mark, followed by `NA` and + this `NA` is a word, ie, it is followed by a space or another tabulation mark) + - Keep header line + + --> `Yes` + - `Run Tool` + +#### Add a boolean column to tag the gene Set +Now we are going to select our gene sets from DE table collection using a treatment which +evaluates a each line of the table and returns a boolean value in a new column. Genes with +a `True` value will been considered as part of the gene set, whereas genes with a `False` +value will be considered as part of the background list for the GSA. + +Let's do this, using a stringent evaluation, ie we will tag genes for which both `p-adjust +< 0.001` and `|log2FC| > 2`. In your own analyses you may have to use less stringent +filtering. It depends mostly of your datasets and whether you get a sufficiently large enough +gene set after filtering. Considere that the gene set size should be between 1 and 10 % of +your background genes. + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Compute on rows` settings" + - Input file + + --> Click the collection icon and select `Select on collection 9` + - Input has a header line with column names? + + --> `Yes` + - Add expression + + --> `c7 < 0.001 and abs(c3) > 2` + - Mode of the operation -This generates the first input for goseq. We need as second input for goseq, the gene lengths. -We can use there the gene length generated by featureCounts tool and reformat it a bit. - ----- -![](images/tool_small.png) - -1. Copy one output of type `...: Feature lengths` of the 7 featureCounts runs in the history `STAR`/`HISAT2` -2. Rename it `Lengths` -3. `Change Case` tool with - - `From`: the feature lengths (output of featureCounts tool) - - `Change case of columns`: c1 - - `Delimited by`: Tab - - `To`: Upper case - -We have now the two required input files for goseq. - -## Perform GO analysis - ----- -![](images/tool_small.png) - -1. Select `goseq` tool with - - `Differentially expressed genes file` : first file generated by Change Case tool on previous step - - `Gene lengths file` : second file generated by Change Case tool on previous step - - `Gene categories` : Get categories - - `Select a genome to use` : Fruit fly (dm6) - - `Select Gene ID format` : Ensembl Gene ID - - `Select one or more categories` : GO: Cellular Component, GO: Biological Process, GO: Molecular Function + --> `Append` + - The new column name -goseq generates a big table with the following columns for each GO term: + --> `boolean` + - `Run Tool` + +#### Cut columns 1 and 8 to get the final gene sets and backgrounds + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Advanced Cut columns from a table` settings" + - File to cut + + --> Click the collection icon and select `Compute on collection 10` + - Operation + + --> `Keep` + - Delimited by + + --> `Tab` + - Cut by + + --> `fields` + - List of Fields + + --> `boolean` + - `Run Tool` + +:warning: As this is the last step of the construction of the gene set lists, you should +rename for clarity the returned collection as `Gene Lists`. + +### Extract a single gene length table dataset +A single gene length table can be extract from the collection which we initially copied from +the history `PRJNA630433 FeatureCounts Counting on HISAT2 bam alignments` + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `Extract dataset` settings" + - Input List + + --> Click the collection icon and select `Mo featureCounts: Feature lengths` + - How should a dataset be selected? + + --> `The first dataset` + + :warning: it would work equally well with `Select by element identifier` or + `Select by index` + - `Run Tool` + +:warning: For clarity, also rename this single dataset as `Gene lengths` + +### Perform goseq analysis + +Our two inputs are now ready for goseq: +- The list of tagged genes. `True` for the gene set, `False` for the background genes +- The length of genes (aggregated size of exons) to correct for the length bias of detection. + +goseq proposes 3 types of GSA methods + +- [x] The **Wallenius method** approximates the true distribution of numbers of members + of a category amongst DE genes by the Wallenius non-central hypergeometric distribution. + This distribution assumes that within a category all genes have the same probability of + being chosen. Therefore, this approximation works best when the range in probabilities + obtained by the probability weighting function (the regression over gene lenght) is small. + This is the method specifically developed in the goseq package as well as the one we are + going to choose in this IOC. + +- [x] The **Sampling method** uses random sampling to approximate the true distribution + and uses it to calculate the p-values for over (and under) representation of categories. + Although this is the most accurate method given a high enough value of sampling number, + its use quickly becomes computationally prohibitive. It may sometimes be desirable to + use random sampling to generate the null distribution for category membership. For + example, to check consistency against results from the Wallenius approximation. This is + easily accomplished by using the method option to additionally specify sampling and the + number of samples to generate. + +- [x] The **Hypergeometric Method** assumes ==there is no bias== in power to detect + differential expression at all and calculates the p-values using a standard hypergeometric + distribution (no length bias correction is performed). Useful if you wish to test the + effect of length bias on your results. :warning: Hypergeometric method should NEVER be + used for producing results for biological interpretation of RNA-seq data. Indeed, if + length bias is truly not present in your data, goseq will produce a nearly flat PWF plot, + no length bias correction will be applied to your data, and all methods will produce the + same results. + + +!!! info "![](images/tool_small.png){width="25" align="absbottom"} `goseq tests` settings" + - Differentially expressed genes file + + --> Click the collection icon and select `Gene Lists` (the renamed collection) + - Gene lengths file + + --> `Gene lengths` (the renamed single dataset) + - Gene categories + + Because we are working here with Mus musculus, we can rely on the tool to fetch + directly the GO categories from a remote database. :warning: For *Nesseria gon* As + well as *Apis mel* you will have to construct and use your own Gene categorie file. + And... Yes, the section title `Gene categories` is not sufficiently explicit and + should rather be `GO categories` + + --> `Get categories` + - Select a genome to use + + --> `Mouse (mm10)` + - Select Gene ID format + + --> `Ensembl Gene ID` + - Select one or more categories + + --> Check only `GO: Biological Process` (for simplicity in this first run) + - Method Options + + --> Use Wallenius method `Yes` + + --> Use Hypergeometric method `No` + + --> Sampling number `5000` :warning: We also run this method to compare it with + the Wallenius method. + - Advanced Options + + --> Select a method for multiple hypothesis testing correction `Benjamini-Hochberg [FDR]` + + --> Count genes without any category? `Yes` + - Output Options + + --> Output Top GO terms plot? `Yes` + + --> Produce diagnostic plots? `Yes` + + --> Extract the DE genes for the categories (GO/KEGG terms)? `Yes` + - `Run Tool` + +`goseq` generates a big table with the following columns for each GO term: |Column | Description | |-----------------------------|-------------------------------------------------------------------------------------------------------| @@ -69,9 +235,8 @@ goseq generates a big table with the following columns for each GO term: To identify categories significantly enriched/unenriched below some p-value cutoff, it is necessary to use the adjusted p-value. -#### How many GO terms are over-represented at adjusted P value < 0.05? -#### Under-represented? - -#### How are the over-represented GO terms divided between MF, CC and BP? -#### And for under-represented GO terms? +??? question "How many GO terms are **over**-represented at adjusted P value < 0.05?" + +??? question "How many GO terms are **under**-represented at adjusted P value < 0.05?" +