RNAseq ioc 1 #72

Merged
merged 2 commits into from
Sep 14, 2023
55 changes: 55 additions & 0 deletions docs/bulk_RNAseq-IOC/01_IOC_RNAseq.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
## Introduction to the IOC ARTbio 064: Bulk RNAseq Analyses
**November 2023**

In this Interactive Online Companionship (IOC), we will practice the analysis
of bulk RNAseq data.

### Program / Schedule

### Week 1

**3-hour Zoom video-conference covering:**

1. Introduction of the Companions and Instructors (10 min)
- Presentation of the IOC general workflow (Scheme) (15 min)
- Presentation of the IOC tools (2 hours)
1. Zoom (5 min)
- Starbio (5 min)
- Slack (10 min)
- GitHub (20 min)
- Psilo storage (15 min)
- Galaxy (65 min)
<!-- At this point we are at 2:25; make a schedule on Google Sheets -->
<ol start=4>
<li> Import data from Psilo to Galaxy
<li> Program of Week 2
<ol type="a">
<li> Presentation of exercises with digital tools
<li> Presentation of pretreatment and metadata organisation, and of the related tasks to be done
</ol>
</ol>

### Week 2
1. Questions on Week 2
1. Data upload
2. Quality control
- Program of Week 3
1. Reference datasets (GTF, genome, subset, UCSC tables, Ensembl BioMart)

### Week 3
2. Questions on Week 2
1. Reference
- GTF manipulation
- Program of the Week 3
1. Mapping and mappers
2. Inspection of Bam files

3. Analysis of the differential gene expression
1. Count the number of reads per annotated gene
2. Viewing datasets side by side using the Scratchbook
3. Identification of the differentially expressed features
4. Visualization of the differentially expressed genes
5. Analysis of functional enrichment among the differentially expressed genes

Some parts of this IOC were inspired by
[Reference-based RNAseq analysis](https://galaxyproject.github.io/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html)
of the Galaxy Training Network (GTN)
64 changes: 64 additions & 0 deletions docs/bulk_RNAseq-IOC/Cutadapt.md
@@ -0,0 +1,64 @@
![](images/galaxylogo.png)

# Filtering datasets to remove or trim low quality sequences

## This step is optional and should be performed by 50% of attendees.

## Cutadapt with single reads ![](images/tool_small.png)

----
1. Create a new history `Cutadapt` (`wheel` --> `Create New`) ![](images/wheel.png)
2. Copy the fastq files from the RNAseq data library to this new history (`wheel` --> `Copy datasets`)
3. Select the `Cutadapt` tool
4. Start with selecting `Single-end` in the `Single-end or Paired-end reads?` menu
5. Select the multiple datasets button for this menu
6. Cmd-Click for discontinuous multiple selection of `single` fastq.gz files (3 datasets)
7. `Filter Options`
- `Minimum length`: 20
8. `Read Modification Options`
- `Quality cutoff`: 20
9. `Output Options`
- `Report`: Yes
10. Do not change the other available parameters and click `Execute`
----
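
To give an intuition of what `Quality cutoff`: 20 and `Minimum length`: 20 do to a read, here is a minimal Python sketch of 3' quality trimming followed by a length filter. It follows the running-sum idea described in the Cutadapt documentation (subtract the cutoff from each quality, sum from the 3' end, cut where the sum is lowest), but it is a simplified illustration and not Cutadapt's actual code. The function names `quality_trim_3prime` and `trim_and_filter` are hypothetical.

```python
def quality_trim_3prime(qualities, cutoff=20):
    """Return how many bases to keep after trimming the 3' end.

    qualities: list of Phred scores, one per base (5' to 3').
    Simplified sketch: walk from the 3' end, accumulate (quality - cutoff),
    and cut at the position where this running sum is minimal.
    """
    running_sum = 0
    min_sum = 0
    keep = len(qualities)  # by default, keep the whole read
    for i in range(len(qualities) - 1, -1, -1):
        running_sum += qualities[i] - cutoff
        if running_sum < min_sum:
            min_sum = running_sum
            keep = i
    return keep


def trim_and_filter(sequence, qualities, cutoff=20, min_length=20):
    """Trim a read, then discard it (return None) if it became too short."""
    keep = quality_trim_3prime(qualities, cutoff)
    trimmed = sequence[:keep]
    return trimmed if len(trimmed) >= min_length else None


if __name__ == "__main__":
    # Hypothetical 10-base read whose quality drops toward the 3' end.
    quals = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
    print(quality_trim_3prime(quals, cutoff=10))
    print(trim_and_filter("ACGTACGTAC", quals, cutoff=10, min_length=4))
```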

## Cutadapt with paired-end reads ![](images/tool_small.png)

----
Repeat the same procedure as above, except that you select `Paired-end` in step 4:
re-run the tool using the re-run button on one Cutadapt instance and just select `Paired-end`
instead of `Single-end`.

- Then you have two input boxes, one for file #1 and one for file #2.

- In the box `file #1` click the `multiple datasets` button and carefully select
the fastq.gz files with the `_1` suffix

- In the box `file #2` click the `multiple datasets` button and carefully select
the fastq.gz files with the `_2` suffix

- Do not change the other parameters (they are set to the same value as previously because
you used the re-run button).

- Click the `Execute` button

----

## Run MultiQC on Cutadapt jobs ![](images/tool_small.png)

----
1. Select `MultiQC` tool
2. Select `Cutadapt/Trim Galore!` in the menu `Which tool was used generate logs?`
3. Cmd-Select the `Report` datasets generated by Cutadapt
4. Press `Execute`
5. Now, the boring but essential job: carefully rename the `Output` datasets generated
by Cutadapt. To do so, use the `Info` button at the bottom of the green dataset
boxes. ![](images/info.png)

Example: Rename `Cutadapt on data 10 and data 9: Read 2 Output` to `GSM461181_2_treat_paired.fastq.gz`

6. Trash the 11 original (unfiltered, untrimmed) fastq.gz files. This is important to avoid mixing
filtered and unfiltered datasets in the next steps.
----


36 changes: 36 additions & 0 deletions docs/bulk_RNAseq-IOC/DEDESeq2.md
@@ -0,0 +1,36 @@
![](images/galaxylogo.png)

# `DESeq2`

----
![](images/tool_small.png)

1. Let's create a clean fresh history (`wheel` --> `Create New`) and name it DESeq2 ![](images/wheel.png)
2. Copy the `.Counts` datasets from your `STAR`/`HISAT2` history to this new history
(`wheel` --> `Copy datasets`)
3. Select the `DESeq2` tool with the following parameters:
1. `how`: Select group tags corresponding to levels
2. In `Factor`:
1. In `1: Factor`
- `Specify a factor name`: Treatment
- In `Factor level`:
- In `1: Factor level`:
- `Specify a factor level`: treated
- `Counts file(s)`: the 3 gene count files with `treat` in their name
- In `2: Factor level`:
- `Specify a factor level`: untreated
- `Counts file(s)`: the 4 gene count files with `untreat` in their name
2. Click on `Insert Factor` (not on `Insert Factor level`)
3. In `2: Factor`
- `Specify a factor name`: Sequencing
- In `Factor level`:
- In `1: Factor level`:
- `Specify a factor level`: Paired
- `Counts file(s)`: the 4 gene count files with `paired` in their name
- In `2: Factor level`:
- `Specify a factor level`: Single
- `Counts file(s)`: the 3 gene count files with `single` in their name
3. `Files have header?`: Yes
4. `Output normalized counts table`: Yes
5. `Execute`
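
When picking the count files for each factor level, note that `treat` is a substring of `untreat`, so a naive search for `treat` matches all 7 files. The Python sketch below shows one way to assign each file to its two factor levels from its name. The file names are hypothetical placeholders that only mimic the `treat`/`untreat` and `paired`/`single` naming used above, and `levels_from_name` is an illustrative helper, not part of DESeq2 or Galaxy.

```python
from collections import Counter


def levels_from_name(name):
    """Return (treatment level, sequencing level) for a count file name."""
    # Check "untreat" first: every name containing "untreat" also contains "treat".
    if "untreat" in name:
        treatment = "untreated"
    elif "treat" in name:
        treatment = "treated"
    else:
        raise ValueError(f"no treatment tag in {name!r}")

    if "paired" in name:
        sequencing = "Paired"
    elif "single" in name:
        sequencing = "Single"
    else:
        raise ValueError(f"no sequencing tag in {name!r}")
    return treatment, sequencing


if __name__ == "__main__":
    # Hypothetical file names (assumption), 7 samples as in the exercise.
    names = [
        "s1_treat_single.counts",
        "s2_treat_paired.counts",
        "s3_treat_paired.counts",
        "s4_untreat_single.counts",
        "s5_untreat_single.counts",
        "s6_untreat_paired.counts",
        "s7_untreat_paired.counts",
    ]
    design = {name: levels_from_name(name) for name in names}
    for name, (treatment, sequencing) in design.items():
        print(name, treatment, sequencing, sep="\t")
    print(Counter(t for t, _ in design.values()))
    print(Counter(s for _, s in design.values()))
```

With these placeholder names, the counts per level match the tool form: 3 treated and 4 untreated files, 4 paired and 3 single files.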

36 changes: 36 additions & 0 deletions docs/bulk_RNAseq-IOC/DE_intro.md
@@ -0,0 +1,36 @@
# Analysis of the differential gene expression using `DESeq2`

![](images/lamp.png)

----

DESeq2 is a great tool for Differential Gene Expression (DGE) analysis.
It takes read counts and combines them into a table (with genes in the rows and samples in the columns).
Importantly, it applies size factor normalization by:

- Computing for each gene the geometric mean of read counts across all samples
- Dividing each gene's count in a sample by that gene's geometric mean across samples
- Using the median of these ratios within a sample as that sample’s size factor for normalization
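
The three steps above can be sketched in a few lines of Python. This is only an illustration of the median-of-ratios idea on a toy table, not the DESeq2 implementation. The function name `size_factors` and the toy counts are assumptions made for the example. Genes with a zero count in any sample are skipped here, because their geometric mean is zero.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping gene id -> list of raw counts (one per sample).
    Returns one size factor per sample.
    """
    n_samples = len(next(iter(counts.values())))
    ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c == 0 for c in gene_counts):
            continue  # geometric mean would be zero, so the ratio is undefined
        # Step 1: geometric mean of the gene across all samples (via logs)
        geo_mean = math.exp(sum(math.log(c) for c in gene_counts) / n_samples)
        # Step 2: ratio of each sample's count to that geometric mean
        for j, c in enumerate(gene_counts):
            ratios[j].append(c / geo_mean)
    # Step 3: the median ratio of a sample is its size factor
    return [median(r) for r in ratios]


if __name__ == "__main__":
    # Toy data (assumption): sample 2 was sequenced twice as deeply as sample 1.
    toy = {"geneA": [10, 20], "geneB": [100, 200], "geneC": [5, 10], "geneD": [0, 3]}
    factors = size_factors(toy)
    print(factors)
    # Normalized counts are the raw counts divided by the sample's size factor.
    for gene, row in toy.items():
        print(gene, [round(c / f, 2) for c, f in zip(row, factors)])
```

After dividing by the size factors, geneA, geneB and geneC have the same normalized value in both samples, which is what we expect when the only difference between the samples is sequencing depth.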

Multiple factors with several levels can then be incorporated in the analysis.
After normalization we can compare the response of the expression of any gene to
the presence of different levels of a factor in a statistically reliable way.

In our example, we have samples with two varying factors that can contribute to
differences in gene expression:

- Treatment (either treated or untreated)
- Sequencing type (paired-end or single-end)

Here, treatment is the primary factor that we are interested in.

The sequencing type is further information we know about the data that might affect
the analysis. Multi-factor analysis allows us to assess the effect of the treatment,
while taking the sequencing type into account too.

```
We recommend that you add as many factors as you think may affect gene expression in
your experiment. It can be the sequencing type, as here, but it can also be the
handling of the samples (if different people are involved in the library preparation),
other batch effects, etc.
```
117 changes: 117 additions & 0 deletions docs/bulk_RNAseq-IOC/DEseq2visu.md
@@ -0,0 +1,117 @@
![](images/galaxylogo.png)

# Visualisation of differential expression

Now we would like to extract the genes most differentially expressed in response to the treatment,
and then visualize them using a heatmap of the normalized counts and of
the z-score for each sample.

We will proceed in several steps:

- Extract the most differentially expressed genes using the DESeq2 summary file
- Extract the normalized counts for these genes for each sample, using the normalized count file generated by DESeq2
- Plot the heatmap of the normalized counts
- Compute the Z score of the normalized counts
- Plot the heatmap of the Z score of the genes

## Extract the most differentially expressed genes

----
![](images/tool_small.png)

1. Select the tool `Filter data on any column using simple expressions` to extract genes with a significant change in gene expression (adjusted p-value below 0.05) between treated and untreated samples:
1. `Filter`: the DESeq2 result file
2. `With following condition`: c7<0.05

The file with the independently filtered results can be used for downstream analysis:
it excludes genes with only a few read counts, which would not be called significantly differentially expressed anyway.

The generated file contains too many genes (632/STAR, ) to get a meaningful heatmap. Therefore, in the next step,
we will keep only the genes with an absolute fold change > 2 (|log2(fold change)| > 1).

----
![](images/tool_small.png)

1. Select the tool `Filter data on any column using simple expressions`
1. `Filter`: the differentially expressed genes (output of previous `Filter` tool)
2. `With following condition`: abs(c3)>1

We now have a table with 84/STAR, /HISAT2 lines corresponding to the most differentially expressed genes.
For each gene, we have its id, its mean normalized counts (averaged over all
samples from both conditions), its log2FC and other information.
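
The two `Filter` steps above can be summarised as a single test per line. The sketch below assumes, as in the steps above, that column 3 holds the log2FC and column 7 the adjusted p-value. The example rows are made up for illustration, and `filter_rows` is a hypothetical helper, not the Galaxy tool itself. Lines whose values cannot be read as numbers (for example `NA`) are simply dropped in this sketch.

```python
def parse_float(text):
    """Return text as a float, or None when it is not numeric (e.g. 'NA')."""
    try:
        return float(text)
    except ValueError:
        return None


def filter_rows(rows, padj_max=0.05, min_abs_log2fc=1.0):
    """Keep rows with c7 < padj_max and abs(c3) > min_abs_log2fc."""
    kept = []
    for row in rows:
        log2fc = parse_float(row[2])  # c3 (columns are 1-based in Galaxy)
        padj = parse_float(row[6])    # c7
        if log2fc is None or padj is None:
            continue
        if padj < padj_max and abs(log2fc) > min_abs_log2fc:
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Made-up rows (assumption): id, mean, log2FC, three other columns, adjusted p-value
    rows = [
        ["gene1", "500.0", "2.3", "x", "x", "x", "0.001"],   # kept
        ["gene2", "300.0", "-1.8", "x", "x", "x", "0.01"],   # kept (negative log2FC)
        ["gene3", "200.0", "0.4", "x", "x", "x", "0.001"],   # significant, small change
        ["gene4", "100.0", "3.0", "x", "x", "x", "0.2"],     # large change, not significant
        ["gene5", "1.0", "2.0", "x", "x", "x", "NA"],        # no adjusted p-value
    ]
    for row in filter_rows(rows):
        print("\t".join(row))
```

Using `abs()` on the log2FC keeps both up-regulated and down-regulated genes.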

We could plot the log2FC for the different genes, but here we would like to look at a
heatmap of expression for these genes in the different samples. So we need to extract the
normalized counts for these genes.

We will join the normalized count table generated by DESeq2 with the table we just generated,
to keep only the lines corresponding to the most differentially expressed genes.

## Extract the normalized counts of the most differentially expressed genes

----
![](images/tool_small.png)

- Create a Pasted Entry from the header line of the Filter output:

1. Copy the header of the final Filter output
2. Using the Upload tool select Paste/Fetch data and paste the copied data
3. *Set the Type to tabular* and select Start to upload a new Pasted Entry

----
![](images/tool_small.png)

- Concatenate datasets tool to add this header line to the Filter output:
1. select the `Concatenate datasets tail-to-head` tool
2. select the Pasted entry dataset
3. `+ Insert Dataset`
4. select the final `Filter output`

This ensures that the table of most differentially expressed genes has a header line and can be used in the next step.

----
![](images/tool_small.png)

- Join the normalized count table generated by DESeq2 with the table we just generated,
to keep only the lines corresponding to the most differentially expressed genes

1. select the `Join two Datasets side by side on a specified field` tool
- `Join`: the Normalized counts file (output of DESeq2 tool)
- `using column`: Column: 1
- `with`: most differentially expressed genes (output of the Concatenate tool)
- `and column`: Column: 1
- `Keep lines of first input that do not join with second input`: No
- `Keep the header lines`: Yes
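
Conceptually, this join keeps only the lines of the first table whose gene id (column 1) is also present in the second table, and puts the matching lines side by side. Below is a minimal Python sketch of that idea on toy tables. The data and the function name `join_on_first_column` are assumptions for illustration, and the exact column layout produced by the Galaxy tool may differ.

```python
def join_on_first_column(left, right):
    """Inner join of two tables (lists of rows) on their first column.

    The first row of each table is treated as a header and kept.
    Rows of `left` without a match in `right` are dropped.
    """
    left_header, left_rows = left[0], left[1:]
    right_header, right_rows = right[0], right[1:]
    right_by_id = {row[0]: row for row in right_rows}

    joined = [left_header + right_header]
    for row in left_rows:
        match = right_by_id.get(row[0])
        if match is not None:
            joined.append(row + match)
    return joined


if __name__ == "__main__":
    # Toy normalized counts (assumption): gene id + 2 samples
    normalized = [
        ["id", "sampleA", "sampleB"],
        ["gene1", "10.5", "80.2"],
        ["gene2", "33.0", "31.9"],
        ["gene3", "250.1", "60.4"],
    ]
    # Toy table of most differentially expressed genes (assumption)
    top_genes = [
        ["GeneID", "log2FC"],
        ["gene3", "-2.0"],
        ["gene1", "2.9"],
    ]
    for row in join_on_first_column(normalized, top_genes):
        print("\t".join(row))
```

In this sketch the output keeps the order of the first table, and the extra columns coming from the second table sit to the right of the normalized counts.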

The generated file has more columns than we need for the heatmap. In addition to the columns
with mean normalized counts, there is the log2FC and other information.
We need to remove the extra columns.

----
![](images/tool_small.png)

- Cut tool to extract the columns with the gene ids and normalized counts:

1. Select the `Cut columns from a table` tool
- `Cut columns`: c1-c8
- `Delimited by`: Tab
- `From`: the joined dataset (output of Join two Datasets tool)

We now have a table with 85 lines with STAR (one header line plus the 84 most differentially expressed genes)
and the normalized counts for these genes in the 7 samples.

----
![](images/tool_small.png)

- Plot the heatmap of the normalized counts of these genes for the samples

1. Select the `heatmap2` tool to plot the heatmap:
- `Input should have column headers`: the generated table (output of Cut tool)
- `Data transformation`: **Log2(value+1)** transform my data
- `Enable data clustering`: Yes
- `Labeling columns and rows`: Label columns and not rows
- `Coloring groups`: Blue to white to red

You should obtain something similar to:

![](images/cluster.png)
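
For reference, the `Log2(value+1)` transformation selected above, and the per-gene Z score mentioned in the plan at the top of this page, can be written as follows. This is a small illustrative sketch on made-up numbers. The function names are hypothetical, and using the sample standard deviation (n-1) for the Z score is an assumption of this example.

```python
import math
from statistics import mean, stdev


def log2_plus_one(values):
    """Log2(value + 1): the +1 avoids taking the log of zero counts."""
    return [math.log2(v + 1) for v in values]


def z_scores(values):
    """Z score of each value within one gene (one row of the table).

    Uses the sample standard deviation (assumption of this sketch).
    A gene with identical values everywhere gets a Z score of 0.
    """
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    # Made-up normalized counts for one gene across 7 samples (assumption)
    gene_counts = [0, 1, 3, 7, 15, 31, 63]
    logged = log2_plus_one(gene_counts)
    print(logged)
    print([round(z, 2) for z in z_scores(logged)])
```

The Z score puts every gene on the same scale, so that highly and lowly expressed genes can be compared in one heatmap.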
17 changes: 17 additions & 0 deletions docs/bulk_RNAseq-IOC/GO-intro.md
@@ -0,0 +1,17 @@
![](images/lamp.png)

# Analysis of functional enrichment among the differentially expressed genes

We have extracted genes that are differentially expressed in treated (Pasilla gene-depleted)
samples compared to untreated samples. We would like to know if there are categories of
genes that are enriched among the differentially expressed genes.

Gene Ontology (GO) analysis is widely used to reduce complexity and highlight biological
processes in genome-wide expression studies.

However, standard methods give biased results on RNA-seq data due to over-detection
of differential expression for long and highly-expressed transcripts.

The goseq tool provides methods for performing GO analysis of RNA-seq data,
taking length bias into account. The methods and software used by goseq are equally
applicable to other category based tests of RNA-seq data, such as KEGG pathway analysis.