week 2 (mapping) (#96)
* week 2 (mapping)

* prepack all the IOC

* Lint 01_IOC_RNAseq_week_00.md

* lint week 1 exercices

* Update 13_exercices_week_01_review.md

* week 2: progressing to week 2 exercises

* Update 00_IOC_RNAseq_program.md
drosofff authored Jan 3, 2024
1 parent 7f4ac5f commit beeeebb
Showing 44 changed files with 842 additions and 26 deletions.
6 changes: 2 additions & 4 deletions docs/bulk_RNAseq-IOC/00_IOC_RNAseq_program.md
@@ -1,7 +1,5 @@
In this Interactive Online Companionship which will be held in **November 2023**,
We will train to perform RNAseq analyses of Bulk RNAseq

The final schedule will be published at mid October
In this Interactive Online Companionship, which will be held from **January 8th to March
18th, 2024**, we will train to perform analyses of bulk RNAseq data

### Week 0 - **3-hours Zoom video-conference**
<!-- Faire un schedule sur google sheets -->
7 changes: 3 additions & 4 deletions docs/bulk_RNAseq-IOC/01_IOC_RNAseq_week_00.md
@@ -57,7 +57,7 @@ Here, you'll find all weekly lessons, exercises, instructions, etc...
**Importantly**, you, yes, you, are welcome to propose modifications or fixes to the STARTbio
IOC web pages !
Assuming that during this IOC you will become familiar with the use of GitHub, all you
have to do is click on the pencil icon ![](images/github_pencil.png){width="25"}
have to do is click on the pencil icon ![](images/github_pencil.png){width="100" align="absbottom"}
at the top of each page and propose your modifications in a branch of our GitHub startbio
repository.

@@ -90,9 +90,8 @@ If you have already a Slack account, you can connect to this account using this
[Apple Desktop Slack](https://apps.apple.com/us/app/slack-for-desktop/id803453959?mt=12){:target="_blank"}
| [Windows Desktop Slack](https://slack.com/intl/fr-fr/downloads/windows){:target="_blank"}

Last but not least, Slack is not an option for this IOC !

We will be extremely reluctant to communicate by email with you about this IOC.
Last but not least, Slack is not optional: we will be extremely reluctant to communicate
by email with you about this IOC.

Indeed, emails capture information very poorly, because very often the subject headings
are poorly chosen (or not chosen at all...), conversations by email deal with heterogeneous
6 changes: 3 additions & 3 deletions docs/bulk_RNAseq-IOC/11_uploads.md
@@ -27,7 +27,7 @@ The first way to get input data in your Galaxy account is to transfer them from
==local computer== to ==Galaxy==.

Note that whereas this mode may be convenient if you have _already_ the data on your computer,
it is pretty in inefficient: it implies 2 transfers of data, first from the data
it is pretty inefficient: it implies 2 transfers of data, first from the data
source to your computer, secondly from your computer to Galaxy. When it comes to large files,
as it is the case here with the fastq file collection of PRJNA630433, it matters a lot !

@@ -184,7 +184,7 @@ SRR11688225 ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/025/SRR11688225/SRR11688225.fast
SRR11688226 ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/026/SRR11688226/SRR11688226.fastq.gz SAMN14836337 Oc rep3
SRR11688229 ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/029/SRR11688229/SRR11688229.fastq.gz SAMN14836334 Oc rep4
```
f you open your tsv file (change the filename from `filereport_read_run_PRJNA630433_tsv.txt`
If you open your tsv file (change the filename from `filereport_read_run_PRJNA630433_tsv.txt`
to `filereport_read_run_PRJNA630433.tsv`) with your spreadsheet software, it is also easy
to generate three additional tables, which will be useful to you later.

@@ -309,7 +309,7 @@ To finish with this tool, you probably noticed that it is much slower in fetchin
fastq files than the standard Galaxy upload interface. The name of the tool is not totally
appropriate :smile:. However, if someone gives you the list of SRR identifiers directly,
the tool allows you to retrieve them with minimal manipulation and without even interacting
the EBI SRA interface.
with the EBI SRA interface.

## Galaxy data libraries: the ultimate "upload" procedure !

8 changes: 4 additions & 4 deletions docs/bulk_RNAseq-IOC/12_QC.md
@@ -5,16 +5,16 @@
----
![](images/tool_small.png)

1. Create a new history and name it `Quality Control`
1. Create a new history and name it `PRJNA630433 Quality Control`

2. Copy again all fastq.gz files from the data library into this history. You should
have 11 datasets in your history
have 12 datasets in your history

3. Select the `fastqc` tool.

4. In the `Short read data from your current history` menu, select the `multiple datasets` button. ![](images/multiple-datasets.png)

5. Shift-Click to select all 11 datasets
5. Shift-Click to select all 12 datasets

6. Click `Execute`
----
@@ -35,7 +35,7 @@

3. `Type of FastQC output?` : Select `Raw data`

4. `FastQC output` Cmd-Click (discontinuous, multiple selection) the *11* files named
4. `FastQC output` Cmd-Click (discontinuous, multiple selection) the *12* files named
`FastQC on xx: RawData`

5. Click `Execute`
11 changes: 11 additions & 0 deletions docs/bulk_RNAseq-IOC/13_exercices_week_01_review.md
@@ -0,0 +1,11 @@
## Issues with Galaxy uploads ?
- [x] Upload of a GTF local file ?
- [x] Upload by URL using the paste/fetch interface ?
- [x] Upload using the Galaxy tool "==Faster Download and Extract Reads in FASTQ format from NCBI SRA==" ?
- [x] Using the Galaxy data library "Libraries / IOC_bulk_RNAseq / PRJNA630433 / FASTQ files" ?

## Issues with Quality Control ?
- [x] Using FastQC tool ?
- [x] Using MultiQC tool ?

## Did you try importing datasets from a data library as a collection ?
35 changes: 35 additions & 0 deletions docs/bulk_RNAseq-IOC/14-1_aligners.md
@@ -0,0 +1,35 @@
## Alignment software

The main alignment programs are currently:

- BWA
- Bowtie
- STAR

BWA and Bowtie are based on the Burrows-Wheeler Transform (BWT), whereas STAR relies on
uncompressed suffix arrays. In all cases, this implies building a genome index in which
the genome is recoded (with the BWT for BWA and Bowtie), ensuring very fast read alignments.
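
To make the idea of such an index more concrete, here is a minimal Python sketch, not the actual implementation of BWA or Bowtie. It builds the Burrows-Wheeler transform of a toy sequence by naively sorting its suffixes, then counts the exact occurrences of a short read with the classical backward search. The function names `bwt` and `count_matches` are hypothetical, and real aligners rely on much more compact data structures.

```python
def bwt(text: str) -> str:
    """Return the Burrows-Wheeler transform of `text` (naive construction)."""
    text += "$"  # sentinel character, sorts before A, C, G and T
    # Suffix array: start positions of all suffixes, in lexicographic order
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix
    return "".join(text[i - 1] for i in suffix_array)


def count_matches(bwt_text: str, pattern: str) -> int:
    """Count exact occurrences of `pattern` using backward search on the BWT."""
    # first_pos[c]: number of characters in the text strictly smaller than c
    first_pos = {}
    total = 0
    for char in sorted(set(bwt_text)):
        first_pos[char] = total
        total += bwt_text.count(char)

    def rank(char: str, i: int) -> int:
        # Occurrences of `char` in bwt_text[:i] (precomputed in real indexes)
        return bwt_text[:i].count(char)

    low, high = 0, len(bwt_text)
    for char in reversed(pattern):  # the pattern is read from its last base
        if char not in first_pos:
            return 0
        low = first_pos[char] + rank(char, low)
        high = first_pos[char] + rank(char, high)
        if low >= high:
            return 0
    return high - low


toy_genome = "GATTACAGATTA"
index = bwt(toy_genome)
print(index, count_matches(index, "ATTA"))
```

The search cost depends on the length of the read, not on the size of the genome, which is why the index has to be built only once and can then be reused for millions of reads.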

All these aligners are CPU- and IO-demanding. In contrast, they are usually not demanding in
RAM, with the notable exception of STAR, which needs a large amount of RAM for index
building as well as for alignment.

Aligners take FASTQ (or FASTQ.gz) files as well as an appropriately built genome
reference index as inputs.

They return BAM files, which are compressed binary versions of SAM (Sequence Alignment/Map) files.

The SAM format is really at the heart of RNAseq analyses, because it contains ==all== the
information needed to profile gene expressions from sequencing datasets.

==**Therefore, we highly recommend** that you take a few hours to look at all the details of
the SAM format==, which can be found in the [hts-specs GitHub repository](https://github.com/samtools/hts-specs).
You can start with the [Sequence Alignment/Map format specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf),
and also have a closer look at the
[Sequence Alignment/Map optional fields specification](https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf).
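
As a first contact with the format, here is a small Python sketch that splits one SAM alignment line into its mandatory fields and optional tags. The `SamRecord` class, the helper functions and the example read are made up for illustration; for real work, a dedicated library or `samtools` is a better choice.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flag describing the alignment
    rname: str  # reference sequence (chromosome) name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # CIGAR string
    seq: str    # read sequence
    qual: str   # base qualities
    tags: dict  # optional fields, e.g. {"NH": 1}


def parse_sam_line(line: str) -> SamRecord:
    """Parse one tab-separated SAM alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:  # optional fields look like TAG:TYPE:VALUE
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10], tags=tags,
    )


def is_reverse(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)  # 0x10: read mapped to the reverse strand


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)  # 0x4: read is unmapped


# Hypothetical alignment line, for illustration only
example = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1"
record = parse_sam_line(example)
print(record.rname, record.pos, is_reverse(record), record.tags)
```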

## Pseudo-aligners

Other tools instead operate in a pseudo-alignment mode based on graphs of k-mers.

These include [Kallisto](https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-8.html)
and [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html).
96 changes: 96 additions & 0 deletions docs/bulk_RNAseq-IOC/14_reference_genomes.md
@@ -0,0 +1,96 @@
## Reference Genomes

- Fasta format

- Assembly version, generally associated with a number and a date of assembly

- The same assembly may be provided by various organisations (Genome Reference Consortium, Ensembl, NCBI, UCSC, etc.)

This will be the same DNA sequence but formats may differ:

- by the name of the chromosomes (chr1, 1, NC_000001.11, ...)
- by the presence (or the absence) of unmapped contigs and haplotypes

## Example 1: human genome

- GRCh37/hg19 - Feb 2009
- GRCh38/hg38 - Dec 2013
- T2T-CHM13v2.0/hs1 - Jan 2022 (repeat ++)

These various versions (or "releases") may in addition contain

- alternative haplotypes of chromosomal regions (HLA, HBV inserts, etc.)
- unmapped contigs (regions which are significant assemblies of reads, but are not assigned to a specific chromosome)

## Example 2: mouse genome

<table>
<tr>
<td><strong>Release name</strong>
</td>
<td><strong>Date of release</strong>
</td>
<td><strong>Equivalent UCSC version</strong>
</td>
</tr>
<tr>
<td>GRCm39
</td>
<td>June 2020
</td>
<td>mm39
</td>
</tr>
<tr>
<td>GRCm38
</td>
<td>Dec 2011
</td>
<td>mm10
</td>
</tr>
<tr>
<td>NCBI Build 37
</td>
<td>Jul 2007
</td>
<td>mm9
</td>
</tr>
<tr>
<td>NCBI Build 36
</td>
<td>Feb 2006
</td>
<td>mm8
</td>
</tr>
<tr>
<td>NCBI Build 35
</td>
<td>Aug 2005
</td>
<td>mm7
</td>
</tr>
<tr>
<td>NCBI Build 34
</td>
<td>Mar 2005
</td>
<td>mm6
</td>
</tr>
</table>

## Annotations

It is important to note that genome annotations (GTF, GFF, etc.), although generally
equivalent, are strictly linked to their genome version because they refer to the DNA
sequences using the format of the release. This is why a GTF annotation file downloaded
from Ensembl is not interchangeable with a GTF annotation file from the UCSC or from another
organisation.
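
A quick way to detect such a mismatch is to compare the sequence names used in the FASTA file with those used in the first column of the GTF file. The following Python sketch does this on tiny made-up examples; the function names are hypothetical and the toy data only mimic the UCSC (`chr1`) and Ensembl (`1`) naming styles.

```python
def fasta_sequence_names(fasta_lines):
    """Collect sequence names from FASTA header lines (text after '>')."""
    return {line[1:].split()[0] for line in fasta_lines if line.startswith(">")}


def gtf_sequence_names(gtf_lines):
    """Collect sequence names from the first column of GTF records."""
    return {
        line.split("\t", 1)[0]
        for line in gtf_lines
        if line.strip() and not line.startswith("#")  # skip blank and comment lines
    }


def missing_in_genome(fasta_lines, gtf_lines):
    """Return the GTF sequence names that are absent from the FASTA file."""
    return sorted(gtf_sequence_names(gtf_lines) - fasta_sequence_names(fasta_lines))


# Toy data, for illustration only
ucsc_style_fasta = [">chr1 toy sequence", "ACGTACGT", ">chr2", "GGCCGGCC"]
ensembl_style_gtf = [
    "#!genome-build toy",
    '1\tensembl\tgene\t1\t8\t.\t+\t.\tgene_id "g1";',
    '2\tensembl\tgene\t1\t8\t.\t-\t.\tgene_id "g2";',
]

print(missing_in_genome(ucsc_style_fasta, ensembl_style_gtf))
```

With real files, the same functions can be given open file handles instead of lists. An empty result means that every annotated sequence name exists in the genome; it does not prove that the two files come from the same assembly version.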

Moreover, since genome annotations may be considered as genome metadata (data on data), it is
normal and expected that genome annotation versions are different from genome versions and
that they are released at a faster pace.
36 changes: 36 additions & 0 deletions docs/bulk_RNAseq-IOC/15_splice_aware_mapping.md
@@ -0,0 +1,36 @@
## Splice-Aware Aligners

For RNAseq analysis, it is common to speak of "Splice-Aware" aligners.

This is in particular
mandatory if you work with a eukaryotic organism, where mature messenger RNAs are made of
joined exons coming from genome regions separated by introns. Indeed, in this situation,
mRNA-derived sequencing reads may be split between distant genomic regions and the distance
between two paired reads may be much higher than expected.

Actually, splice-aware aligners are essentially standard aligners wrapped in additional code
to take into account split or distant paired alignments.

Importantly, if you are working with a model organism with available genome annotations,
splice-aware aligners will heavily rely on these annotations. Therefore, splice-aware
aligners will most of the time work with GTF (or GFF3) input files, in addition to the
fastq files and the genome reference index.

However, if your organism of interest is not a model organism, splice-aware aligners are
still useful, since they will reconstruct de novo the exon-exon junctions identified in the
sequencing reads. Indeed, they have often been used to discover new mRNA isoforms !
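
In the SAM/BAM output, a spliced alignment is reported with an `N` operation in the CIGAR string, which stands for a skipped region of the reference (typically an intron). The Python sketch below, with a hypothetical function name and made-up coordinates, shows how the genomic blocks covered by a read can be recovered from its mapping position and CIGAR string.

```python
import re

# One CIGAR operation: a length followed by an operation code
CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_blocks(pos: int, cigar: str) -> list:
    """Return the reference blocks (1-based, inclusive) covered by a read.

    Blocks are split at each N operation (skipped region, e.g. an intron).
    """
    blocks = []
    block_start = pos
    cursor = pos
    for length, operation in CIGAR_OPERATION.findall(cigar):
        length = int(length)
        if operation in "MD=X":  # these operations consume the reference
            cursor += length
        elif operation == "N":   # intron: close the current block, jump ahead
            if cursor > block_start:
                blocks.append((block_start, cursor - 1))
            cursor += length
            block_start = cursor
        # I, S, H and P do not consume the reference and are ignored here
    if cursor > block_start:
        blocks.append((block_start, cursor - 1))
    return blocks


# A 100-nt read split over two exons separated by a 1000-nt intron
print(aligned_blocks(100, "50M1000N50M"))
```

A read with a CIGAR string such as `100M` gives a single block, whereas each `N` operation adds one block, that is, one exon-exon junction supported by the read.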

![](images/splice_aware_alignment.png)

## Software

Historically, the first popular splice-aware aligners were TopHat and TopHat2,
based on the Bowtie and Bowtie2 aligners, respectively.

Nowadays, the two popular splice-aware aligners are

- HISAT2 (based on bowtie2)
- STAR (with its own aligner implementation).
Note that in the case of STAR, you can build an index that already incorporates the
GTF information. It is also possible to provide the GTF information at runtime, during the
STAR alignment.
@@ -4,22 +4,25 @@
In practice, with Illumina RNA-seq protocols you will most likely deal with either:

- Unstranded RNAseq data

- Stranded RNA-seq data produced with - kits and dUTP tagging (ISR)

This information should be provided with your FASTQ files, ask your sequencing facility!
This information, here called "the strandness of the libraries" should be provided by your
sequencing platform along with your FASTQ files. If you cannot find the information, ask
for it, it is always better than guessing ! If you are working on published data, the
strandness of the libraries can often be deduced from the kit reference for library preparation.

If not, try to find it on the site where you downloaded the data or
in the corresponding publication.
In the absence of strandness information, it is still possible to make a (very) good guess
using a tool called `Infer Experiment` from the `RSeQC` tool suite.

Another option is to estimate these parameters with a tool called `Infer Experiment` from
the `RSeQC` tool suite. This tool takes the output of your mappings (BAM files), selects
This tool takes the output of your mappings (BAM files), selects
a subsample of your reads and compares their genome coordinates and strands with those of
the reference gene model (from an annotation file).

Based on the strand of the genes, it can gauge whether sequencing is strand-specific,
and if so, how reads are stranded.
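
The principle can be illustrated with a few lines of Python. This is only a simplified sketch for single-end reads, not the RSeQC implementation: the function name, the input format (pairs of read strand and gene strand) and the 0.8 threshold are assumptions made for the example.

```python
def infer_strandedness(strand_pairs, threshold=0.8):
    """Guess the library type from (read_strand, gene_strand) pairs.

    `threshold` is an arbitrary cut-off chosen for this illustration.
    """
    total = len(strand_pairs)
    if total == 0:
        return "undetermined"
    same = sum(1 for read_strand, gene_strand in strand_pairs
               if read_strand == gene_strand)
    fraction_same = same / total
    if fraction_same >= threshold:
        return "forward"     # reads mostly on the same strand as their gene
    if fraction_same <= 1 - threshold:
        return "reverse"     # reads mostly on the opposite strand (dUTP-like)
    return "unstranded"      # reads evenly distributed on both strands


# Toy subsample: 90 reads antisense to their gene, 10 reads sense
subsample = [("+", "-")] * 45 + [("-", "+")] * 45 + [("+", "+")] * 10
print(infer_strandedness(subsample))
```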



## Use of `Infer Experiment` tool

----
File renamed without changes.
File renamed without changes.
File renamed without changes.
30 changes: 30 additions & 0 deletions docs/bulk_RNAseq-IOC/19_exercices_week_02_review.md
@@ -0,0 +1,30 @@
## Issues with Slack ?

## Issues with GitHub ?
- [x] Does everyone have a GitHub ID ?
- [x] Was everyone able to create a readme file and make a pull request to the repository
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
- [x] Was everyone able to retrieve the galaxy workflow file (the one that you have
generated during the first online meeting, with an extension .ga) and to add it in
the repository
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?

## Data upload to PSILO, then to Galaxy from PSILO
- [x] Did everyone upload the necessary data to their
  [PSILO account](https://psilo.sorbonne-universite.fr) ?
- [x] Did everyone succeed in creating direct download links ?
- [x] Did everyone succeed in transferring their PSILO data into a Galaxy history `Input dataset`
  in their Galaxy account ?

## Issues following the Galaxy training ?

[Training on collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)

- Check whether `Relabel identifiers` tool is understood

- Check whether `Extract element identifiers` tool is understood. Is the output dataset
from this tool uploaded in the appropriate GitHub folder ?

## Check input datasets histories of the participants

... and their ability to create appropriate collections for the analysis
File renamed without changes.
File renamed without changes.