* week 2 (mapping) * prepack all the IOC * Lint 01_IOC_RNAseq_week_00.md * lint week 1 exercices * Update 13_exercices_week_01_review.md * week 2: progressing to week 2 exercises * Update 00_IOC_RNAseq_program.md
Showing 44 changed files with 842 additions and 26 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
## Issues with Galaxy uploads ? | ||
- [x] Upload of a local GTF file ? | ||
- [x] Upload by URL using the paste/fetch interface ? | ||
- [x] Upload using the Galaxy tool "==Faster Download and Extract Reads in FASTQ format from NCBI SRA==" ? | ||
- [x] Using the Galaxy data library "Libraries / IOC_bulk_RNAseq / PRJNA630433 / FASTQ files" ? | ||
|
||
## Issues with Quality Control ? | ||
- [x] Using FastQC tool ? | ||
- [x] Using MultiQC tool ? | ||
|
||
## Did you try importing datasets from a data library as a collection ? |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
## Alignment software | ||
|
||
The main alignment programs currently are: | ||
|
||
- BWA | ||
- Bowtie | ||
- STAR | ||
|
||
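BWA and Bowtie are based on the Burrows-Wheeler Transform (BWT), whereas STAR relies on an | ||
uncompressed suffix array. In both cases, a genome index must be built first, in which the | ||
genome is recoded (using the BWT for BWA and Bowtie), ensuring very fast read alignments. | ||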
|
||
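To make the idea of "recoding" a genome more concrete, here is a minimal sketch of the Burrows-Wheeler Transform on a toy string. It is a naive illustration (it sorts all rotations), not how BWA or Bowtie actually build their indexes; the function names and the `$` sentinel are choices made for this example only.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Return the Burrows-Wheeler Transform of `text` (naive version)."""
    if sentinel in text:
        raise ValueError("the sentinel must not occur in the text")
    s = text + sentinel
    # Build every rotation of the string, sort them lexicographically,
    # and keep the last character of each sorted rotation.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Rebuild the original text from its transform (naive version)."""
    table = [""] * len(transformed)
    # Repeatedly prepend the transformed column and re-sort: after
    # len(transformed) rounds the table holds all sorted rotations.
    for _ in range(len(transformed)):
        table = sorted(c + row for c, row in zip(transformed, table))
    # The original text is the rotation that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("GATTACA")))  # GATTACA
```

The transform is reversible and tends to group identical characters together, which is what makes compact, searchable genome indexes possible.
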
BWT-based aligners are CPU- and IO-demanding. In contrast, they are usually not demanding in | ||
RAM. STAR is the exception: its suffix-array index requires a lot of RAM, both for index building and for alignment. | ||
|
||
Aligners take as inputs FASTQ (or FASTQ.gz) files as well as an appropriately built | ||
genome reference index. | ||
|
||
They return alignments in SAM (Sequence Alignment/Map) format, usually stored as BAM files, the compressed binary version of SAM. | ||
|
||
The SAM format is really at the heart of RNAseq analyses, because it contains ==all== the | ||
information needed to profile gene expression from sequencing datasets. | ||
|
||
==**Therefore, we highly recommend** taking a few hours to look at all the details of the SAM | ||
format==, which can be found in the [hts-specs GitHub repository](https://github.com/samtools/hts-specs). | ||
You can start with [Sequence Alignment/Map format specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf), | ||
and also have a closer look at | ||
[Sequence Alignment/Map optional fields specification](https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf) | ||
|
||
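As a first contact with the format, the sketch below splits one SAM alignment line into its 11 mandatory fields and its optional `TAG:TYPE:VALUE` fields, and decodes three bits of the FLAG. The field names and flag bits come from the SAM specification; the example read (`read_1`, its coordinates and sequence) is invented for illustration. For real analyses, a dedicated library such as pysam or the samtools command line is a better choice.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> dict:
    """Parse one alignment line (not a header line starting with '@')."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line has at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Optional fields follow the 11 mandatory ones, as TAG:TYPE:VALUE.
    record["tags"] = {}
    for field in columns[len(SAM_FIELDS):]:
        tag, type_code, value = field.split(":", 2)
        record["tags"][tag] = int(value) if type_code == "i" else value
    # Decode a few bits of the bitwise FLAG.
    flag = record["FLAG"]
    record["is_paired"] = bool(flag & 0x1)
    record["is_unmapped"] = bool(flag & 0x4)
    record["is_reverse"] = bool(flag & 0x10)
    return record


# Hypothetical alignment line, made up for this example.
EXAMPLE = "read_1\t16\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1"
record = parse_sam_line(EXAMPLE)
print(record["RNAME"], record["POS"], record["CIGAR"], record["is_reverse"])
```
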
## Pseudo-aligners | ||
|
||
Other tools operate instead in a pseudo-alignment mode based on graphs of k-mers. | ||
|
||
These include [Kallisto](https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-8.html) | ||
and [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html) |
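
The principle can be illustrated with a toy sketch: index every k-mer of each transcript, then report the transcripts compatible with all k-mers of a read. This is a deliberately simplified assumption-laden illustration (a plain dictionary instead of a k-mer graph, no quantification step) and not the actual Kallisto or Salmon algorithm; transcript names and sequences are invented.

```python
from collections import defaultdict


def kmers(sequence: str, k: int) -> list:
    """Return all overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_kmer_index(transcripts: dict, k: int) -> dict:
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, sequence in transcripts.items():
        for kmer in kmers(sequence, k):
            index[kmer].add(name)
    return index


def pseudo_align(read: str, index: dict, k: int) -> set:
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()  # at least one k-mer matches no common transcript
    return compatible or set()


# Hypothetical transcripts sharing their first 7 bases.
transcripts = {"tx_A": "ACGTACGTTT", "tx_B": "ACGTACGAAA"}
index = build_kmer_index(transcripts, k=5)
print(pseudo_align("ACGTACG", index, k=5))  # compatible with both transcripts
print(pseudo_align("TACGTTT", index, k=5))  # compatible with tx_A only
```

Note that no base-by-base alignment is computed: the result is only the set of transcripts a read could originate from, which is sufficient for expression quantification.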
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
## Reference Genomes | ||
|
||
- Fasta format | ||
|
||
- Assembly version, generally associated with a number and a date of assembly | ||
|
||
- The same assembly may be provided by various organisations (Genome Reference Consortium, Ensembl, NCBI, UCSC, etc.) | ||
|
||
This will be the same DNA sequence but formats may differ: | ||
|
||
- by the name of the chromosomes (chr1, 1, NC_000001.11, ...) | ||
- by the presence (or the absence) of unmapped contigs and haplotypes | ||
|
||
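The chromosome naming difference is the one that most often causes trouble in practice. The sketch below converts between the UCSC style (`chr1`, `chrM`) and the Ensembl style (`1`, `MT`) for primary chromosomes only. The function names are hypothetical, and unplaced contigs, haplotypes or RefSeq accessions such as `NC_000001.11` are not handled: those need a lookup table provided with the assembly.

```python
def ucsc_to_ensembl(name: str) -> str:
    """Convert a UCSC-style primary chromosome name to Ensembl style."""
    if name == "chrM":
        return "MT"  # the mitochondrial genome is named differently
    if name.startswith("chr"):
        return name[3:]
    return name  # already without prefix: returned unchanged


def ensembl_to_ucsc(name: str) -> str:
    """Convert an Ensembl-style primary chromosome name to UCSC style."""
    if name == "MT":
        return "chrM"
    if name.startswith("chr"):
        return name  # already UCSC style: returned unchanged
    return "chr" + name


print([ucsc_to_ensembl(n) for n in ["chr1", "chrX", "chrM"]])
print([ensembl_to_ucsc(n) for n in ["1", "X", "MT"]])
```
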
## Example 1: human genome | ||
|
||
- GRCh37/hg19 - Feb 2009 | ||
- GRCh38/hg38 - Dec 2013 | ||
- T2T-CHM13v2.0/hs1 - Jan 2022 (repeats ++); there is no GRCh39 release to date | ||
|
||
These various versions (or "releases") may in addition contain: | ||
|
||
- chromosomal regions provided as alternate "haplotypes" (HLA, HBV inserts, etc.) | ||
- unmapped contigs (regions that are significant assemblies of reads but are not assigned to a specific chromosome) | ||
|
||
## Example 2: mouse genome | ||
|
||
<table> | ||
<tr> | ||
<td><strong>Release name</strong> | ||
</td> | ||
<td><strong>Date of release</strong> | ||
</td> | ||
<td><strong>Equivalent UCSC version</strong> | ||
</td> | ||
</tr> | ||
<tr> | ||
<td>GRCm39 | ||
</td> | ||
<td>June 2020 | ||
</td> | ||
<td>mm39 | ||
</td> | ||
</tr> | ||
<tr> | ||
<td>GRCm38 | ||
</td> | ||
<td>Dec 2011 | ||
</td> | ||
<td>mm10 | ||
</td> | ||
</tr> | ||
<tr> | ||
<td>NCBI Build 37 | ||
</td> | ||
<td>Jul 2007 | ||
</td> | ||
<td>mm9 | ||
</td> | ||
</tr> | ||
<tr> | ||
<td>NCBI Build 36 | ||
</td> | ||
<td>Feb 2006 | ||
</td> | ||
<td>mm8 | ||
</td> | ||
</tr> | ||
<tr> | ||
<td>NCBI Build 35 | ||
</td> | ||
<td>Aug 2005 | ||
</td> | ||
<td>mm7 | ||
</td> | ||
</tr> | ||
<tr> | ||
<td>NCBI Build 34 | ||
</td> | ||
<td>Mar 2005 | ||
</td> | ||
<td>mm6 | ||
</td> | ||
</tr> | ||
</table> | ||
|
||
## Annotations | ||
|
||
It is important to note that annotations of genomes (GTF, GFF, etc.), although generally | ||
equivalent, are strictly linked to their genome version because they refer to the DNA | ||
sequences using the format of the release. This is why a GTF annotation file downloaded | ||
from Ensembl is not interchangeable with a GTF annotation file from UCSC or from another | ||
organisation. | ||
|
||
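A quick sanity check before any alignment is to verify that the sequence names used in the GTF file all exist in the FASTA file of the genome. The sketch below does this on small in-memory strings; the helper names and the toy data are invented for the example, and real files would be read line by line from disk instead.

```python
def fasta_sequence_names(fasta_text: str) -> set:
    """Collect sequence names: first word after '>' on each header line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def gtf_sequence_names(gtf_text: str) -> set:
    """Collect the first column (seqname) of each GTF feature line."""
    names = set()
    for line in gtf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        names.add(line.split("\t")[0])
    return names


def unmatched_gtf_names(fasta_text: str, gtf_text: str) -> set:
    """Return GTF sequence names that are absent from the FASTA file."""
    return gtf_sequence_names(gtf_text) - fasta_sequence_names(fasta_text)


# Toy data: the GTF mixes the "1" and "chr2" naming styles on purpose.
FASTA = ">chr1 toy chromosome\nACGT\n>chr2\nGGCC\n"
GTF = (
    "#!genome-build toy\n"
    '1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n'
    'chr2\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g2";\n'
)
print(unmatched_gtf_names(FASTA, GTF))  # names found in the GTF only
```

A non-empty result means that the annotation and the genome do not come from the same source or release, and that features on those sequences would be silently ignored or cause errors.
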
Moreover, since genome annotations may be considered as genome metadata (data on data), it is | ||
normal and expected that genome annotation versions are different from genome versions and | ||
that they are released at a faster pace. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
## Splice-Aware Aligners | ||
|
||
For RNAseq analysis, it is common to speak of "Splice-Aware" aligners. | ||
|
||
Such aligners are mandatory in particular | ||
if you work with a eukaryotic organism, where mature messenger RNAs are made of | ||
joined exons coming from genome regions separated by introns. Indeed, in this situation, | ||
mRNA-derived sequencing reads may be split between distant genomic regions, and the distance | ||
between two paired reads may be much greater than expected. | ||
|
||
In practice, splice-aware aligners are index-based aligners (BWT-based for most of them) extended with additional code that takes | ||
split or distant paired alignments into account. | ||
|
||
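In SAM output, a read split across an intron is reported with an `N` operation in its CIGAR string (a region skipped on the reference). The sketch below, with hypothetical function names and an invented alignment, locates the introns implied by a CIGAR string and the last reference position covered by the read. It relies on the SAM specification rule that `M`, `D`, `N`, `=` and `X` consume reference positions.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    operations = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Re-assemble the string to detect characters the regex skipped.
    if "".join(f"{n}{op}" for n, op in operations) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return operations


def introns(pos: int, cigar: str) -> list:
    """Return 1-based inclusive (start, end) intervals of skipped regions."""
    found = []
    reference_position = pos
    for length, op in parse_cigar(cigar):
        if op == "N":
            found.append((reference_position, reference_position + length - 1))
        if op in REFERENCE_CONSUMING:
            reference_position += length
    return found


def reference_end(pos: int, cigar: str) -> int:
    """Return the last reference position covered by the alignment."""
    span = sum(n for n, op in parse_cigar(cigar) if op in REFERENCE_CONSUMING)
    return pos + span - 1


# Invented example: a 100-base read spliced over a 1000-base intron.
print(introns(100, "50M1000N50M"))        # [(150, 1149)]
print(reference_end(100, "50M1000N50M"))  # 1199
```
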
Importantly, if you are working with a model organism with available genome annotations, | ||
splice-aware aligners will heavily rely on these annotations. Therefore, splice-aware | ||
aligners will most of the time work with GTF (or GFF3) input files, in addition to the | ||
FASTQ files and the genome reference index. | ||
|
||
However, if your working organism is not a model organism, splice-aware aligners are still | ||
useful, since they will reconstruct de novo the exon-exon junctions identified in the | ||
sequencing reads. Indeed, they have often been used to discover new mRNA isoforms ! | ||
|
||
![](images/splice_aware_alignment.png) | ||
|
||
## Software | ||
|
||
Historically, the first popular splice-aware aligners were TopHat and TopHat2, | ||
based on the Bowtie and Bowtie2 aligners, respectively. | ||
|
||
Nowadays, the two popular splice-aware aligners are: | ||
|
||
- HISAT2 (based on Bowtie2) | ||
- STAR (with its own aligner implementation). | ||
Note that in the case of STAR, you can build an index that already incorporates | ||
GTF information. It is also possible to provide GTF information at the runtime of the | ||
STAR alignment. |
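
The two ways of giving GTF information to STAR can be sketched as command lines. The helper below only builds and prints the commands, it does not run them. The STAR options used (`--runMode genomeGenerate`, `--genomeDir`, `--genomeFastaFiles`, `--sjdbGTFfile`, `--sjdbOverhang`, `--readFilesIn`, `--readFilesCommand`, `--outSAMtype`, `--outFileNamePrefix`, `--runThreadN`) are documented in the STAR manual; all file and directory names, the read length and the thread count are placeholders chosen for this example. In Galaxy, the same choices appear as options of the STAR tool form.

```python
import shlex


def star_index_command(genome_dir, fasta, gtf, read_length, threads=4):
    """Build the command that generates a STAR index incorporating a GTF."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,
        # The STAR manual suggests read length minus 1 for this option.
        "--sjdbOverhang", str(read_length - 1),
    ]


def star_align_command(genome_dir, fastq_1, fastq_2, prefix, gtf=None, threads=4):
    """Build a paired-end alignment command, optionally passing the GTF at runtime."""
    command = [
        "STAR", "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq_1, fastq_2,
        "--readFilesCommand", "zcat",  # reads are assumed to be gzipped
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,
    ]
    if gtf is not None:
        command += ["--sjdbGTFfile", gtf]
    return command


# Placeholder paths, for illustration only.
print(shlex.join(star_index_command("star_index", "genome.fa", "genes.gtf", 100)))
print(shlex.join(star_align_command("star_index", "s1_R1.fastq.gz",
                                    "s1_R2.fastq.gz", "s1_", gtf="genes.gtf")))
```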
File renamed without changes.
File renamed without changes.
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
## Issues with Slack ? | ||
|
||
## Issues with GitHub ? | ||
- [x] Does everyone have a GitHub ID ? | ||
- [x] Was everyone able to create a readme file and make a pull request to the repository | ||
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ? | ||
- [x] Was everyone able to retrieve the Galaxy workflow file (the one that you | ||
generated during the first online meeting, with a .ga extension) and to add it to | ||
the repository | ||
[ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ? | ||
|
||
## Data upload to PSILO, then to Galaxy from PSILO | ||
- [x] Did everyone upload the necessary data to their | ||
[PSILO account](https://psilo.sorbonne-universite.fr) ? | ||
- [x] Did everyone succeed in creating direct download links ? | ||
- [x] Did everyone succeed in transferring their PSILO data into a Galaxy history `Input dataset` | ||
in their Galaxy account ? | ||
|
||
## Issues following the Galaxy training ? | ||
|
||
[Training on collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html) | ||
|
||
- Check whether the `Relabel identifiers` tool is understood | ||
|
||
- Check whether the `Extract element identifiers` tool is understood. Is the output dataset | ||
from this tool uploaded to the appropriate GitHub folder ? | ||
|
||
## Check the input dataset histories of the participants | ||
|
||
... and their ability to create appropriate collections for the analysis |
File renamed without changes.
File renamed without changes.