week 2 (mapping) (#96)

* week 2 (mapping)
* prepack all the IOC
* Lint 01_IOC_RNAseq_week_00.md
* lint week 1 exercices
* Update 13_exercices_week_01_review.md
* week 2: progressing to week 2 exercises
* Update 00_IOC_RNAseq_program.md
ARTbio · Jan 3, 2024 · beeeebb
1 parent 7f4ac5f
commit beeeebb
Showing 44 changed files with 842 additions and 26 deletions.
diff --git a/docs/bulk_RNAseq-IOC/00_IOC_RNAseq_program.md b/docs/bulk_RNAseq-IOC/00_IOC_RNAseq_program.md
@@ -1,7 +1,5 @@
-In this Interactive Online Companionship which will be held in **November 2023**,
-We will train to perform RNAseq analyses of Bulk RNAseq
-
-The final schedule will be published at mid October
+In this Interactive Online Companionship, which will be held from **January 8th to March
+18th, 2024**, we will train to perform analyses of bulk RNAseq data.
 
 ### Week 0 - **3-hours Zoom video-conference**
 <!-- Faire un schedule sur google sheets -->

diff --git a/docs/bulk_RNAseq-IOC/01_IOC_RNAseq_week_00.md b/docs/bulk_RNAseq-IOC/01_IOC_RNAseq_week_00.md
@@ -57,7 +57,7 @@ Here, you'll find all weekly lessons, exercises, instructions, etc...
 **Importantly**, you, yes, you, are welcome to propose modifications or fixes to the STARTbio
 IOC web pages !
 Assuming that during this IOC you will become familiar with the use of GitHub, all you
-have to do is click on the pencil icon ![](images/github_pencil.png){width="25"}
+have to do is click on the pencil icon ![](images/github_pencil.png){width="100" align="absbottom"}
 at the top of each page and propose your modifications in a branch of our GitHub startbio
 repository.
 
@@ -90,9 +90,8 @@ If you have already a Slack account, you can connect to this account using this
     [Apple Desktop Slack](https://apps.apple.com/us/app/slack-for-desktop/id803453959?mt=12){:target="_blank"}
     | [Windows Desktop Slack](https://slack.com/intl/fr-fr/downloads/windows){:target="_blank"}
 
-Last but not least, Slack is not an option for this IOC !
-
-We will be extremely reluctant to communicate by email with you about this IOC.
+Last but not least, Slack is not optional: we will be extremely reluctant to communicate
+by email with you about this IOC.
 
 Indeed, emails capture information very poorly, because very often the subject headings
 are poorly chosen (or not chosen at all...), conversations by email deal with heterogeneous

diff --git a/docs/bulk_RNAseq-IOC/11_uploads.md b/docs/bulk_RNAseq-IOC/11_uploads.md
@@ -27,7 +27,7 @@ The first way to get input data in your Galaxy account is to transfer them from
 ==local computer== to ==Galaxy==.
 
 Note that whereas this mode may be convenient if you have _already_ the data on your computer,
-it is pretty in inefficient: it implies 2 transfers of data, first from the data
+it is pretty inefficient: it implies 2 transfers of data, first from the data
 source to your computer, secondly from your computer to Galaxy. When it comes to large files,
 as it is the case here with the fastq file collection of PRJNA630433, it matters a lot !
 
@@ -184,7 +184,7 @@ SRR11688225	ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/025/SRR11688225/SRR11688225.fast
 SRR11688226	ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/026/SRR11688226/SRR11688226.fastq.gz	SAMN14836337	Oc rep3
 SRR11688229	ftp.sra.ebi.ac.uk/vol1/fastq/SRR116/029/SRR11688229/SRR11688229.fastq.gz	SAMN14836334	Oc rep4
 ```
-f you open your tsv file (change the filename from `filereport_read_run_PRJNA630433_tsv.txt`
+If you open your tsv file (change the filename from `filereport_read_run_PRJNA630433_tsv.txt`
 to `filereport_read_run_PRJNA630433.tsv`) with your spreadsheet software, it is also easy
 to generate three additional tables, which will be useful to you later.
 
@@ -309,7 +309,7 @@ To finish with this tool, you probably noticed that it is much slower in fetchin
 fastq files than the standard Galaxy upload interface. The name of the tool is not totally
 appropriate :smile:. However, if someone gives you the list of SRR identifiers directly,
 the tool allows you to retrieve them with minimal manipulation and without even interacting
-the EBI SRA interface.
+with the EBI SRA interface.
 
 ## Galaxy data libraries: the ultimate "upload" procedure !
 

diff --git a/docs/bulk_RNAseq-IOC/12_QC.md b/docs/bulk_RNAseq-IOC/12_QC.md
@@ -5,16 +5,16 @@
 ----
 ![](images/tool_small.png)
 
-  1. Create a new history and name it `Quality Control`
+  1. Create a new history and name it `PRJNA630433 Quality Control`
 
   2. Copy again all fastq.gz files from the data library into this history. You should
-  have 11 datasets in your history
+  have 12 datasets in your history
 
   3. Select the `fastqc` tool.
 
   4. In the `Short read data from your current history` menu, select the `multiple datasets` button. ![](images/multiple-datasets.png)
 
-  5. Shift-Click to select all 11 datasets
+  5. Shift-Click to select all 12 datasets
 
   6. Click `Execute`
   ----
@@ -35,7 +35,7 @@
 
   3. `Type of FastQC output?` : Select `Raw data`
 
-  4. `FastQC output` Cmd-Click (discontinuous, multiple selection) the *11* files named
+  4. `FastQC output` Cmd-Click (discontinuous, multiple selection) the *12* files named
   `FastQC on xx: RawData`
 
   5. Click `Execute`

diff --git a/docs/bulk_RNAseq-IOC/13_exercices_week_01_review.md b/docs/bulk_RNAseq-IOC/13_exercices_week_01_review.md
@@ -0,0 +1,11 @@
+## Issues with Galaxy uploads ?
+- [x] Upload of a GTF local file ?
+- [x] Upload by URL using the paste/fetch interface ?
+- [x] Upload using the Galaxy tool "==Faster Download and Extract Reads in FASTQ format from NCBI SRA==" ?
+- [x] Using the Galaxy data library "Libraries / IOC_bulk_RNAseq / PRJNA630433 / FASTQ files" ?
+
+## Issues with Quality Control ?
+- [x] Using FastQC tool ? 
+- [x] Using MultiQC tool ?
+
+## Did you experiment with importing datasets from a data library as a collection ?
diff --git a/docs/bulk_RNAseq-IOC/14-1_aligners.md b/docs/bulk_RNAseq-IOC/14-1_aligners.md
@@ -0,0 +1,35 @@
+## Alignment software
+
+The main alignment software packages are currently:
+
+- BWA
+- Bowtie
+- STAR
+
+BWA and Bowtie are based on the Burrows-Wheeler Transform (BWT), whereas STAR relies on
+an uncompressed suffix array. In every case, a genome index must be built beforehand, in
+which the genome is recoded to ensure very fast read alignments.
+
+These aligners are CPU- and IO-demanding. BWT-based aligners are usually not demanding in
+RAM; STAR is the exception, for index building as well as for alignment.
+
+Aligners take FASTQ (FASTQ.gz) files as well as an appropriately built genome reference
+index as inputs.
+
+They return BAM files, which are compressed binary SAM files (Sequence Alignment/Map).
+
+The SAM format is really at the heart of RNAseq analyses, because it contains ==all== the
+information needed to profile gene expression from sequencing datasets.
+
+==**Therefore, we highly recommend** taking a few hours to look at all the details of the SAM
+format==, which can be found in the [hts-specs GitHub repository](https://github.com/samtools/hts-specs).
+You can start with [Sequence Alignment/Map format specification](https://github.com/samtools/hts-specs/blob/master/SAMv1.pdf),
+and also have a closer look at
+[Sequence Alignment/Map optional fields specification](https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf)
+
+## Pseudo-aligners
+
+Other tools operate instead in a pseudo-alignment mode based on graphs of k-mers.
+
+These include [Kallisto](https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-8.html)
+and [Salmon](https://salmon.readthedocs.io/en/latest/salmon.html).
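
To make the SAM format discussed in `14-1_aligners.md` more concrete, here is a minimal Python sketch that splits one alignment line into its 11 mandatory columns and decodes the bitwise FLAG column, following the SAM specification linked above. The function names (`parse_sam_line`, `decode_flag`) and the example read are hypothetical and only serve as illustration; for real analyses, a dedicated library or `samtools` is the safer route.

```python
# Names of the 11 mandatory SAM columns, in the order defined by the SAM specification.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Meaning of each bit of the FLAG column (labels are shortened by us).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line):
    """Split one SAM alignment line (not a header line) into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    # These five columns are integers in the specification.
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # Everything after column 11 is an optional TAG:TYPE:VALUE field.
    record["TAGS"] = fields[11:]
    return record


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Hypothetical single-end read mapped on the reverse strand of chr1.
example = "\t".join(
    ["read1", "16", "chr1", "100", "60", "50M", "*", "0", "0",
     "A" * 50, "I" * 50, "NH:i:1"]
)
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```

A FLAG of 16 has only the `0x10` bit set, which is why this read is reported as mapped on the reverse strand.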
diff --git a/docs/bulk_RNAseq-IOC/14_reference_genomes.md b/docs/bulk_RNAseq-IOC/14_reference_genomes.md
@@ -0,0 +1,96 @@
+## Reference Genomes
+
+- Fasta format
+
+- Assembly version, generally associated with a number and a date of assembly
+
+- The same assembly may be provided by various organisations (Genome Reference Consortium, Ensembl, NCBI, UCSC, etc)
+
+    This will be the same DNA sequence but formats may differ:
+
+    - by the name of the chromosomes (chr1, 1, NC_000001.11, ...)
+    - by the presence (or the absence) of unmapped contigs and haplotypes
+
+## Example 1: human genome
+
+- GRCh37/hg19 - Feb 2009
+- GRCh38/hg38 - Dec 2013
+- T2T-CHM13v2.0/hs1 - Jan 2022 (repeat ++)
+
+These various versions (or "releases") may in addition contain
+
+- chromosomal regions "haplotypes" (HLA, HBV inserts, etc…)
+- unmapped contigs (regions which are significant assemblies of reads, but are not assigned to a specific chromosome)
+
+## Example 2: mouse genome
+
+<table>
+  <tr>
+   <td><strong>Release name</strong>
+   </td>
+   <td><strong>Date of release</strong>
+   </td>
+   <td><strong>Equivalent UCSC version</strong>
+   </td>
+  </tr>
+  <tr>
+   <td>GRCm39
+   </td>
+   <td>June 2020
+   </td>
+   <td>mm39
+   </td>
+  </tr>
+  <tr>
+   <td>GRCm38
+   </td>
+   <td>Dec 2011
+   </td>
+   <td>mm10
+   </td>
+  </tr>
+  <tr>
+   <td>NCBI Build 37
+   </td>
+   <td>Jul 2007
+   </td>
+   <td>mm9
+   </td>
+  </tr>
+  <tr>
+   <td>NCBI Build 36
+   </td>
+   <td>Feb 2006
+   </td>
+   <td>mm8
+   </td>
+  </tr>
+  <tr>
+   <td>NCBI Build 35
+   </td>
+   <td>Aug 2005
+   </td>
+   <td>mm7
+   </td>
+  </tr>
+  <tr>
+   <td>NCBI Build 34
+   </td>
+   <td>Mar 2005
+   </td>
+   <td>mm6
+   </td>
+  </tr>
+</table>
+
+## Annotations
+
+It is important to note that annotations of genomes (GTF, GFF, etc.), although generally
+equivalent, are strictly linked to their genome version because they refer to the DNA
+sequences using the format of the release. This is why a GTF annotation file downloaded
+from Ensembl is not interchangeable with a GTF annotation file from UCSC or from another
+organisation.
+
+Moreover, since genome annotations may be considered as genome metadata (data on data), it is
+normal and expected that genome annotation versions are different from genome versions and
+that they are released at a faster pace.
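
The chromosome naming issue described in `14_reference_genomes.md` can be illustrated with a small Python sketch that rewrites the first column of a GTF line from UCSC-style names (`chr1`, `chrM`) to Ensembl-style names (`1`, `MT`). The function names are hypothetical, and the sketch deliberately covers only primary chromosomes: unplaced contigs and alternative haplotypes follow other naming rules and would need an explicit mapping table, which is not attempted here.

```python
def ucsc_to_ensembl(name):
    """Convert a UCSC-style primary chromosome name to its Ensembl-style name.

    Assumption: only primary chromosomes are handled; any other name is
    returned with a simple 'chr' prefix removal, which may NOT be correct
    for unplaced contigs or alternative haplotypes.
    """
    if name == "chrM":
        # The mitochondrial genome is 'chrM' at UCSC and 'MT' at Ensembl.
        return "MT"
    if name.startswith("chr"):
        return name[3:]
    return name


def rename_gtf_line(line, rename=ucsc_to_ensembl):
    """Apply a renaming function to the sequence name (column 1) of a GTF line."""
    if line.startswith("#"):
        # Comment and header lines are returned untouched.
        return line
    fields = line.split("\t")
    fields[0] = rename(fields[0])
    return "\t".join(fields)


# Hypothetical GTF record, for illustration only.
gtf_line = 'chr1\texample\tgene\t1000\t2000\t.\t+\t.\tgene_id "g1";'
print(rename_gtf_line(gtf_line))
```

Renaming chromosomes only makes sense between annotations of the same assembly: the coordinates themselves are left untouched, so this does not convert an annotation from one genome version to another.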
diff --git a/docs/bulk_RNAseq-IOC/15_splice_aware_mapping.md b/docs/bulk_RNAseq-IOC/15_splice_aware_mapping.md
@@ -0,0 +1,36 @@
+## Splice-Aware Aligners
+
+For RNAseq analysis, it is common to speak of "Splice-Aware" aligners.
+
+This is in particular
+mandatory if you work with a eukaryotic organism where mature messenger RNAs are made of
+joined exons coming from genome regions separated by introns. Indeed, in this situation,
+mRNA-derived sequencing reads may be split between distant genomic regions, and the distance
+between two paired reads may be much higher than expected.
+
+In practice, splice-aware aligners are read aligners (most of them BWT-based) wrapped in additional
+code to take into account split or distant pair alignments.
+
+Importantly, if you are working with a model organism with available genome annotations,
+splice-aware aligners will heavily rely on these annotations. Therefore, splice-aware
+aligners will most of the time work with GTF (or GFF3) input files, in addition to the
+fastq files and the genome reference index.
+
+However, if your working organism is not a model organism, splice-aware aligners are still
+useful, since they will reconstruct de novo the exon-exon junctions identified in the
+sequencing reads. Indeed, they have often been used to discover new mRNA isoforms !
+
+![](images/splice_aware_alignment.png)
+
+## Software
+
+Historically, the first popular splice-aware aligners were TopHat and TopHat2,
+based on the bowtie and bowtie2 aligners, respectively.
+
+Nowadays, the two popular splice-aware aligners are
+
+- HISAT2 (based on bowtie2)
+- STAR (with its own aligner implementation).
+  Note that in the case of STAR, you have the possibility to build an index already incorporating
+  GTF information. It is also possible to provide GTF information at the runtime of the
+  STAR alignment.
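
In the SAM/BAM output of a splice-aware aligner, a read split across an intron carries an `N` (skipped region) operation in its CIGAR string. As a minimal sketch of how the split alignments described in `15_splice_aware_mapping.md` can be read back, the following Python function lists the intron intervals implied by a leftmost mapping position and a CIGAR string. The function name and the example values are hypothetical; coordinates are 1-based and inclusive, as in SAM.

```python
import re

# One CIGAR operation is a length followed by a single operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference genome.
REF_CONSUMING = set("MDN=X")


def splice_junctions(pos, cigar):
    """Return the (start, end) intron intervals implied by 'N' operations.

    pos   -- 1-based leftmost mapping position (SAM POS column)
    cigar -- CIGAR string (SAM CIGAR column)
    """
    junctions = []
    ref_pos = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            # The skipped region covers ref_pos .. ref_pos + length - 1.
            junctions.append((ref_pos, ref_pos + length - 1))
        if op in REF_CONSUMING:
            ref_pos += length
    return junctions


# Hypothetical read: 50 bases aligned, a 1000 bp intron, then 50 bases aligned.
print(splice_junctions(100, "50M1000N50M"))
```

For this example the first exon block covers positions 100 to 149, so the skipped region spans 150 to 1149. Soft clips (`S`) and insertions (`I`) do not consume reference bases and therefore do not shift the junction coordinates.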
diff --git a/docs/bulk_RNAseq-IOC/strandness.md → docs/bulk_RNAseq-IOC/16_strandness.md b/docs/bulk_RNAseq-IOC/strandness.md → docs/bulk_RNAseq-IOC/16_strandness.md
@@ -4,22 +4,25 @@
 In practice, with Illumina RNA-seq protocols you will most likely deal with either:
 
   - Unstranded RNAseq data
-
   - Stranded RNA-seq data produced with - kits and dUTP tagging (ISR)
 
-This information should be provided with your FASTQ files, ask your sequencing facility!
+This information, here called "the strandness of the libraries", should be provided by your
+sequencing platform along with your FASTQ files. If you cannot find the information, ask
+for it: it is always better than guessing ! If you are working on published data, the
+strandness of the libraries can often be deduced from the kit reference for library preparation.
 
-If not, try to find it on the site where you downloaded the data or
-in the corresponding publication.
+In the absence of strandness information, it is still possible to make a (very) good guess
+using a tool called `Infer Experiment` from the `RSeQC` tool suite.
 
-Another option is to estimate these parameters with a tool called `Infer Experiment` from
-the `RSeQC` tool suite. This tool takes the output of your mappings (BAM files), selects
+This tool takes the output of your mappings (BAM files), selects
 a subsample of your reads and compares their genome coordinates and strands with those of
 the reference gene model (from an annotation file).
 
 Based on the strand of the genes, it can gauge whether sequencing is strand-specific,
 and if so, how reads are stranded.
 
+
+
 ## Use of `Infer Experiment` tool
 
 ----
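
The principle described in `16_strandness.md` (comparing the strand of a subsample of reads with the strand of the genes they overlap) can be sketched in a few lines of Python. This is a simplified illustration for single-end reads, not the actual `Infer Experiment` implementation: the function name, the input format (a list of `(read_strand, gene_strand)` pairs) and the 0.8 decision threshold are all assumptions made for the example.

```python
from collections import Counter


def infer_strandness(pairs, threshold=0.8):
    """Guess library strandness from (read_strand, gene_strand) pairs.

    pairs     -- iterable of tuples such as ("+", "-"), one per sampled read
    threshold -- hypothetical fraction above which a library is called stranded
    Returns (fraction_same, fraction_opposite, call).
    """
    counts = Counter("same" if read == gene else "opposite" for read, gene in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no read overlapping an annotated gene")
    same = counts["same"] / total
    opposite = counts["opposite"] / total
    if same >= threshold:
        call = "stranded (reads on the gene strand)"
    elif opposite >= threshold:
        call = "stranded (reads on the opposite strand)"
    else:
        call = "unstranded"
    return same, opposite, call


# Hypothetical subsample: 9 reads out of 10 map opposite to their gene.
sample = [("+", "-")] * 5 + [("-", "+")] * 4 + [("+", "+")]
print(infer_strandness(sample))
```

With single-end dUTP libraries one would expect most reads on the strand opposite to the gene, whereas an unstranded library should give fractions close to 0.5 for each category.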

diff --git a/docs/bulk_RNAseq-IOC/hisat2.md → docs/bulk_RNAseq-IOC/17_hisat2.md b/docs/bulk_RNAseq-IOC/hisat2.md → docs/bulk_RNAseq-IOC/17_hisat2.md
diff --git a/docs/bulk_RNAseq-IOC/visu_map.md → ...ulk_RNAseq-IOC/18-1_UCSC_visualisation.md b/docs/bulk_RNAseq-IOC/visu_map.md → ...ulk_RNAseq-IOC/18-1_UCSC_visualisation.md
diff --git a/docs/bulk_RNAseq-IOC/star.md → docs/bulk_RNAseq-IOC/18_star.md b/docs/bulk_RNAseq-IOC/star.md → docs/bulk_RNAseq-IOC/18_star.md
diff --git a/docs/bulk_RNAseq-IOC/19_exercices_week_02_review.md b/docs/bulk_RNAseq-IOC/19_exercices_week_02_review.md
@@ -0,0 +1,30 @@
+## Issues with Slack ?
+
+## Issues with GitHub ?
+- [x] Does everyone have a GitHub ID ? 
+- [x] Was everyone able to create a readme file and make a pull request to the repository
+      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
+- [x] Was everyone able to retrieve the Galaxy workflow file (the one that you
+      generated during the first online meeting, with a .ga extension) and to add it to
+      the repository
+      [ARTbio_064_IOC_Bulk-RNAseq](https://github.com/ARTbio/ARTbio_064_IOC_Bulk-RNAseq) ?
+
+## Data upload to PSILO, then to Galaxy from PSILO
+- [x] Did everyone upload the necessary data to their
+      [PSILO account](https://psilo.sorbonne-universite.fr) ?
+- [x] Did everyone succeed in creating direct download links ?
+- [x] Did everyone succeed in transferring their PSILO data into a Galaxy history `Input dataset`
+      in their Galaxy account ?
+
+## Issues following the Galaxy training ?
+
+[Training on collection operations](https://training.galaxyproject.org/training-material/topics/galaxy-interface/tutorials/collections/tutorial.html)
+
+- Check whether `Relabel identifiers` tool is understood
+
+- Check whether `Extract element identifiers` tool is understood. Is the output dataset
+  from this tool uploaded in the appropriate GitHub folder ?
+
+## Check the input dataset histories of the participants
+
+... and their ability to create appropriate collections for the analysis
diff --git a/docs/bulk_RNAseq-IOC/intro_counting.md → docs/bulk_RNAseq-IOC/20_intro_counting.md b/docs/bulk_RNAseq-IOC/intro_counting.md → docs/bulk_RNAseq-IOC/20_intro_counting.md
diff --git a/docs/bulk_RNAseq-IOC/count.md → docs/bulk_RNAseq-IOC/21_FeatureCounts.md b/docs/bulk_RNAseq-IOC/count.md → docs/bulk_RNAseq-IOC/21_FeatureCounts.md