-
Notifications
You must be signed in to change notification settings - Fork 10
Bulk RNA Seq tutorial
This tutorial will go through an end to end analysis for RNA-Seq analysis using Eoulsan. Before you start, you will need an installed copy of Eoulsan (see the installation guide).
This tutorial will use an example mouse RNA-Seq dataset generated by an Illumina NextSeq 500 sequencer.
Warning: You need about 48 GB of RAM to perform this tutorial as we use STAR to map read on mouse genome. The analysis need about 2 hours to perform. The mouse STAR index creation, is the longuest step of the analysis but thanks to Eoulsan repository feature, this index will be only computed once.
Outline of the process:
Step | Module used | Inputs | Output |
---|---|---|---|
Create a genome index for STAR | starindexgenerator | FASTA | ZIP |
Quality control of the raw FASTQ files with FastQC | fastqc | FASTQ | HTML |
Filter raw FASTQ files | filterreads | FASTQ | FASTQ |
Map the reads using STAR | mapreads | FASTQ | SAM |
Remove unmap and multimatches SAM entries | filtersam | SAM | SAM |
Convert SAM files into sorted and indexed BAM files | sam2bam | SAM | BAM |
Convert BAM files int BedGraph files | bam2bedgraph | BAM | BedGraph |
Convert BAM files int BED files | splice2bed | BAM | BED |
Count alignments with HTSeq-count | expression | SAM | TSV |
Convert expression files into XLSX files | expressionresultsannotation | TSV | XLSX |
Quality control of the analysis with MultiQC | multiqc | FASTQ | HTML |
Normalisation and differential analysis with DESeq 2 | normalization | TSV | TSV + graphics |
Convert differential analysis output files into XLSX files | expressionresultsannotation | TSV | XLSX |
# Step 1: get data (Raw data, genome sequence and annotations)
$ wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0267_S1_L001_R1_001.fastq.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0268_S2_L001_R1_001.fastq.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0269_S3_L001_R1_001.fastq.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0270_S4_L001_R1_001.fastq.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0271_S5_L001_R1_001.fastq.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/rnaseq-illumina/2013_0272_S6_L001_R1_001.fastq.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.fasta.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gff.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.gtf.bz2 && \
wget http://outils.genomique.biologie.ens.fr/leburon/downloads/eoulsan-samples/genomes/mm10ens91.tsv.bz2
# Step 2: create a mockup of the design file
$ eoulsan.sh createdesign *.fq.bz2 mm10ens91.fasta.bz2 mm10ens91.gff.bz2 mm10ens91.gtf.bz2 mm10ens91.tsv.bz2
# Step 3: edit the design file using a text editor or a spreadsheet
$ vim design.txt
# Step 3 bis: download the ready to use design file
$ wget https://github.com/GenomicParisCentre/eoulsan/wiki/files/design-rnaseq.txt && mv design-rnaseq.txt design.txt
# Step 4: download workflow file
$ wget https://github.com/GenomicParisCentre/eoulsan/wiki/files/workflow-rnaseq.xml
# Step 4: launch the analysis
$ eoulsan.sh exec workflow-rnaseq.xml design.txt
We provide in this section some samples files to test Eoulsan. These files have been produced during a mouse RNA-Seq experiment sequenced on an Illumina GAIIx.
-
Read files from two different conditions
-
Mouse genome file
-
GFF3 feature annotation file
-
GTF3 feature annotation file
-
Additional annotation file
-
Worflow file
- workflow-local.xml. This file is an already created workflow file to analyze previous reads files.
-
Design file
- design.txt. This file is an already created design file to analyze previous reads files.
In an empty directory, copy the reads, genome and annotation files, then you can create a mockup of the design file with the next command:
$ eoulsan.sh createdesign *.fq.bz2 mm10ens91.fasta.bz2 mm10ens91.gff.bz2 mm10ens91.gtf.bz2 mm10ens91.tsv.bz2
You can now modify the design file to add additional information using a text editor (e.g. vim, nano, emacs...) or a spreadsheet (MS Excel, Libreoffice...). Note that Eoulsan handle automatically compressed files.
For this tutorial, we will use the following design file that can be downloaded:
[Header]
DesignFormatVersion=2
GenomeFile=mm10ens91.fasta.bz2
GffFile=mm10ens91.gff.bz2
GtfFile=mm10ens91.gtf.bz2
AdditionalAnnotationFile=mm10ens91.tsv.bz2
[Experiments]
Exp.exp1.name=demo-mouse-rnaseq
[Columns]
SampleId SampleName Reads Date FastqFormat RepTechGroup Exp.exp1.Condition Exp.exp1.Reference UUID
20130267 KO1 2013_0267_S1_L001_R1_001.fastq.bz2 2015-10-04 fastq-sanger KO1 KO 0 7a392c56-3018-471f-8d39-8f1b03669c35
20130268 KO2 2013_0268_S2_L001_R1_001.fastq.bz2 2015-10-04 fastq-sanger KO2 KO 0 68a87e5b-1022-4288-914e-eb3f13985a7f
20130269 KO3 2013_0269_S3_L001_R1_001.fastq.bz2 2015-10-04 fastq-sanger KO3 KO 0 1dd4f0ab-6b31-41d5-93f6-69fd2a474697
20130270 WT1 2013_0270_S4_L001_R1_001.fastq.bz2 2015-10-04 fastq-sanger WT1 WT 1 36995c69-2685-4f29-822c-b3ee5bd423b4
20130271 WT2 2013_0271_S5_L001_R1_001.fastq.bz2 2015-10-04 fastq-sanger WT2 WT 1 7355ddc7-2426-48fc-aa1d-a88eefc8e5c2
20130272 WT3 2013_0272_S6_L001_R1_001.fastq.bz2 2015-10-04 fastq-sanger WT3 WT 1 004b53eb-e77f-4267-bf1d-98ef21cd5119
The experiment defined in this design file use a simple design. The Eoulsan design file allow users to define complex models for DESeq2 and choose the comparisons to perform. Technical replicates (e.g. FASTQ files for the same sample in different lines of a run) must have the same name in the "RepTechGroup" column.
More information about the design file is available on the reference site of Eoulsan.
To create a workflow file, the best solution is to reuse an existing workflow file and adapt it to your needs. You can download an existing workflow to analyse this data with the following command:
$ wget http://outils.genomique.biologie.ens.fr/eoulsan2/workflow-local.xml
The workflow file contains the list of the steps that will be executed by Eoulsan. Each step have parameters and it is related to a module to execute. Some step modules are only available in local mode (like differential analysis) or in distributed mode, see the modules page for more details. For each step you can change, add or remove parameters. Parameters are specific to a module, consult the documentation of the built-in steps for the list of available parameters of each step.
At least, there is a global section in the workflow file that override the values of Eoulsan configuration file. This section is useful to set for example the path of the temporary directory to use.
Unlike other mapper (e.g. Bowtie/BWA), STAR creation index can be customized using parameters. This step allow to create a custom index for STAR. If you don't need a custom STAR index, you can remove this step from your workflow. The following configuration example will use a GFF3 annotation file to get the position of the exons of the transcripts of the reference genome.
<!-- Create a custom STAR index using feature annotation -->
<step skip="false">
<module>starindexgenerator</module>
<parameters>
<!-- The overhang value must be greater than the length of the reads -->
<parameter>
<name>overhang</name>
<value>100</value>
</parameter>
<!-- Use a feature annotation file -->
<parameter>
<name>use.gtf.file</name>
<value>true</value>
</parameter>
<!-- The format of feature file is GFF -->
<parameter>
<name>features.file.format</name>
<value>gff</value>
</parameter>
<!-- The name of the feature for exon is "exon" in the feature annotation file -->
<parameter>
<name>gtf.feature.exon</name>
<value>exon</value>
</parameter>
<!-- For each exon feature we can get the transcript Id using the "Parent" tag -->
<parameter>
<name>gtf.tag.exon.parent.transcript</name>
<value>Parent</value>
</parameter>
</parameters>
</step>
This step allow to launch FastQC on the raw data. To enable this step, just add the following lines in your workflow file. More parameters are available in the FastQC module, please see the FastQC module page in the reference manual for more informations.
<!-- FastQC of raw reads -->
<step id="rawfastqc" skip="false">
<module>fastqc</module>
</step>
This step allow to remove adapters from raw reads. This step is often useless for RNA-Seq, so we recommend to add this step to your worklfow only if FastQC has shown an adapter contamination as it takes a lot of time to perform.
<!-- Remove adapters with Trim Galore!/cutadapt -->
<step skip="false" discardoutput="asap">
<module>trimadapt</module>
<parameters>
<!-- Trim low-quality ends from reads -->
<parameter>
<name>quality</name>
<value>20</value>
</parameter>
<!-- Maximum allowed error rate -->
<parameter>
<name>error</name>
<value>0.1</value>
</parameter>
<!-- Overlap with adapter sequence required to trim a sequence -->
<parameter>
<name>stringency</name>
<value>1</value>
</parameter>
<!-- Discard reads that became shorter than parameter value because of either quality or adapter trimming -->
<parameter>
<name>length</name>
<value>20</value>
</parameter>
<!-- Paired-end data -->
<parameter>
<name>is.paired</name>
<value>no</value>
</parameter>
</parameters>
</step>
This step allow to remove bad quality reads.
The following configuration example will remove reads that do not pass Illumina filter, trim polyN reads and remove all the reads with a length lower than 40 and with a mean quality lower to 30.
For more information about the filterreads
module see the reference documentation.
<!-- Filter reads -->
<step skip="false" discardoutput="asap">
<module>filterreads</module>
<parameters>
<!-- Remove all the read that do not pass Illumina QC -->
<parameter>
<name>illuminaid</name>
<value></value>
</parameter>
<!-- Remove polyN tails of the reads -->
<parameter>
<name>trimpolynend</name>
<value></value>
</parameter>
<!-- Remove reads with length lower than 40 -->
<parameter>
<name>length.minimal.length.threshold</name>
<value>40</value>
</parameter>
<!-- Remove reads with mean QScore < 30 -->
<parameter>
<name>quality.threshold</name>
<value>30</value>
</parameter>
</parameters>
</step>
This step allow to map FASTQ read. The following configuration example will use STAR with default parameters.
STAR is only parameted to keep unmapped read in output SAM files. Eoulsan is bundled with many mappers (Bowtie, Bowtie2, BWA, Minimap, Gmap/Gsnap, STAR...). For more information about the mapreads
module see the reference documentation.
Note that with STAR or Minimap2, custom indexes can used.
To create this custom indexes, you need to add a dedicated step (the starindexgenerator module for STAR and the minimap2indexgenerator module for Minimap2.
<!-- Mapping of the reads -->
<step skip="false" discardoutput="asap">
<module>mapreads</module>
<parameters>
<!-- Use use STAR as mapper -->
<parameter>
<name>mapper</name>
<value>star</value>
</parameter>
<!-- The version of STAR to use is 2.6.1b -->
<parameter>
<name>mapper.version</name>
<value>2.6.1b</value>
</parameter>
<!-- We define here the additional STAR parameter (keep unmap read in SAM output) -->
<parameter>
<name>mapper.arguments</name>
<value>--outSAMunmapped Within</value>
</parameter>
</parameters>
</step>
In this step, we will remove all SAM entries that are unmapped or multimatches.
Like the filterreads
module, this step is highly configurable with additional plugins.
For more information about the filtersam
module see the reference documentation.
<!-- Filtering of the SAM alignments -->
<step skip="false" discardoutput="asap">
<module>filtersam</module>
<parameters>
<!-- Remove the unmap entries in the files -->
<parameter>
<name>removeunmapped</name>
<value>true</value>
</parameter>
<!-- Remove the multimaches alignments -->
<parameter>
<name>removemultimatches</name>
<value>true</value>
</parameter>
</parameters>
</step>
With this optional step, we will convert SAM files converted in the previous step in BAM files.
The generated BAM files are sorted by coordinates. For each created BAM, an index file (BAI file) is also created.
For more information about the sam2bam
module see the reference documentation.
<!-- Convert SAM to BAM file and sort the BAM file by coordinate -->
<step skip="false">
<module>sam2bam</module>
<parameters>
<!-- Compression level of the BAM file -->
<parameter>
<name>compression.level</name>
<value>5</value>
</parameter>
</parameters>
</step>
In this optional step, BAM files will be converted in BedGraph files.
Note that this step require the bam-to-bedgraph.xml
Galaxy tool wrapper that is available on the eoulsan-tool GitHub repository.
<!-- Create Bedgraph files from the BAM files -->
<step dataproduct="match" skip="false">
<module>bam2bedgraph</module>
<parameters/>
</step>
With this optional step, positions of the alignments will be converted in BED 12 files.
For more information about the splice2bed
module see the reference documentation.
<!-- Create Bed files from the SAM files -->
<step dataproduct="match" skip="false">
<module>splice2bed</module>
<parameters/>
</step>
This step allow to compute the expression of the gene/transcripts.
Currently only the htseq-count
counter is implemented in Eoulsan.
Most of the option of this module are the standard options of HTSeq-count.
For more information about this expression
module, see the HTSeq-count and module documentation.
<!-- Compute expression -->
<step skip="false">
<module>expression</module>
<parameters>
<!-- We use the fast Java implementation of htseq-count to perform the counting -->
<parameter>
<name>counter</name>
<value>htseq-count</value>
</parameter>
<!-- The output file file will be in TSV format -->
<parameter>
<name>output.file.format</name>
<value>tsv</value>
</parameter>
<!-- The input feature annotation file is in GTF format -->
<parameter>
<name>features.file.format</name>
<value>gtf</value>
</parameter>
<!-- The feature type to use is "exon", we count by exon -->
<parameter>
<name>genomic.type</name>
<value>exon</value>
</parameter>
<!-- We aggregate the counts by genes -->
<parameter>
<name>attribute.id</name>
<value>gene_id</value>
</parameter>
<!-- The library was oriented -->
<parameter>
<name>stranded</name>
<value>yes</value>
</parameter>
<!-- The use the union mode as overlap mode -->
<parameter>
<name>overlap.mode</name>
<value>union</value>
</parameter>
<!-- We do not remove ambiguous cases -->
<parameter>
<name>remove.ambiguous.cases</name>
<value>false</value>
</parameter>
</parameters>
</step>
The expression files generated in the previous step are simple tabulated text file with only two columns: the identifiers of the feature and the counts.
This files are often opened with spreadsheet software like Microsoft Excel.
Unfortunately this tools can misinterpret the content of the file.
That's why with this step the TSV files are converted in XLSX files (or ODS for LibreOffice).
Moreover, this step add several column with useful information (e.g. gene name, gene description...) that is coming from the additional annotation file.
For more information about this expressionresultsannotation
module, see the module documentation.
<!-- Convert expression files to XLSX files -->
<step skip="false">
<module>expressionresultsannotation</module>
<parameters>
<!-- Select the output of the step, ODS and TSV format are also supported -->
<parameter>
<name>output.format</name>
<value>xlsx</value>
</parameter>
</parameters>
</step>
MultiQC is tool that has been developed to aggregate all the FastQC reports of an analysis in one report.
Then, it has been enhanced to support many other tools.
With this step, we will generate a MultiQC report with data that come from the FastQC, mapping and expression steps.
For more information about this multiqc
module, see the module documentation.
<!-- MultiQC -->
<step skip="false">
<module>multiqc</module>
<parameters>
<!-- MultiQC will contain reports from FastQC, STAR and HTSeq-count -->
<parameter>
<name>reports</name>
<value>fastqc,mapreads,expression</value>
</parameter>
<!-- Docker is required to launch this step -->
<parameter>
<name>use.docker</name>
<value>true</value>
</parameter>
</parameters>
</step>
With this step, the expression data will be normalized and then the differential analysis will be performed.
This module allow to use both standard and complex experimental design.
The design file documentation, explain how to define the experimental design of your experiment.
This module can only handle experimental design with biological replicates.
If you don't have any biological replicates, please use instead the DESeq1 module.
The r.execution.mode
parameter allow to choose the method to use for launching R (local, RServe or Docker).
We recommend you to use docker.
For more information about this deseq2
module, see the module documentation.
Note: You must define the main.docker.uri
parameter in your global section of your workflow file (or in your configuration file) to enable docker.
Usually the value of this parameter id unix:///var/run/docker.sock
.
<!-- Normalization and differential analysis with DESeq 2 -->
<step skip="false">
<module>deseq2</module>
<parameters>
<!-- Generate normalization plots -->
<parameter>
<name>norm.fig</name>
<value>true</value>
</parameter>
<!-- Genearte differential analysis plots -->
<parameter>
<name>diffana.fig</name>
<value>true</value>
</parameter>
<!-- Enable Normalization -->
<parameter>
<name>norm.diffana</name>
<value>true</value>
</parameter>
<!-- Enable differential analysis -->
<parameter>
<name>diffana</name>
<value>true</value>
</parameter>
<!-- Set the size factor type to "ratio" -->
<parameter>
<name>size.factors.type</name>
<value>ratio</value>
</parameter>
<!-- Set the fit type to "parametric" -->
<parameter>
<name>fit.type</name>
<value>parametric</value>
</parameter>
<!-- Use Wald as statistical test -->
<parameter>
<name>statistical.test</name>
<value>Wald</value>
</parameter>
<!-- Use Docker to execute this step -->
<parameter>
<name>r.execution.mode</name>
<value>docker</value>
</parameter>
<!-- Define the Docker image to use (optional) -->
<parameter>
<name>docker.image</name>
<value>bioconductor/release_sequencing:3.1</value>
</parameter>
</parameters>
</step>
The text files generated by the previous step are simple tabulated text files.
This files are often opened with spreadsheet software like Microsoft Excel.
Unfortunately this tools can misinterpret the content of the text file.
That's why with this step the TSV files are converted in XLSX files (or ODS for LibreOffice).
Moreover, this step add several column with useful information (e.g. gene name, gene description...) that is coming from the additional annotation file.
For more information about this diffanaresultsannotation
module, see the module documentation.
<!-- Convert ourput of the differential analysis step to XLSX files -->
<step skip="false">
<module>diffanaresultsannotation</module>
<parameters>
<!-- Select the output of the step, ODS and TSV format are also supported -->
<parameter>
<name>output.format</name>
<value>xlsx</value>
</parameter>
</parameters>
</step>
This section of the workflow file define the global configuration of the analysis. For this analysis, path to additional data format, Galaxy tools and Docker connection URI are set.
Note: You can also define this global parameters in a configuration file like the default Eoulsan configuration file ~/.eoulsan
, see the configuration file documentation for more information.
<!-- This section contains globals parameters that can also be define in a configuration file. -->
<globals>
<!-- Define the location of the Galaxy tools/modules to use -->
<parameter>
<name>main.galaxy.tool.path</name>
<value>https://raw.githubusercontent.com/GenomicParisCentre/eoulsan-tools/master/galaxytools</value>
</parameter>
<!-- Define the location of additional formats to use -->
<parameter>
<name>main.format.path</name>
<value>https://raw.githubusercontent.com/GenomicParisCentre/eoulsan-tools/master/formats</value>
</parameter>
<!-- Define the location of the Docker connection -->
<parameter>
<name>main.docker.uri</name>
<value>unix:///var/run/docker.sock</value>
</parameter>
</globals>
Once your design file and workflow file are ready, you can launch Eoulsan analysis with the following command:
$ eoulsan.sh exec workflow-local.xml design.txt
Warning: To perform the normalization and differential analysis steps of this workflow, this demo requires Docker. If you want to run this demo without Docker, you must must install R (or a RServe server) and the related packages (See differential analysis module documentation for more information) and change the execution.mode parameter of the normalization and diffana steps.
Once started and before starting analysis, Eoulsan will check if:
- The design file and workflow file are valid
- All the modules related to the steps exist
- The workflow of all the steps are valid
- The order of the steps is correct (a step can not use data generated after its end)
- The input files are valid (fasta and annotation files)
- A genome mapper index is needed. In this case, a step to generate this index is automatically added
If successful, you will obtain a new directory name like eoulsan-20190910-101235
with log files about the analysis.
Results files are stored in the current directory of the user.