This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The pipeline is built using Nextflow and processes data using the following steps:
- FastQC - read quality control
- fastq-mcf - fastq trimming
- HISAT2 - memory efficient splice aware alignment to a reference
- HISAT2 with rRNA reference - mapped to the reference rRNA sequences
- SAMtools - sort and index alignments
- featureCounts - read counting relative to gene and biotype
- RSeQC - various RNA-seq QC metrics
- ReadCoverage.jl - calculate absolute and relative gene body coverage
- RSEM via Bowtie2 - alignment and quantification of expression levels
- MultiQC - aggregate report, describing results of the whole pipeline
- Pipeline information - report metrics generated during the workflow execution
All output files are saved at results
directory. If you want to name it arbitrarily, you can change the directory name with the --outdir [dirname]
option.
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences.
For further reading and documentation see the FastQC help.
NB: FastQC outputs two types of plots: before trimming (raw) and after trimming (trimmed). Trimming removes adapter sequences and potentially poor quality regions (use Fastqmcf, described below).
Output directory: results/fastqc (raw) and results/fastqc.trim (trimmed)
*.raw_fastqc.html
(raw),*.trim_fastqc.html
(trimmed)- FastQC report, containing quality metrics for your untrimmed raw fastq files
zips/*.raw_fastqc.zip
(raw),*.trim_fastqc.zip
(trimmed)- zip file containing the FastQC report, tab-delimited data file and plot images
fastq-mcf scans a sequence file for adapters, and, based on a log-scaled threshold, determines a set of clipping parameters and performs clipping. Also does skewing detection and quality filtering.
See usage for the options available in this pipeline. For further reading and documentation see the description.
Output directory: results/fastqmcf
*.trim.fastq.gz
- Trimmed fastq files (compressed)
HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes as well as to a single reference genome. It introduced a new indexing scheme called a Hierarchical Graph FM index (HGFM) which when combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. The HISAT2 route through the pipeline is a good option if you have memory limitations on your compute.
See usage for the options available in this pipeline.
The original BAM files generated by the selected alignment algorithm are further processed with SAMtools to sort them by coordinate, for indexing, as well as to generate read mapping statistics.
Output directory: results/hisat2
*.bam
- binary files that contains sequence alignment data
*.bam.bai
- bam index files provide an index of the corresponding bam file
*.bam.flagstat
- counts of the number of alignments for each FLAG type from each bam files
logs/*_summary.txt
- HISAT2 alignment report containing the mapping results summary.
Output directory: results/merged_output_files
merged_hisat2_totalseq.txt
- merged file with flagstat results
HISAT2, a fast and sensitive mapping tool as described above, is also used to quantify the number of reads derived from rRNAs. In this step, reads are mapped to the reference rRNA sequences.
See usage for the options available in this pipeline.
Output directory: results/hisat2_rrna
*.rrna.bam
- binary files that contains sequence alignment data
*.rrna.bam.bai
- bam index files provide an index of the corresponding bam file
*.rrna.bam.flagstat
- counts of the number of alignments for each FLAG type from each bam files
logs/*_summary.txt
- HISAT2 alignment report containing the mapping results summary.
featureCounts from the Subread package is a quantification tool used to summarise the mapped read distribution over genomic features such as genes, exons, promotors, gene bodies, genomic bins and chromosomal locations. We can also use featureCounts to count overlaps with different classes of genomic features. This provides an additional QC to check which features are most abundant in the sample, and to highlight potential problems such as rRNA contamination.
We output more detailed QC plots using our own custom rRNA, mitochondrial, and histone annotations, in addition to the usual transcript counts. Check out the details on local_annotation and see usage for the options available in this pipeline.
Output directory: results/featureCounts
gene_counts/*_gene.featureCounts.txt
- gene-level quantification results for each sample
biotype_counts/*_mqc.tsv, biotype_counts/*_mqc.txt
- MultiQC custom content files used to plot biotypes in report
gene_count_summaries/*_gene.featureCounts.txt.summary
- overall statistics about the counts
Output directory: results/featureCounts_Histone, results/featureCounts_Mitocondria, results/featureCounts_rRNA
*.featureCounts.txt
- quantification results for each annotation
*.featureCounts.txt.summary
- overall statistics about the counts for each annotation
Output directory: results/merged_output_files
-
merged_featureCounts_gene.txt
- Matrix of gene-level raw counts across all samples.
-
merged_featureCounts_gene_TPM.txt
- Matrix of gene-level TPM (Transcripts Per Kilobase Million) counts across all samples
-
merged_featureCounts_gene_ERCC.txt
- Matrix of ERCC (RNA Spike-in Controls for Gene Expression) count. Output only if the sample contains ERCC.
-
merged_featureCounts_gene_ERCC_log.txt
- Matrix of ERCC (RNA Spike-in Controls for Gene Expression) base-10 logarithm count. Output only if the sample contains ERCC.
-
featureCounts assigned rate plot
- This plot calculates the assigned rate in a different mothod than the featureCounts alignment summary plot. If
--count_fractionally
,--allow_overlap
, and--allow_multimap
options for featureCounts are set (by default), the rate is calculated based on fractional counts: When a read hasx
alignments (multi-mapping) and overlaps withy
meta-features/features, each overlapping meta-features/features will receive a fractional count of1/(x*y)
.
- This plot calculates the assigned rate in a different mothod than the featureCounts alignment summary plot. If
RSeQC is a package of scripts designed to evaluate the quality of RNA-seq data. This pipeline runs several, but not all RSeQC scripts. Default will run : bam2wig.py
, infer_experiment.py
, read_distribution.py
and inner_distance.py
(on pair-end).
- We note that ramdaq skips some of the bam-QC processes for the BAM files with low mapped reads (less than 10 by default; This threshold can be changed by the option --min_mapped_reads N). Thus, the samples with low mapped reads are omitted RSeQC and ReadCoverage.jl plots in MultiQC report.
This script predicts the "strandedness" of the protocol (i.e. unstranded, sense or antisense) that was used to prepare the sample for sequencing by assessing the orientation in which aligned reads overlay gene features in the reference genome.
RSeQC documentation: infer_experiment.py
Output directory: results/rseqc/infer_experiment/
*.inferexp.txt
- File containing fraction of reads mapping to given strandedness configurations.
This tool calculates how mapped reads are distributed over genomic features.
RSeQC documentation: read_distribution.py
Output directory: results/rseqc/read_distribution/
*.readdist.txt
- File containing fraction of reads mapping to genome feature e.g. CDS exon, 5’UTR exon, 3’ UTR exon, Intron, Intergenic regions etc.
Output directory: results/merged_output_files
merged_readdist_totalread.txt
- The output of read distribution for aggregation of reads assigned to genomes.
The inner distance script tries to calculate the inner distance between two paired-end reads.
RSeQC documentation: inner_distance.py
Output directory: results/rseqc/inner_distance/
*.inner_distance_plot.pdf
- PDF file containing inner distance plot.
*.inner_distance_plot.r
- R script used to generate pdf plot above.
*.inner_distance_freq.txt
- File containing frequency of insert sizes.
*.inner_distance.txt
- File containing mean, median and standard deviation of insert sizes.
Instead of RSeQC::geneBody_coverage.py
we use ReadCoverage.jl, which can calculate absolute and relative gene body coverage of bulk/single-cell RNA-seq data very fast.
Output directory: results/rseqc/genebody_coverage/
*.geneBodyCoverage.txt
- File containing results of calculating the RNA-seq reads coverage over gene body
Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s of characters to relatively long (e.g. mammalian) genomes. Bowtie2 is often the first step in pipelines for comparative genomics, including for variation calling, ChIP-seq, RNA-seq, BS-seq, but here we use it for transcripts mapping for isoform-level expression quantification by RSEM.
RSEM is a software package for estimating gene and isoform expression levels from RNA-seq data. It has been widely looked as one of the most accurate quantification tools for RNA-seq analysis. RSEM wraps other popular tools (i.e. STAR, Bowtie2, HISAT2; Bowtie2 is used in this pipeline) to map reads to the transcriptome before quantifying the gene- and isoform-level expression.
Output directory: results/rseqc/rsem_bowtie2_allgenes/
*.genes.results
- gene-level quantification results for each sample.
*.isoforms.results
- transcript-level quantification results for each sample.
*.stat
- A folder that contains model parameters learned. All model related statistics are stored in this folder.
Output directory: results/merged_output_files/
merged_rsemResults_genes_TPM.txt
- Matrix of gene-level expression quantification results in TPM for all samples.
merged_rsemResults_isoforms_TPM.txt
- Matrix of transcript-level expression quantification results in TPM for all samples.
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in within the report data directory.
The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.
Output directory: results/multiqc
Project_multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser
Project_multiqc_data/
- Directory containing parsed statistics from the different tools used in the pipeline
For more information about how to use MultiQC reports, see http://multiqc.info
The following are the sample MultiQC reports output using the test data.
-
Mouse
-
Human
Output directory: results/pipeline_info
After execution, ramdaq outputs an HTML execution report which includes useful metrics about a workflow execution such as the actual commands executed by each process. (see below for details).
- Reports generated by Nextflow:
execution_report.html
,execution_timeline.html
,execution_trace.txt
andpipeline_dag.dot
/pipeline_dag.svg
. - Reports generated by the pipeline:
pipeline_report.html
,pipeline_report.txt
andsoftware_versions.csv
.