Skip to content

Latest commit

 

History

History
281 lines (182 loc) · 15.4 KB

output.md

File metadata and controls

281 lines (182 loc) · 15.4 KB

ramdaq: Output

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

All output files are saved at results directory. If you want to name it arbitrarily, you can change the directory name with the --outdir [dirname] option.

FastQC

FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences.

For further reading and documentation see the FastQC help.

NB: FastQC outputs two types of plots: before trimming (raw) and after trimming (trimmed). Trimming removes adapter sequences and potentially poor quality regions (use Fastqmcf, described below).

Output directory: results/fastqc (raw) and results/fastqc.trim (trimmed)

  • *.raw_fastqc.html(raw), *.trim_fastqc.html(trimmed)
    • FastQC report, containing quality metrics for your untrimmed raw fastq files
  • zips/*.raw_fastqc.zip(raw), *.trim_fastqc.zip(trimmed)
    • zip file containing the FastQC report, tab-delimited data file and plot images

fastq-mcf

fastq-mcf scans a sequence file for adapters, and, based on a log-scaled threshold, determines a set of clipping parameters and performs clipping. Also does skewing detection and quality filtering.

See usage for the options available in this pipeline. For further reading and documentation see the description.

Output directory: results/fastqmcf

  • *.trim.fastq.gz
    • Trimmed fastq files (compressed)

MultiQC - FastQC sequence counts plot

MultiQC - FastQC mean quality scores plot

MultiQC - FastQC sequence length distribution plot

HISAT2

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes as well as to a single reference genome. It introduced a new indexing scheme called a Hierarchical Graph FM index (HGFM) which when combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. The HISAT2 route through the pipeline is a good option if you have memory limitations on your compute.

See usage for the options available in this pipeline.

SAMtools

The original BAM files generated by the selected alignment algorithm are further processed with SAMtools to sort them by coordinate, for indexing, as well as to generate read mapping statistics.

Output directory: results/hisat2

  • *.bam
    • binary files that contains sequence alignment data
  • *.bam.bai
    • bam index files provide an index of the corresponding bam file
  • *.bam.flagstat
    • counts of the number of alignments for each FLAG type from each bam files
  • logs/*_summary.txt
    • HISAT2 alignment report containing the mapping results summary.

Output directory: results/merged_output_files

  • merged_hisat2_totalseq.txt
    • merged file with flagstat results

MultiQC - HISAT2 alignment scores plot

HISAT2 with rRNA reference

HISAT2, a fast and sensitive mapping tool as described above, is also used to quantify the number of reads derived from rRNAs. In this step, reads are mapped to the reference rRNA sequences.

See usage for the options available in this pipeline.

Output directory: results/hisat2_rrna

  • *.rrna.bam
    • binary files that contains sequence alignment data
  • *.rrna.bam.bai
    • bam index files provide an index of the corresponding bam file
  • *.rrna.bam.flagstat
    • counts of the number of alignments for each FLAG type from each bam files
  • logs/*_summary.txt
    • HISAT2 alignment report containing the mapping results summary.

MultiQC - HISAT2 rRNA alignment scores plot

featureCounts

featureCounts from the Subread package is a quantification tool used to summarise the mapped read distribution over genomic features such as genes, exons, promotors, gene bodies, genomic bins and chromosomal locations. We can also use featureCounts to count overlaps with different classes of genomic features. This provides an additional QC to check which features are most abundant in the sample, and to highlight potential problems such as rRNA contamination.

We output more detailed QC plots using our own custom rRNA, mitochondrial, and histone annotations, in addition to the usual transcript counts. Check out the details on local_annotation and see usage for the options available in this pipeline.

Output directory: results/featureCounts

  • gene_counts/*_gene.featureCounts.txt
    • gene-level quantification results for each sample
  • biotype_counts/*_mqc.tsv, biotype_counts/*_mqc.txt
    • MultiQC custom content files used to plot biotypes in report
  • gene_count_summaries/*_gene.featureCounts.txt.summary
    • overall statistics about the counts

Output directory: results/featureCounts_Histone, results/featureCounts_Mitocondria, results/featureCounts_rRNA

  • *.featureCounts.txt
    • quantification results for each annotation
  • *.featureCounts.txt.summary
    • overall statistics about the counts for each annotation

Output directory: results/merged_output_files

  • merged_featureCounts_gene.txt

    • Matrix of gene-level raw counts across all samples.
  • merged_featureCounts_gene_TPM.txt

    • Matrix of gene-level TPM (Transcripts Per Kilobase Million) counts across all samples
  • merged_featureCounts_gene_ERCC.txt

    • Matrix of ERCC (RNA Spike-in Controls for Gene Expression) count. Output only if the sample contains ERCC.
  • merged_featureCounts_gene_ERCC_log.txt

    • Matrix of ERCC (RNA Spike-in Controls for Gene Expression) base-10 logarithm count. Output only if the sample contains ERCC.
  • featureCounts biotype plot MultiQC - featureCounts biotype plot

  • featureCounts alignment summary plot MultiQC - featureCounts alignment plot

  • featureCounts assigned rate plot

    • This plot calculates the assigned rate in a different mothod than the featureCounts alignment summary plot. If --count_fractionally, --allow_overlap, and --allow_multimap options for featureCounts are set (by default), the rate is calculated based on fractional counts: When a read has x alignments (multi-mapping) and overlaps with y meta-features/features, each overlapping meta-features/features will receive a fractional count of 1/(x*y). MultiQC - featureCounts assigned rate plot
  • ercc_correlation_barplot MultiQC - ERCC correlation plot

RSeQC

RSeQC is a package of scripts designed to evaluate the quality of RNA-seq data. This pipeline runs several, but not all RSeQC scripts. Default will run : bam2wig.py, infer_experiment.py, read_distribution.py and inner_distance.py (on pair-end).

  • We note that ramdaq skips some of the bam-QC processes for the BAM files with low mapped reads (less than 10 by default; This threshold can be changed by the option --min_mapped_reads N). Thus, the samples with low mapped reads are omitted RSeQC and ReadCoverage.jl plots in MultiQC report.

Infer experiment

This script predicts the "strandedness" of the protocol (i.e. unstranded, sense or antisense) that was used to prepare the sample for sequencing by assessing the orientation in which aligned reads overlay gene features in the reference genome.

RSeQC documentation: infer_experiment.py

Output directory: results/rseqc/infer_experiment/

  • *.inferexp.txt
    • File containing fraction of reads mapping to given strandedness configurations.

MultiQC - RSeQC infer experiment plot

Read distribution

This tool calculates how mapped reads are distributed over genomic features.

RSeQC documentation: read_distribution.py

Output directory: results/rseqc/read_distribution/

  • *.readdist.txt
    • File containing fraction of reads mapping to genome feature e.g. CDS exon, 5’UTR exon, 3’ UTR exon, Intron, Intergenic regions etc.

Output directory: results/merged_output_files

  • merged_readdist_totalread.txt
    • The output of read distribution for aggregation of reads assigned to genomes.

MultiQC - RSeQC infer experiment plot

Inner distance

The inner distance script tries to calculate the inner distance between two paired-end reads.

RSeQC documentation: inner_distance.py

Output directory: results/rseqc/inner_distance/

  • *.inner_distance_plot.pdf
    • PDF file containing inner distance plot.
  • *.inner_distance_plot.r
    • R script used to generate pdf plot above.
  • *.inner_distance_freq.txt
    • File containing frequency of insert sizes.
  • *.inner_distance.txt
    • File containing mean, median and standard deviation of insert sizes.

MultiQC - RSeQC infer experiment plot

ReadCoverage.jl

Instead of RSeQC::geneBody_coverage.py we use ReadCoverage.jl, which can calculate absolute and relative gene body coverage of bulk/single-cell RNA-seq data very fast.

Output directory: results/rseqc/genebody_coverage/

  • *.geneBodyCoverage.txt
    • File containing results of calculating the RNA-seq reads coverage over gene body

MultiQC - ReadCoverage.jl plot

RSEM via Bowtie2

Bowtie2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s of characters to relatively long (e.g. mammalian) genomes. Bowtie2 is often the first step in pipelines for comparative genomics, including for variation calling, ChIP-seq, RNA-seq, BS-seq, but here we use it for transcripts mapping for isoform-level expression quantification by RSEM.

RSEM is a software package for estimating gene and isoform expression levels from RNA-seq data. It has been widely looked as one of the most accurate quantification tools for RNA-seq analysis. RSEM wraps other popular tools (i.e. STAR, Bowtie2, HISAT2; Bowtie2 is used in this pipeline) to map reads to the transcriptome before quantifying the gene- and isoform-level expression.

Output directory: results/rseqc/rsem_bowtie2_allgenes/

  • *.genes.results
    • gene-level quantification results for each sample.
  • *.isoforms.results
    • transcript-level quantification results for each sample.
  • *.stat
    • A folder that contains model parameters learned. All model related statistics are stored in this folder.

Output directory: results/merged_output_files/

  • merged_rsemResults_genes_TPM.txt
    • Matrix of gene-level expression quantification results in TPM for all samples.
  • merged_rsemResults_isoforms_TPM.txt
    • Matrix of transcript-level expression quantification results in TPM for all samples.

MultiQC - RSEM Mapped Reads plot

MultiQC - RSEM Multimapping rates plot

MultiQC

MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in within the report data directory.

The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.

Output directory: results/multiqc

  • Project_multiqc_report.html
    • MultiQC report - a standalone HTML file that can be viewed in your web browser
  • Project_multiqc_data/
    • Directory containing parsed statistics from the different tools used in the pipeline

For more information about how to use MultiQC reports, see http://multiqc.info

The following are the sample MultiQC reports output using the test data.

Pipeline information

Output directory: results/pipeline_info

After execution, ramdaq outputs an HTML execution report which includes useful metrics about a workflow execution such as the actual commands executed by each process. (see below for details).

  • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.csv.