Skip to content

3. RNA Seq Workflow_Bash Script

Fredrick-Kakembo edited this page Aug 26, 2020 · 3 revisions

Workflow

Step 1. Quality check using FastQC and multiqc

Description

"FastQC aims to provide a way to do quality control checks on sequence data. Within the fastq file is quality information that refers to the accuracy of each base call. This helps to determine any irregularies or features that make affect your results such as adapter contamination

Installation

conda install -c bioconda fastqc --yes
conda install -c bioconda multiqc --yes

Command

echo -e "\n Data Preprocessing... \n"

mkdir -p QC_Reports  #create directory for the fastqc output

#Quality Check using 'FastQC'
for sample in `cat sample_id.txt` 
do
	fastqc ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz --outdir QC_Reports 
done

#Aggregate FastQC results using 'MultiQC'
multiqc QC_Reports -o  QC_Reports 

Output

├── multiqc_data
│   ├── multiqc_data.json
│   ├── multiqc_fastqc.txt
│   ├── multiqc_general_stats.txt
│   ├── multiqc.log
│   └── multiqc_sources.txt
├── multiqc_report.html
├── sample37_R1_fastqc.html
├── sample37_R1_fastqc.zip
├── sample37_R2_fastqc.html
├── sample37_R2_fastqc.zip
├── sample38_R1_fastqc.html
├── sample38_R1_fastqc.zip
├── sample38_R2_fastqc.html
├── sample38_R2_fastqc.zip
├── sample39_R1_fastqc.html
├── sample39_R1_fastqc.zip
├── sample39_R2_fastqc.html
├── sample39_R2_fastqc.zip
├── sample40_R1_fastqc.html
├── sample40_R1_fastqc.zip
├── sample40_R2_fastqc.html
├── sample40_R2_fastqc.zip
├── sample41_R1_fastqc.html
├── sample41_R1_fastqc.zip
├── sample41_R2_fastqc.html
├── sample41_R2_fastqc.zip
├── sample42_R1_fastqc.html
├── sample42_R1_fastqc.zip
├── sample42_R2_fastqc.html
└── sample42_R2_fastqc.zip

Step 2. Removing Low Quality Sequences with Trim_Galore

http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

Description

Trim Galore is a wrapper script to automate quality and adapter trimming as well as quality control.After analyzing the quality of the data, the next step is to remove sequences that do not meet quality standards. Trim_galore combines Cutadapt (http://cutadapt.readthedocs.io/en/stable/guide.html) and FastQC to remove low quality sequences while performing quality analysis to see the effect of filtering. The 2 most import parameters to select are: minimum Phred score and minimum sequencing length. A good estimate is typically a Phred score of 20 (99% confidence) and a minimum of 50-70% of the sequence length.

Installation

conda install -c bioconda trim-galore --yes

Command

mkdir -p Trim_galore

for sample in `cat sample_id.txt`
do
	trim_galore -j 10 --paired ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz -q 25 --length 20 --fastqc -o Trim_galore
done
# -j 10 specifies the number of cores to be used for trimming  
#--paired runs a paired-end validation on both trimmed R1 and R2 FastQ files to removes entire read pairs if at least one of the two sequences became shorter than the threshold
# -q 25 to remove portions of the reads with phred score less than 25 at the 3' end  
# --length 20 to filter trimmed reads less than 20 bp


#Quality Check the trimmed Reads with 'Multiqc'
multiqc Trim_galore -o Trim_galore

Output

├── multiqc_data
│   ├── multiqc_cutadapt.txt
│   ├── multiqc_data.json
│   ├── multiqc_fastqc.txt
│   ├── multiqc_general_stats.txt
│   ├── multiqc.log
│   └── multiqc_sources.txt
├── multiqc_report.html
├── sample37_R1.fastq.gz_trimming_report.txt
├── sample37_R1_val_1_fastqc.html
├── sample37_R1_val_1_fastqc.zip
├── sample37_R1_val_1.fq.gz
├── sample37_R2.fastq.gz_trimming_report.txt
├── sample37_R2_val_2_fastqc.html
├── sample37_R2_val_2_fastqc.zip
├── sample37_R2_val_2.fq.gz
├── sample38_R1.fastq.gz_trimming_report.txt
├── sample38_R1_val_1_fastqc.html
├── sample38_R1_val_1_fastqc.zip
├── sample38_R1_val_1.fq.gz
├── sample38_R2.fastq.gz_trimming_report.txt
├── sample38_R2_val_2_fastqc.html
├── sample38_R2_val_2_fastqc.zip
├── sample38_R2_val_2.fq.gz
├── sample39_R1.fastq.gz_trimming_report.txt
├── sample39_R1_val_1_fastqc.html
├── sample39_R1_val_1_fastqc.zip
├── sample39_R1_val_1.fq.gz
├── sample39_R2.fastq.gz_trimming_report.txt
├── sample39_R2_val_2_fastqc.html
├── sample39_R2_val_2_fastqc.zip
├── sample39_R2_val_2.fq.gz
├── sample40_R1.fastq.gz_trimming_report.txt
├── sample40_R1_val_1_fastqc.html
├── sample40_R1_val_1_fastqc.zip
├── sample40_R1_val_1.fq.gz
├── sample40_R2.fastq.gz_trimming_report.txt
├── sample40_R2_val_2_fastqc.html
├── sample40_R2_val_2_fastqc.zip
├── sample40_R2_val_2.fq.gz
├── sample41_R1.fastq.gz_trimming_report.txt
├── sample41_R1_val_1_fastqc.html
├── sample41_R1_val_1_fastqc.zip
├── sample41_R1_val_1.fq.gz
├── sample41_R2.fastq.gz_trimming_report.txt
├── sample41_R2_val_2_fastqc.html
├── sample41_R2_val_2_fastqc.zip
├── sample41_R2_val_2.fq.gz
├── sample42_R1.fastq.gz_trimming_report.txt
├── sample42_R1_val_1_fastqc.html
├── sample42_R1_val_1_fastqc.zip
├── sample42_R1_val_1.fq.gz
├── sample42_R2.fastq.gz_trimming_report.txt
├── sample42_R2_val_2_fastqc.html
├── sample42_R2_val_2_fastqc.zip
└── sample42_R2_val_2.fq.gz

Step 3. Read Mapping

Description

Reads can either be mapped to the reference if you have a reference genome or transcriptome. If there is no reference sequence, assembly should be done. Hisat2 was used for alignment and Kallisto for Pseudoalignment. For the alignment step, the gene counts are summarized by Subreads featureCounts. For pseudoalignment, counts have already been generated thus there is no need to regenerate them.

Classical alignment based

Installation
conda install -c bioconda hisat2 --yes
conda install -c bioconda subread --yes
Command
#Defining the number of threads to be used
threads=20
#Alignment of sequences to the reference genome using 'HISAT2'

echo -e "\n Now Running Classical Alignment-based with HISAT2... \n"

mkdir -p hisat2

# Download the human reference genome
wget -c ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz -O hisat2/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip  hisat2/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz 

# Download the human annotation file
wget -c ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz -O hisat2/Homo_sapiens.GRCh38.100.gtf.gz
gunzip  hisat2/Homo_sapiens.GRCh38.100.gtf.gz 

#HISAT2 indexing to build the reference genome index 
hisat2-build -p $threads hisat2/Homo_sapiens.GRCh38.dna.primary_assembly.fa hisat2/Homo_sapiens.GRCh38v3_hisat2.idx

#HISAT2 Alignment
for sample in `cat sample_id.txt`
do 
	hisat2 -p $threads -x hisat2/Homo_sapiens.GRCh38v3_hisat2.idx  \
		-1 Trim_galore/${sample}_R1_val_1.fq.gz  -2 Trim_galore/${sample}_R2_val_2.fq.gz  \
		-S   hisat2/${sample}_hisat2.sam

        #compress the sam files to binary format, sort them and index them."
	samtools view -Sb hisat2/${sample}_hisat2.sam  | samtools sort  > hisat2/${sample}_hisat2_sorted.bam 
        samtools index hisat2/${sample}_hisat2_sorted.bam 

	rm  hisat2/${sample}_hisat2.sam #remove the .sam files for storage purposes
done
#-x >The basename of the index for the reference genome.
#-1 > the foward reads of the samples
#-2 > the reverse reads of all the samples
#-S >File to write SAM alignments to, which by default its written to "stdout" or "standard out"


#Quantifying Aligned reads using the Subread package's script 'featureCounts'

featureCounts -T $threads -a hisat2/Homo_sapiens.GRCh38.100.gtf \
	-o hisat2/hisat2_counts.txt \
	hisat2/sample{37..42}_hisat2_sorted.bam

# -T > the number of cores to use
# -a > the annotation file
# -o > the output directory
#then lastly the input files

#Generating analysis report with multiQC
multiqc hisat2 -o multiQCreports
Output
├── hisat2_counts.txt
├── hisat2_counts.txt.summary
├── Homo_sapiens.GRCh38v3_hisat2.idx.1.ht2
├── Homo_sapiens.GRCh38v3_hisat2.idx.2.ht2
├── Homo_sapiens.GRCh38v3_hisat2.idx.3.ht2
├── Homo_sapiens.GRCh38v3_hisat2.idx.4.ht2
├── Homo_sapiens.GRCh38v3_hisat2.idx.5.ht2
├── Homo_sapiens.GRCh38v3_hisat2.idx.6.ht2
├── Homo_sapiens.GRCh38v3_hisat2.idx.7.ht2
├── Homo_sapiens.GRCh38v3_hisat2.idx.8.ht2
├── makeidx.done
├── multiqc_data
│   ├── multiqc_data.json
│   ├── multiqc_featureCounts.txt
│   ├── multiqc_general_stats.txt
│   ├── multiqc.log
│   └── multiqc_sources.txt
├── multiqc_report.html
├── sample37_hisat2_sorted.bam
├── sample37_hisat2_sorted.bam.bai
├── sample38_hisat2_sorted.bam
├── sample38_hisat2_sorted.bam.bai
├── sample39_hisat2_sorted.bam
├── sample39_hisat2_sorted.bam.bai
├── sample40_hisat2_sorted.bam
├── sample40_hisat2_sorted.bam.bai
├── sample41_hisat2_sorted.bam
├── sample41_hisat2_sorted.bam.bai
├── sample42_hisat2_sorted.bam
└── sample42_hisat2_sorted.bam.bai

Pseudo alignment approach

Installation
conda install -c bioconda kallisto --yes
Command
# Aligning reads to the transcriptome using 'kallisto'

echo -e "\n Now Running Kmer-based “Pseudo-alignment” with Kallisto... \n"

mkdir -p Kallisto

# Download the human transcriptome reference
wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.transcripts.fa.gz -O Kallisto/gencode.v34.transcripts.fa.gz

#Build the transcriptome index file
kallisto index Kallisto/gencode.v34.transcripts.fa.gz -i Kallisto/kallisto_index

for sample in `cat sample_id.txt`
do
	kallisto quant -t $threads -b 100 \
		-i Kallisto/kallisto_index* -o Kallisto/${sample}   \
		Trim_galore/${sample}_R1_val_1.fq.gz Trim_galore/${sample}_R2_val_2.fq.gz
done
#-t > the number of threads to be used.
#-b > Number of bootstrap samples which by default is zero
#-i > Filename for the kallisto index to be used for quantification
#-o > the output directory
# then the input files from Trim_galore
Output
├── kallisto_index
├── sample37
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
├── sample38
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
├── sample39
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
├── sample40
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
├── sample41
│   ├── abundance.h5
│   ├── abundance.tsv
│   └── run_info.json
└── sample42
    ├── abundance.h5
    ├── abundance.tsv
    └── run_info.json
echo -e "\n That's All folks!!! Go ahead with the Statistical Analyses in R \n"

References