We present a novel RNA sequencing (whole-transcriptome) analysis pipeline that quantifies gene- and transcript-level counts from raw RNA sequencing reads by aligning them to the target organism's genome. The pipeline leverages Linux, Python, and R packages to identify transcript packaging. It integrates tools from these ecosystems for quality control, data pre-processing, and differential gene analysis, providing a comprehensive framework for RNA packaging studies. We demonstrate the pipeline's performance and applicability on datasets from different organisms and biofluids.
Nextflow is a workflow management system for writing scalable, reproducible pipelines that automate analysis. It simplifies the execution of complex tasks across computing platforms and environments. Singularity is a container manager that runs applications in isolated, portable environments, helping to avoid dependency conflicts and ensuring reproducible results. Anaconda is a software distribution that provides a large repertoire of tools and libraries for data analysis, including sequencing-specific packages, and also allows the creation and management of virtual environments for different projects. Together, these tools provide a flexible, robust, and efficient framework for RNA sequencing analysis.
We suggest using Anaconda, a distribution of packages built for data science. Anaconda ships with conda, a package and environment manager that can create isolated environments for projects that use different versions of Python and its packages.
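As a sketch of how conda isolates projects (the environment name rnaseq-env and the Python version below are arbitrary examples, not requirements of the pipeline):

```shell
# Create an isolated environment with its own Python interpreter
conda create -n rnaseq-env python=3.10

# Switch into the environment; packages installed now stay inside it
conda activate rnaseq-env

# Leave the environment when finished
conda deactivate
```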
- Download the Anaconda installer for Linux from the official website.
- Open a terminal window and navigate to the directory where you downloaded the installer.
- Run the following command to add executable permission to the installer:
chmod +x Anaconda3-YOUR_VERSION-Linux-x86_64.sh
- Run the installer:
./Anaconda3-YOUR_VERSION-Linux-x86_64.sh
- Follow the prompts on the installer screens.
- Once installation is complete, you can start using Anaconda by opening a new terminal window.
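A quick way to confirm the installation succeeded is to query conda itself (the exact output will vary with the version you downloaded):

```shell
conda --version      # prints the installed conda version
conda info --envs    # lists environments; a fresh install shows only "base"
```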
Before installing Nextflow, you will need to make sure that Java version 11 or greater is installed on your machine. You can check your Java version by running the following command in your terminal window:
java -version
- Download the Nextflow executable by copying and pasting one of the following commands into your terminal window:
wget -qO- https://get.nextflow.io | bash
or
curl -s https://get.nextflow.io | bash
- Make the binary executable on your system by running:
chmod +x nextflow
- Optionally, move the nextflow file to a directory accessible by your $PATH:
sudo mv nextflow /usr/local/bin/
- If you want to install Nextflow through Anaconda, you can do so by running:
conda create -n nextflow -c bioconda nextflow
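Whichever route you choose, you can confirm Nextflow is available before proceeding (if installed via conda, activate the environment first):

```shell
conda activate nextflow
nextflow -version    # prints the Nextflow version and build information
```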
Singularity is a free and open-source program that performs operating-system-level virtualisation, also known as containerisation. One of its main uses is to bring containers and reproducibility to scientific computing and the high-performance computing (HPC) world; it is a container framework designed to run scientific applications on HPC-based resources.
Singularity is chosen by many HPC centres as their primary container software because it allows users to pack an application, workflow, or pipeline and all of its dependencies into a single image file, which can be easily transported between different HPC systems. Furthermore, Singularity assumes that the user does not have root privileges on the host operating system (OS), making it more secure than other containerisation technologies.
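As a minimal illustration of the single-image model (the alpine image below is just a small public example, not part of the pipeline):

```shell
# Pull a public image from Docker Hub; Singularity converts it into a
# single .sif file that can be copied between systems
singularity pull alpine.sif docker://alpine:3.18

# Run a command inside the container without root privileges
singularity exec alpine.sif cat /etc/alpine-release
```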
- Install dependencies:
sudo apt-get update
sudo apt-get install -y build-essential libssl-dev uuid-dev libgpgme11-dev squashfs-tools libseccomp-dev wget pkg-config git cryptsetup-bin
- Install Go:
- Download the Go binary for Linux from the site go.dev by clicking on Download Go.
- Extract the Golang binaries tarball to a directory of your choice using the tar command:
tar -C /usr/local -xzf go$VERSION.$OS-$ARCH.tar.gz
- Add the Go binary directory to your $PATH:
export PATH=$PATH:/usr/local/go/bin
- Verify the Go installation:
go version
- Download Singularity from a release:
VERSION=3.8.0
wget https://github.com/hpcng/singularity/releases/download/v${VERSION}/singularity-${VERSION}.tar.gz
tar -xzf singularity-${VERSION}.tar.gz
cd singularity-${VERSION}
- Compile Singularity:
./mconfig --prefix=/usr/local
make -C builddir
sudo make -C builddir install
- Verify that Singularity is installed correctly:
singularity --version
Alternatively, if you are using CAIR, you can enter the following commands to activate Singularity:
- Load the Go module:
module load application/go/1.14.2
- Load the Singularity module:
module load application/singularity/3.5.3
*NOTE: The only version of Singularity available on CAIR is 3.5.3; this should not cause issues with the execution of the pipeline.
conda activate nextflow
nextflow run /research/project/shared/benoukraf_lab/nextflow/microbiome_pipeline/main.nf -profile chia,ont,conda --input fastq/00811_8.fastq --target_index "/research/project/shared/benoukraf_lab/pathoscope/metascope/index/refseq_all/ont/*-ont.mmi" --filter_index /research/project/shared/benoukraf_lab/pathoscope/metascope/index/human/ont/hg38-ont.mmi
conda activate nextflow
nextflow run Matthew-Dyer792/microbiome_pipeline -profile chia,ont,conda --conda_cacheDir /research/project/shared/benoukraf_lab/matthew/.conda_cacheDir --input fastq/00811_8.fastq --sequence_summary "summary/*.txt" --target_index "/research/project/shared/benoukraf_lab/pathoscope/metascope/index/refseq_all/ont/*-ont.mmi" --filter_index /research/project/shared/benoukraf_lab/pathoscope/metascope/index/human/ont/hg38-ont.mmi --ncbi_key t28545asd34jjhgjg2342
This is the path to the folder containing the Microbiome Pipeline. It can be either a local path, e.g. "/research/project/shared/benoukraf_lab/nextflow/microbiome_pipeline", or a GitHub repository, e.g. "Matthew-Dyer792/microbiome_pipeline".
run Matthew-Dyer792/microbiome_pipeline
These are preset configurations that allow for standardized execution of the pipeline. For example, there are profiles for conda vs. singularity, chia vs. local execution, and Oxford Nanopore vs. Illumina reads.
-profile illumina
or
-profile ont,conda,chia
There are two workflow options in this pipeline: the primary workflow runs the metascope package on Illumina reads, while the secondary workflow recapitulates metascope for Oxford Nanopore long reads. They can be selected with either illumina or ont. Note: Using the ont profile sets the workflow value to ont by default.
--workflow illumina
By default, Nextflow stores the conda environments in the "work" directory, which is usually discarded upon successful completion of the pipeline. Therefore, if you want the conda directories to persist, an external directory must be supplied. Note: If you intend to run this pipeline multiple times, this will save the time and hard drive space of duplicate installations.
--conda_cacheDir /research/project/conda
The primary input of the pipeline is fastq files. They can be provided as a glob pattern, e.g. "/data/big/*.fq", or, for paired-end reads, "/data/big/file_{1,2}.fq". Note: the path must be enclosed in double quotes in order to be properly interpreted.
--input "/research/project/fastq/*_{1,2}.fq"
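To see which files such a pattern captures, you can mimic it with ordinary shell globbing (the scratch directory and sample names below are made up for illustration):

```shell
# Create a scratch directory with example paired-end read files
mkdir -p /tmp/glob_demo
cd /tmp/glob_demo
touch sampleA_1.fq sampleA_2.fq sampleB_1.fq sampleB_2.fq notes.txt

# The pattern "*_{1,2}.fq" matches the four read files and skips
# notes.txt, i.e. the pairs the pipeline would group together
ls *_{1,2}.fq
```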
By default, the pipeline assumes that paired-end reads are provided. If single-end reads are used, set this flag to "true". Note: Using the ont profile sets single_end to true by default.
--single_end true
The sequencing_summary.txt files created alongside the Oxford Nanopore sequencing files; these are required for pycoQC. They can be provided as a glob pattern, e.g. "/data/big/*.txt". Note: the path must be enclosed in double quotes in order to be properly interpreted.
--sequence_summary "/research/project/summary/*.txt"
The genomes of the microbes you aim to detect in your sample, indexed by your aligner of choice (bowtie2, minimap2, bwa, ...). They can be provided as a glob pattern, e.g. "/data/big/*.mmi". Note: the path must be enclosed in double quotes in order to be properly interpreted.
--target_index "/research/project/target_genomes/*-ont.mmi"
The genomes of the host organism (or other confounding organisms) you aim to remove from your sample, indexed by your aligner of choice (bowtie2, minimap2, bwa, ...). They can be provided as a glob pattern, e.g. "/data/big/*.mmi". Note: the path must be enclosed in double quotes in order to be properly interpreted.
--filter_index "/research/project/filter_genomes/*-ont.mmi"
NCBI Entrez API key (optional). Due to the high number of requests made to NCBI, the ID function will be less prone to errors if you obtain an NCBI key. Enter the key string from your account.
--ncbi_key t28545asd34jjhgjg2342
A built-in Nextflow feature that allows resumption of the pipeline using cached versions of the previously completed steps. Note: the "work" directory created by the pipeline must be present for this to function.
-resume
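For example, an interrupted run can be picked up again by re-issuing the same command with -resume appended (flags shown are from the earlier example invocation):

```shell
nextflow run Matthew-Dyer792/microbiome_pipeline -profile chia,ont,conda --input fastq/00811_8.fastq -resume
```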