This pipeline was created as a way to integrate multiple methods and tools for analysis of clinical data - specifically in regards to analysing COVID19-related data. This includes mapping RNA-seq reads to a reference genome, carrying out variant calling (on RNA-seq data) and running circular RNA and micro-RNA prediction. It utilises Nextflow in order to carry out most of the process in parallel and on demand when the required input files are available, reducing the run time dramatically.
Provided additionally is a Guix manifest file, which allows for running temporary containers via guix shell
or also generating singularity or docker images via the guix pack --format=docker
and guix pack --format=squashfs
commands.
To download the pipeline, run
git clone https://github.com/dmgie/ClInt.git
(please note the capital "I" in case errors arise during git clone
)
Please also ensure that Nextflow
has been installed on the system.
Afterwards, the pipeline can be run using nextflow run ./main.nf <...>
where <...>
are various parameters outlined in Instructions and Usage
below.
The pipeline utilises the following programs
- CircRNA: DCC (circtools)
- MiRNA: miRanda, TargetScan
- Mapping: STAR, hisat
- Variant Calling: GATK
- Others: Nextflow, Singularity
Nextflow - This can be installed from their website: (here)[https://www.nextflow.io/]
Singularity - The pipeline utilises various containers in order to supply the necessary programs. These are handled within the scripts (and each process) via Singularity
containers.
To use the pipeline, all that is needed is to provide the necessary files in the command line when calling the Nextflow command. An example is given below.
nextflow run main.nf --reference_file ../ref_genome.fna --input_dir ../reads_data/ --gff_file ../ref_genome.gff --output_dir ./results -profile singularity
These parameters can also be given in the form a YAML/JSON formatted nextflow configuration file and provided with the -params-file
parameter (more information in (Nextflow's Documentation)[https://www.nextflow.io/docs/latest/config.html]).
The output of the pipeline will then be provided in the foldering given to the --output_dir
parameter.
Instead of providing an input folder of reads, it is also possible to provide a samplesheet via the --samplesheet
parameter. This is formatted as a CSV in 3 columns with "sample,fastq_1,fastq_2" as the headers, where sample
is an identifier for a set of paired-end reads.
Running inside Singularity:
All that is required to run the pipeline, is to provide either a local path to the main "ClInt.sif" singularity container, or a link to a library which hosts the container. To do this, inside the nextflow.config
file, modify the singularity profile - specifically the container
variable inside the process
block. Either provide a relative (file://<relative_path>
), absolute path (file:///<absolute_path>
) or a link. If given a link, it will download the containers into a temporary/cache directory, defined by the system's SINGULARITY_CACHEDIR
environment variable.
- [] Submit singularity image to a accessible repo to avoid need of building container
- [] The current implementation of the circRNA section of the pipeline is still incomplete. It has been heavily adapted from
nf-core/circRNA
in its current state. It is possible to run the circRNA detection and miRNA prediction steps separately using the output of the mapping step of the pipeline.