Builds a transcriptome independent of a reference genome using Oxford Nanopore Technologies and optionally Illumina RNA-seq data (recommended)
The easiest way to install CONDUIT on macOS and Linux is via Bioconda:
conda install -c bioconda conduit-assembler
CONDUIT is built in Nim (a statically typed, compiled systems programming language with Python-like syntax), so building it from source requires a Nim installation. Installation instructions are available on the official Nim website.
CONDUIT also uses the following libraries:
- htslib, a C library for interfacing with common bioinformatics file formats.
- poaV2, the necessary files of which are distributed in the CONDUIT GitHub and need not be downloaded separately.
- threadpools, which provides instance threadpools in Nim. The necessary files are included as a submodule in CONDUIT and need not be downloaded separately if the conduit repo is cloned with the --recursive flag.
- nim-hts, which wraps htslib in Nim and will be installed by nimble automatically if you are missing it.
And the following tools must be installed:
- samtools
- bowtie2
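Before building, it can help to confirm that the external tools are actually on the PATH. The following is a minimal sketch and not part of CONDUIT: the function name check_tools is hypothetical, and the tool list is simply taken from the requirements above plus git and nimble, which the build commands use.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with CONDUIT): report which of the
# given commands cannot be found on the PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in the requirements above; git and nimble are used by the build.
if check_tools git nim nimble samtools bowtie2; then
    echo "all required tools found"
else
    echo "install the tools listed above before building"
fi
```

The check only confirms that each command exists; it does not verify versions.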
Once all the required libraries and tools are installed, building CONDUIT can be done in the following manner:
git clone --recursive https://github.com/NatPRoach/conduit.git
cd conduit
nimble install -y --verbose
This should result in a conduit binary, which can then be used.
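A quick way to confirm the build is to locate the binary and ask for its version. This is a sketch under assumptions: find_conduit is a hypothetical helper, and ~/.nimble/bin is assumed to be where nimble places installed binaries on your system (the location may differ if nimble is configured otherwise).

```shell
#!/bin/sh
# Hypothetical helper: print the location of the conduit binary, checking the
# PATH first and then ~/.nimble/bin (assumed nimble binary directory).
find_conduit() {
    if command -v conduit >/dev/null 2>&1; then
        command -v conduit
    elif [ -x "$HOME/.nimble/bin/conduit" ]; then
        echo "$HOME/.nimble/bin/conduit"
    else
        return 1
    fi
}

if conduit_bin=$(find_conduit); then
    "$conduit_bin" --version   # --version is listed in the usage statement
else
    echo "conduit not found; check the nimble install output" >&2
fi
```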
CONDUIT requires as input reads clustered at the gene level, with one gene-level cluster per FASTA or FASTQ file, all located in the same directory. For this purpose we recommend RATTLE gene-level clustering, which outperforms minimizer-based clustering per the RATTLE preprint.
Therefore, if one has a dRNAseq FASTQ file nano_reads.fastq, first install RATTLE and run its clustering step:
path/to/rattle/binary/rattle cluster -i path/to/nano_reads.fastq --rna --fastq -o path/to/gene/clusters/
Note that reads clustered by RATTLE must be longer than the k used in the clustering, so we recommend keeping only reads of at least 150 bp. This can be done with the command:
awk 'BEGIN{OFS="\n"} {header = $0 ; getline seq; getline qheader; getline qseq; if (length(seq) >= 150) {print header,seq,qheader,qseq}}' nano_reads.fq > nano_reads.filtered.fq
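To see what the length filter does, the sketch below runs the same awk logic on a made-up two-read FASTQ file. The read names, the temporary directory, and the min_len variable are illustrative assumptions, and the filter assumes every record spans exactly four lines (no wrapped sequences).

```shell
#!/bin/sh
# Build a tiny two-record FASTQ in a temporary directory (made-up reads).
workdir=$(mktemp -d)
long_seq=$(awk 'BEGIN { for (i = 0; i < 150; i++) printf "A" }')
long_qual=$(awk 'BEGIN { for (i = 0; i < 150; i++) printf "I" }')
printf '@short_read\nACGT\n+\nIIII\n@long_read\n%s\n+\n%s\n' \
    "$long_seq" "$long_qual" > "$workdir/toy.fq"

# Same filter as above, with the threshold passed in as min_len.
awk -v min_len=150 'BEGIN { OFS = "\n" }
{
    header = $0; getline seq; getline qheader; getline qseq
    if (length(seq) >= min_len) print header, seq, qheader, qseq
}' "$workdir/toy.fq" > "$workdir/toy.filtered.fq"

# Count surviving reads (4 lines per record) and show which read was kept.
kept=$(( $(wc -l < "$workdir/toy.filtered.fq") / 4 ))
first_header=$(head -n 1 "$workdir/toy.filtered.fq")
echo "reads kept: $kept ($first_header)"
rm -rf "$workdir"
```

Only the 150 bp read passes; the 4 bp read is dropped.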
RATTLE clustering outputs a binary file, clusters.out, summarizing the clusters extracted. At the moment, CONDUIT requires these clusters to be extracted from clusters.out into one FASTQ or FASTA file per gene cluster, all located in the same directory. Because CONDUIT can in theory polish sufficiently accurate single-read clusters, -m can be set as low as 1. Higher -m values will result in more stringent and accurate final clusters, though hybrid --stringent mode should make up for the lost stringency, so we recommend -m 1 when running in hybrid --stringent mode:
path/to/rattle/binary/rattle extract_clusters -i path/to/nano_reads.fastq -c path/to/gene/clusters/clusters.out --fastq -m 1 -o path/to/gene/clusters/
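After extraction, a quick sanity check is to list how many reads each cluster file holds. The sketch below is a hypothetical helper, not part of RATTLE or CONDUIT. It assumes the extracted files end in .fq or .fastq (the extensions CONDUIT accepts) and that each FASTQ record spans exactly four lines.

```shell
#!/bin/sh
# Hypothetical helper: print "<file><TAB><read count>" for every FASTQ file
# in a cluster directory, assuming four lines per FASTQ record.
count_cluster_reads() {
    cluster_dir=$1
    for f in "$cluster_dir"/*.fq "$cluster_dir"/*.fastq; do
        [ -e "$f" ] || continue          # skip glob patterns that matched nothing
        printf '%s\t%d\n' "$(basename "$f")" "$(( $(wc -l < "$f") / 4 ))"
    done
}

# Example usage (path is a placeholder, as in the commands above):
# count_cluster_reads path/to/gene/clusters/
```

With -m 1, many clusters are expected to report a single read.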
Once clusters have been extracted by RATTLE, CONDUIT is ready to run. If you only have nanopore data, you can run in nano mode, though hybrid mode is strongly encouraged because transcripts extracted in nano mode will still have a reasonably high level of error.
An example CONDUIT command is provided below. For more advanced usage, check the --help output or the usage statement below:
path/to/conduit/binary/conduit hybrid -o path/to/outdir/ --tmp-dir path/to/tmpdir/ path/to/gene/clusters/ -1 path/to/illumina/mate_1.fastq.gz -2 path/to/illumina/mate_2.fastq.gz
Both RATTLE gene-level clustering and CONDUIT consensus extraction work better if reads are trimmed of adapters before use. For cDNA, porechop is a good adapter trimming tool (though it is no longer supported). For dRNA we've had success with Poreplex; however, this tool is based on the Albacore basecaller, not the most recent guppy basecallers. Up-to-date ont-guppy basecallers have built-in adapter trimming and should suffice for this purpose as well, for both cDNA and dRNA reads.
CONDUIT - CONsensus Decomposition Utility In Transcriptome-assembly:
CONDUIT Version 0.1.2 by Nathan Roach ( https://github.com/NatPRoach/conduit/ )
Usage:
./conduit hybrid [options] <clusters_directory> {-1 <m1> -2 <m2> | -U <r> | --interleaved <i> | -b <bam>}
<clusters_directory> Directory containing the .fasta/.fa or .fastq/.fq files of reads separated by gene cluster
NOTE: .gz support coming for nanopore scaffold data, but is not an option at this time
Illumina data is aligned with Bowtie2, therefore Illumina data is provided in the same format as Bowtie2, namely:
<m1> Files with #1 mates, paired with files in <m2>
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<m2> Files with #2 mates, paired with files in <m1>
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<r> Files with unpaired reads
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<i> File with interleaved paired-end FASTQ/FASTA reads
Could be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2).
<bam> Files are unaligned BAM sorted by read name.
<m1>, <m2>, <r> can be comma-separated lists (no whitespace) and can be specified many times.
E.g. '-U file1.fq,file2.fq -U file3.fq'.
Options (defaults in parentheses):
Scaffold Type:
--drna (default)
Scaffold reads are stranded forward relative to coding strand, and may contain U characters instead of Ts
--cdna-rev-stranded
Scaffold reads are stranded reverse complemented relative to coding strand
--cdna
Scaffold reads are NOT stranded
--sfq (default)
Scaffold reads are in FASTQ format
--sfa
Scaffold reads are in FASTA format
Illumina Type:
-u, --unstranded
Illumina reads are unstranded
-f, --fwd-stranded
Illumina reads are stranded s.t. the first mate originates from the RNA strand
Ignored if scaffold reads are not stranded
-r, --rev-stranded (default)
Illumina reads are stranded s.t. the first mate is the reverse complement of the RNA strand
Ignored if scaffold reads are not stranded
--ifq (default)
Illumina reads are in FASTQ format; Mutually exclusive with --ifa
--ifa
Illumina reads are in FASTA format; Mutually exclusive with --ifq
Consensus Collapsing:
-m, --score-matrix <path>
Provide an alternative scoring matrix to use in partial order alignment
Example formatting for the score matrix can be found at poaV2/myNUC3.4.4.mat
-d, --isoform-delta (35)
Maximum indel size to be 'corrected', beyond this size a new isoform is declared. Must be between 0 and 255
-e, --ends-delta (35)
Maximum size at the ends of isoforms to 'correct' before splitting. Must be between 0 and 255
-i, --max-iterations (5)
Maximum number of iterations to align to and correct scaffolds. Does not include optional final polishing step
Note: Providing a value of 0 will not perform any graph based illumina correction
-w, --illumina-weight (10)
Weight of illumina reads relative to nanopore reads when generating consensus
--final-polish (default)
Include a final correction of individual isoforms, not in a splice graph
--no-final-polish
Do not do a final correction of individual isoforms, not in a splice graph
Output:
-o, --output-dir <path> (conduit/)
<path> where corrected clusters will be written
NOTE: THIS WILL OVERWRITE EXISTING FILES!
-n, --no-intermediates (default)
Does not save FASTA file generated for intermediate rounds of polishing
-s, --save-intermediates
Saves the FASTA file generated for intermediate rounds of polishing
Bowtie2:
--end-to-end (default)
Align Illumina reads to ONT scaffolds in end-to-end alignment mode
--local
Align Illumina reads to ONT scaffolds in local alignment mode
Miscellaneous:
-h, --help
Display this help message and exit
-v, --version
Display the installed version number of CONDUIT and exit
--tmp-dir <path> (conduit-tmp/)
<path> where temporary files will be created
-t, --threads (4)
Number of threads to run in parallel (used for both Bowtie2 and Partial Order Graph correction)
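As an illustration of the usage statement, the sketch below assembles a hybrid command for Illumina data split across two lanes, using the comma-separated list form accepted by -1 and -2. Everything here is an assumption for demonstration: the lane file names are placeholders, join_commas and in_delta_range are hypothetical helpers, and the command is only printed (a dry run), not executed.

```shell
#!/bin/sh
# Join arguments with commas (no whitespace), the list form the usage
# statement describes for -1, -2 and -U.
join_commas() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Hypothetical check for -d/-e, which must be between 0 and 255.
in_delta_range() {
    [ "$1" -ge 0 ] 2>/dev/null && [ "$1" -le 255 ]
}

# Placeholder file names for two sequencing lanes (not real data).
mates_1=$(join_commas lane1_R1.fastq.gz lane2_R1.fastq.gz)
mates_2=$(join_commas lane1_R2.fastq.gz lane2_R2.fastq.gz)
isoform_delta=35

if in_delta_range "$isoform_delta"; then
    # Dry run: print the command instead of executing it.
    echo conduit hybrid -d "$isoform_delta" -t 8 \
        -o path/to/outdir/ --tmp-dir path/to/tmpdir/ \
        path/to/gene/clusters/ -1 "$mates_1" -2 "$mates_2"
else
    echo "isoform delta must be between 0 and 255" >&2
fi
```

Removing the echo would run the command for real; because the output directory is overwritten, check the -o path before doing so.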