Tapioca is a post-processing pipeline for Illumina Genome Analyzer/HiSeq data produced by Casava 1.8. Main features:
- contaminant filtering
- fastq statistical summary
- collating/binning of casava chunks
(Why "tapioca"? "In Brazil, the plant (cassava) is named 'mandioca', while its starch is called 'tapioca'." https://en.wikipedia.org/wiki/Tapioca)
- Casava 1.8 http://support.illumina.com/sequencing/sequencing_software/casava.ilmn
- Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2/
- fqutils https://github.com/crowja/fqutils
- tpipe http://www.eurogaran.com/index.php/es/component/remository/tpipe/ (also see Unix Power Tools http://shop.oreilly.com/product/9780596003302.do)
- Make, gzip, and other standard Linux utilities
- Perl5 http://perl.org
- Perl modules (many of which are already in your Perl distribution): Bio::SeqReader::Fastq; Cwd; File::Basename; File::Path; File::Spec; File::Which; Getopt::Long; IO::File; IO::Uncompress::AnyUncompress; IO::Uncompress::Gunzip; Readonly; Term::ANSIColor; XML::Simple
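A quick way to see which of these modules are missing is to try loading each one. The following is a minimal sketch, not part of Tapioca itself; the function name `check_perl_modules` is made up for this example.

```shell
# Report which Perl modules cannot be loaded by the perl found on PATH.
# Prints one "missing: <module>" line per module that fails to load.
# check_perl_modules is a hypothetical helper, not shipped with Tapioca.
check_perl_modules() {
    for mod in "$@"; do
        # "perl -M<module> -e 1" exits non-zero when the module cannot be loaded
        if ! perl -M"$mod" -e 1 >/dev/null 2>&1; then
            echo "missing: $mod"
        fi
    done
}

# Usage:
# check_perl_modules Bio::SeqReader::Fastq File::Which Readonly XML::Simple
```

No output means every module given on the command line loaded successfully.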
In addition to the software dependencies, you'll need
- A directory containing your Illumina sequencing instrument output.
- Two Bowtie2 libraries for contaminant filtering. We created one called 'phix' and one called 'other' (for adapters and primers).
- Make a new directory for the Casava and Tapioca output. Don't work in the instrument's output directory.
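If the contaminant libraries are ordinary Bowtie2 indexes built with `bowtie2-build <reference.fa> <prefix>` (an assumption; the reference FASTA file names are up to you), each prefix should have six `.bt2` files next to it. A minimal sketch of a check for that, using a made-up helper name `have_bt2_index`; it only covers small indexes, since large indexes use the `.bt2l` suffix instead:

```shell
# Return 0 if all six files of a small Bowtie2 index exist for the given prefix.
# have_bt2_index is a hypothetical helper, not part of Tapioca or Bowtie2.
have_bt2_index() {
    prefix=$1
    for suffix in 1 2 3 4 rev.1 rev.2; do
        if [ ! -f "$prefix.$suffix.bt2" ]; then
            echo "missing: $prefix.$suffix.bt2" >&2
            return 1
        fi
    done
    return 0
}

# Usage (phix.fa and the paths are placeholders):
# bowtie2-build phix.fa /your/contam_libs/tapioca_phix_contam
# have_bt2_index /your/contam_libs/tapioca_phix_contam && echo "phix index ok"
```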
mkdir tap-work
cd tap-work
- Create the file samplesheet.csv, either with Illumina's Experiment Manager software or with a script that pulls the data from your internal LIMS. The samplesheet.csv format is described in Illumina's documentation.
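For orientation, here is a sketch of what a one-sample sheet might look like, plus a basic consistency check. The ten column names are given as we understand the Casava 1.8 format; treat Illumina's documentation as the authority. Every value in the data row is made up, and both function names are hypothetical.

```shell
# Write a one-sample example sheet to the given path. All values are invented.
write_example_samplesheet() {
    cat > "$1" <<'EOF'
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
FLOWCELL01,1,sampleA,unknown,ATCACG,example,N,NA,operator,ProjectX
EOF
}

# Report rows whose comma-separated field count differs from the header's.
# Exits non-zero if any such row is found. Assumes no quoted commas in fields.
check_samplesheet() {
    awk -F, 'NR == 1 { n = NF; next }
             NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
             END     { exit bad }' "$1"
}

# Usage:
# write_example_samplesheet samplesheet.csv
# check_samplesheet samplesheet.csv && echo "field counts are consistent"
```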
- Run Casava 1.8 to generate an Unaligned/ directory and makefile. Example:
configureBclToFastq.pl \
--input-dir /your/instrument/output/run_flowcell/Data/Intensities/BaseCalls/ \
--output-dir ./Unaligned \
--sample-sheet samplesheet.csv \
--with-failed-reads
Note: we recommend the --with-failed-reads option; Tapioca will later separate the reads that failed the chastity filter into their own file. See the Casava user's guide for other options, e.g. --use-bases-mask.
- Start Casava by changing into the Unaligned directory and running make.
cd Unaligned
make
# or make -j [cores]
- After the Casava make finishes, configure Tapioca by running tap_configure_postprocessing. The last parameter is the Unaligned directory created by Casava 1.8. Like Casava, Tapioca uses Make for dependency tracking and job parallelism, so the output of the configuration script is a makefile.
cd ..
export PATH=/your/tapioca/bin:$PATH
tap_configure_postprocessing \
--contam-phix-index /your/contam_libs/tapioca_phix_contam \
--contam-phix-pct 80 \
--contam-other-index /your/contam_libs/tapioca_other_contam \
--contam-other-pct 20 \
--deployed /your/deployed/dir \
./Unaligned
- The makefile has now been created. First run the precheck target; it does some sanity checking on the Casava run and prints warnings if it notices anything obviously wrong.
make precheck
- Making the 'all' target will perform the contaminant filtering and summary reporting. Technically it is not 'all' because the deploy step is a separate target.
make all
# or make -j [cores] all
make -j 16 runs up to 16 jobs in parallel, which will use up to 16 cores. Alternatively, the qmake script could be submitted to an SGE cluster if more parallelism is required. Qmake job submission has not been tested.
- Now check the results as necessary in the various ./Project directories. Run make deploy when ready.
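Rather than hard-coding the job count, it can be derived from the machine. A minimal sketch, assuming GNU coreutils `nproc` or `getconf` is available; `core_count` is a made-up helper name.

```shell
# Print the number of online CPU cores, falling back to 1 if it cannot be determined.
# core_count is a hypothetical helper, not part of Tapioca.
core_count() {
    n=$(nproc 2>/dev/null) || n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=1
    echo "$n"
}

# Usage:
# make -j "$(core_count)" all
```

On a shared machine you may prefer a smaller fixed number so the run does not take every core.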
make deploy
The deploy target collates all the chunks of data from the casava output into the --deployed directory. You could add more targets to the makefile to perform additional processing after the deploy is finished.
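As one illustration of a post-deploy step, the sketch below writes an MD5 manifest for everything under the deployed directory. This is purely an example of the kind of processing you might add, not something Tapioca does; the function name `write_manifest` and the manifest name `MD5SUMS` are arbitrary choices, and it assumes GNU `md5sum`.

```shell
# Write an MD5 manifest (MD5SUMS) covering every file under a deployed directory.
# write_manifest is a hypothetical helper, not part of Tapioca.
write_manifest() {
    deployed=$1
    (
        cd "$deployed" || exit 1
        # Skip the manifest itself so it does not list its own checksum
        find . -type f ! -name MD5SUMS -exec md5sum {} + > MD5SUMS
    )
}

# Usage:
# make deploy && write_manifest /your/deployed/dir
# (cd /your/deployed/dir && md5sum -c MD5SUMS)   # verify later
```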
Look in the Deployed directory. The organization by subdirectory should be fairly self-explanatory. Sorry this is not better documented.
There is no 'make clean' target. Please be aware that the intermediate Project directories created by Tapioca contain uncompressed fastq files, so they should not be left on disk long term. Delete the directories yourself.
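Before deleting anything, it can help to see what is actually there. A minimal sketch that lists uncompressed fastq files below a directory; the helper name `list_uncompressed_fastq` and the `Project*` glob in the usage lines are assumptions about naming, so adjust them to what you see in your working directory.

```shell
# List uncompressed FASTQ files below a directory, one path per line.
# list_uncompressed_fastq is a hypothetical helper, not part of Tapioca.
list_uncompressed_fastq() {
    find "$1" -type f \( -name '*.fastq' -o -name '*.fq' \)
}

# Usage, from the tap-work directory, after make deploy has finished
# and the deployed output has been checked:
# for d in ./Project*; do list_uncompressed_fastq "$d"; done
# du -sh ./Project*     # disk space used by each directory
# rm -r ./Project*      # only once you are sure nothing else is needed
```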
John Crow https://github.com/crowja, Alex Rice
# Tapioca
# Copyright (C) 2013 National Center for Genome Resources - http://ncgr.org
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.