C++ interface to HTSlib, BWA-MEM and Fermi
License: Apache2
If you use SeqLib in your applications, please cite: http://bioinformatics.oxfordjournals.org/content/early/2016/12/21/bioinformatics.btw741.full.pdf+html
Note that the values for the SeqAn benchmarking in Table 2 should be corrected to 7.7 Gb memory and 33.92 seconds in CPU time, when compiling SeqAn with -O3 -DNDEBUG
. SeqAn also does full string decompression.
Wall times for SeqAn may be shorter than CPU time because it uses embedded multi-threading during BAM IO.
#######
git clone --recursive https://github.com/walaj/SeqLib.git
cd SeqLib
## cd htslib && ./configure --enable-libcurl && cd .. # support for remote (FTP/HTTPS/Google etc) BAM access
./configure ## or: ./configure LDFLAGS='-lcurl -lcrypto' # for remote support
make
make install
make seqtools ## for the command line version
You will need to have JsonCPP (v1.9.4 or higher) installed.
I have successfully compiled with GCC-4.5+ and Clang on Linux and OSX.
SeqLib is compatible with c++11 and later.
After building, you will need to add the relevant header directories:
SEQ=<path_to_seqlib_git_repos>
C_INCLUDE_PATH=$C_INCLUDE_PATH:$SEQ:$SEQ/htslib
And need to link the SeqLib static library and Fermi, BWA and HTSlib libraries
SEQ=<path_to_seqlib>
LDFLAGS="$LDFLAGS $SEQ/bin/libseqlib.a $SEQ/bin/libbwa.a $SEQ/bin/libfml.a $SEQ/bin/libhts.a"
To add support for reading BAMs, etc with HTTPS, FTP, S3, Google cloud, etc, you must compile and link with libcurl.
## set hts to build with libcurl links and hfile_libcurl.c
cd SeqLib/htslib
./configure --enable-libcurl
## compile seqlib with libcurl support
cd ../ # back to SeqLib main directory
./configure LDFLAGS="-lcurl -lcrypto"
make
make install
Remember then to then link any projects made with SeqLib with the additional -lcurl -lcrypto
flags.
SeqLib is a C++ library for querying BAM/SAM/CRAM files, performing BWA-MEM operations in memory, and performing sequence assembly. Core operations in SeqLib are peformed by:
The primary developer for these three projects is Heng Li.
SeqLib also has support for storing and manipulating genomic intervals via GenomicRegion
and GenomicRegionCollection
.
It uses an interval tree (provided by Erik Garrison @ekg) to provide for rapid interval queries.
SeqLib is built to be extendable. See VariantBam for examples of how to take advantage of C++ class extensions to build off of the SeqLib base functionality.
SeqLib is built to automatically handle memory management of C code from BWA-MEM and HTSlib by using C++ smart
pointers that handle freeing memory automatically. One of the
main motivations behind SeqLib is that all access to sequencing reads, BWA, etc should
completely avoid malloc
and free
. In SeqLib all the mallocs/frees are handled automatically in the constructors and
destructors.
There are overlaps between this project and BamTools from Derek Barnett, Gamgee from the Broad Institute, and SeqAn from Freie Universitat Berlin. These projects provide excellent and high quality APIs. SeqLib provides further performance enhancement and new capabilites for certain classes of bioinformatics problems.
Some differences:
- SeqLib has ~2-4x faster read/write speed over BamTools and lower memory footprint.
- SeqLib has support for CRAM file
- SeqLib provides in memory access to BWA-MEM, BLAT, chromosome aware interval tree, read correction, and sequence assembly with Fermi.
- SeqAn provide a substantial amount of additional capabilites not in SeqLib, including graph operations and an expanded suite of multi-sequence alignments.
- SeqAn embeds multi-threading into some functionality like BAM IO to improve wall times.
For your particular application, our hope is that SeqLib will provide a comprehensive and powerful envrionment to develop bioinformatics tools, or to be used in conjuction with the capablities in SeqAn and BamTools. Feature requests and comments are welcomed.
## BFC correction (input mode -m is b (BAM/SAM/CRAM), output mode -w is SAM stream
samtools view in.bam -h 1:1,000,000-1,002,000 | seqtools bfc - -G $REF | samtools sort - -m 4g -o corrected.bam
## Without a pipe, write to BAM
seqtools bfc in.bam -G $REF -b > corrected.bam
## Skip realignment, send to fasta
seqtools bfc in.bam -f > corrected.fasta
## Input as gzipped or plain fasta (or fastq), send to aligned BAM
seqtools bfc --infasta in.fasta -G $REG -b > corrected.bam
##### ASSEMBLY (same patterns as above)
samtools view in.bam -h 1:1,000,000-1,002,000 | seqtools fml - -G $REF | samtools sort - -m 4g -o assembled.bam
#include "SeqLib/RefGenome.h"
#include "SeqLib/BWAWrapper.h"
using namespace SeqLib;
RefGenome ref;
ref.LoadIndex("hg19.fasta");
// get sequence at given locus
std::string seq = ref.QueryRegion("1", 1000000,1001000);
// Make an in-memory BWA-MEM index of region
BWAWrapper bwa;
UnalignedSequenceVector usv = {{"chr_reg1", seq}};
bwa.ConstructIndex(usv);
// align an example string with BWA-MEM
std::string querySeq = "CAGCCTCACCCAGGAAAGCAGCTGGGGGTCCACTGGGCTCAGGGAAG";
BamRecordVector results;
// hardclip=false, secondary score cutoff=0.9, max secondary alignments=10
bwa.AlignSequence(querySeq, "my_seq", results, false, 0.9, 10);
// print results to stdout
for (auto& i : results)
std::cout << i << std::endl;
#include "SeqLib/BamReader.h"
#include "SeqLib/BWAWrapper.h"
using namespace SeqLib;
// open the reader BAM/SAM/CRAM
BamReader bw;
bw.Open("test.bam");
// open a new interface to BWA-MEM
BWAWrapper bwa;
bwa.LoadIndex("hg19.fasta");
// open the output BAM
BamWriter writer; // or writer(SeqLib::SAM) or writer(SeqLib::CRAM)
writer.SetWriteHeader(bwa.HeaderFromIndex());
writer.Open("out.bam");
writer.WriteHeader();
BamRecord r;
bool hardclip = false;
float secondary_cutoff = 0.90; // secondary alignments must have score >= 0.9*top_score
int secondary_cap = 10; // max number of secondary alignments to return
while (GetNextRecord(r)) {
BamRecordVector results; // alignment results (can have multiple alignments)
bwa.AlignSequence(r.Sequence(), r.Qname(), results, hardclip, secondary_cutoff, secondary_cap);
for (auto& i : results)
writer.WriteRecord(i);
}
#include "SeqLib/FermiAssembler.h"
using namespace SeqLib;
FermiAssembler f;
// read in data from a BAM
BamReader br;
br.Open("test_data/small.bam");
// retreive sequencing reads (up to 20,000)
BamRecord r;
BamRecordVector brv;
size_t count = 0;
while(br.GetNextRead(r) && count++ < 20000)
brv.push_back(r);
// add the reads and error correct them
f.AddReads(brv);
f.CorrectReads();
// peform the assembly
f.PerformAssembly();
// retrieve the contigs
std::vector<std::string> contigs = f.GetContigs();
// write as a fasta to stdout
for (size_t i = 0; i < contigs.size(); ++i)
std::cout << ">contig" << i << std::endl << contigs[i] << std::endl;
using namespace SeqLib;
BamReader r;
r.Open("test_data/small.bam");
GenomicRegion gr("X:1,002,942-1,003,294", r.Header());
r.SetRegion(gr);
SeqPlot s;
s.SetView(gr);
BamRecord rec;
BamRecordVector brv;
while(r.GetNextRecord(rec))
if (!rec.CountNBases() && rec.MappedFlag())
brv.push_back(rec);
s.SetPadding(20);
std::cout << s.PlotAlignmentRecords(brv);
Trimmed output from above (NA12878):
CTATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTGTCCATCCATCCATCCATCCA
CTATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTAT CATCCATCCATCCATCCATCCACCCATTCATCCATCCACCTATCCATCTATCAATCCATCCATCCATCCA
TATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCA
TATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTGTCCATCCATCCATCCATC
TATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCA
ATCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCAT
TCTATCTATCTCTTCTTCTGTCCGCTCATGTGTCTGTCCATCTATCTATC GTCCATCCATCCATCCATCCATCCATCCACCCATTCATCCATCCACCTATCCATCTATCAATCCATCCATCCATCCATCCGTCTATCTTATGCATCACAGC
TCTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATT
CTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATTC
CTATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATTC
ATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATTCAT
ATCTATCTCTTCTTCTGTCCGTTCATGTGTCTGTCCATCTATCTATCCATCTATCTATCATCTAACTATCTG----TCCATCCATCCATCCATCCACCCATTCAT
using namespace SeqLib;
#include "SeqLib/BamReader.h"
BamReader r;
// read from multiple streams coordinate-sorted order
r.Open("test_data/small.bam");
r.Open("test_data/small.cram");
r.Open("test_data/small.sam");
BamWriter w(SeqLib::SAM); // set uncompressed output
w.Open("-"); // write to stdout
w.SetHeader(r.Header()); // specify the header
w.WriteHeader(); // write out the header
BamRecord rec;
while(r.GetNextRecord(rec))
w.WriteRecord(rec);
w.Close(); // Optional. Will close on destruction
Perform error correction on reads, using BFC
#include "SeqLib/BFC.h"
using namespace SeqLib;
// brv is some set of reads to train the error corrector
for (BamRecordVector::const_iterator r = brv.begin(); r != brv.end(); ++r)
b.AddSequence(r->Sequence().c_str(), r->Qualities().c_str(), r->Qname().c_str());
b.Train();
b.clear(); // clear the training sequences. Training parameters saved in BFC object
// brv2 is some set to correct
for (BamRecordVector::const_iterator r = brv2.begin(); r != brv2.end(); ++r)
b.AddSequence(r->Sequence().c_str(), r->Qualities().c_str(), r->Qname().c_str());
b.ErrorCorrect();
// retrieve the sequences
UnalignedSequenceVector v;
std::string name, seq;
while (b.GetSequences(seq, name))
v.push_back({name, seq});
This project is being actively developed and maintained by Jeremiah Wala ([email protected]).
We would like to thank Heng Li (htslib/bwa/fermi), Erik Garrison (interval tree), Christopher Gilbert (aho corasick), and Mengyao Zhao (sw alignment), for providing open-source and robust bioinformatics solutions.
Development, support, guidance, testing:
- Steve Huang - Research Scientist, Broad Institute
- Steve Schumacher - Computational Biologist, Dana Farber Cancer Institute
- Cheng-Zhong Zhang - Research Scientist, Broad Institute
- Marcin Imielinski - Assistant Professor, Cornell University
- Rameen Beroukhim - Assistant Professor, Harvard Medical School