HowTo: convert BAM to cSRA
Your BAM file should be sorted, and should not contain hard-clipped reads. Unsorted input is accepted, but will produce a less useful result. In the examples below, we will refer to the source BAM file as input.bam and the output cSRA file as output.sra.
-
locate your reference material
Examine the BAM headers:
$ samtools view -H input.bam
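The header dump can be long, so it may help to reduce it to just the reference names and lengths. The following is a minimal sketch, not part of the toolkit: the function name list_references is made up for illustration, and it assumes the header uses the standard @SQ records with SN: and LN: tags.

```shell
# list_references: read SAM header text on stdin and print one line per
# @SQ record as "RNAME<TAB>LENGTH", taken from the SN: and LN: tags.
# (Hypothetical helper for illustration; not part of samtools or sra-tools.)
list_references() {
  awk -F'\t' '
    $1 == "@SQ" {
      name = ""; len = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^SN:/) name = substr($i, 4)
        if ($i ~ /^LN:/) len  = substr($i, 4)
      }
      print name "\t" len
    }'
}

# Typical use (requires samtools on PATH):
#   samtools view -H input.bam | list_references
```

The names in the first column of the output are the RNAME values that the references you locate must account for, and the lengths give a quick sanity check against candidate reference sequences.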
The reference sequences should be exactly those used by the aligner to produce the BAM file. Otherwise, bam-load will produce invalid output. The BAM format is insufficiently specified regarding reference sequences, and bam-load itself has no way of detecting an incorrect reference other than via its length. Furthermore, there is no reliable indication within the BAM file of the location of the original reference.
-
prepare a configuration file
The configuration file translates the RNAME in the BAM file to an SRA or FASTA reference file. Its format is simple:
a. one line per RNAME
b. tab-separated columns
c. the first column is RNAME as used in the BAM file
d. the second column is either an SRA reference accession or a local sequence name.
If the column holds an SRA reference accession, bam-load will attempt to fetch it remotely, provided the toolkit is configured for remote access. Otherwise, the value is treated as a local sequence name: the base name of a FASTA file, without extension or directory path.
e. the third column (optional) may contain the word CIRCULAR (all-caps) if the reference is circular.
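The format rules above are easy to break by accident, for example by typing spaces instead of tabs. The following is a rough sketch of a format check, assuming a strict reading of rules a to e; the function name check_cfg is hypothetical, and bam-load itself may be more permissive than this check (for instance about blank lines).

```shell
# check_cfg FILE: report lines that break the format rules listed above.
# Prints one message per bad line and returns non-zero if any were found.
# (Hypothetical helper; a strict reading of the rules, not bam-load's parser.)
check_cfg() {
  awk -F'\t' '
    NF < 2 || NF > 3 {
      printf "line %d: expected 2 or 3 tab-separated columns, found %d\n", NR, NF
      bad = 1; next
    }
    $1 == "" || $2 == "" {
      printf "line %d: RNAME and reference name must not be empty\n", NR
      bad = 1; next
    }
    NF == 3 && $3 != "CIRCULAR" {
      printf "line %d: third column must be CIRCULAR, found \"%s\"\n", NR, $3
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Typical use:
#   check_cfg input.bam.cfg && echo "format looks fine"
```

A line written with spaces between the columns is reported as having one column, which is the most common mistake this catches.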
The following example maps the ambiguous (but common) numeric reference designations in column one to fully specified accession numbers in column two:
1 CM000663.1
2 CM000664.1
3 CM000665.1
4 CM000666.1
5 CM000667.1
6 CM000668.1
7 CM000669.1
8 CM000670.1
9 CM000671.1
10 CM000672.1
11 CM000673.1
12 CM000674.1
13 CM000675.1
14 CM000676.1
15 CM000677.1
16 CM000678.1
17 CM000679.1
18 CM000680.1
19 CM000681.1
20 CM000682.1
21 CM000683.1
22 CM000684.1
X CM000685.1
Y CM000686.1
MT NC_012920.1
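Since the file needs one line per RNAME, it can be worth comparing it against the BAM header before going further. The sketch below prints every header reference name that has no entry in column one of the configuration file. The function name missing_refs is hypothetical, and the sketch assumes standard @SQ header records with an SN: tag.

```shell
# missing_refs CFG: read SAM header text on stdin and print every @SQ
# reference name (SN: tag) that has no entry in column one of CFG.
# (Hypothetical helper for illustration; not part of sra-tools.)
missing_refs() {
  awk -F'\t' -v cfg="$1" '
    BEGIN {
      # Load column one of the configuration file into a lookup table.
      while ((getline line < cfg) > 0) {
        split(line, col, "\t")
        known[col[1]] = 1
      }
      close(cfg)
    }
    $1 == "@SQ" {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^SN:/ && !(substr($i, 4) in known))
          print substr($i, 4)
    }'
}

# Typical use (requires samtools on PATH):
#   samtools view -H input.bam | missing_refs input.bam.cfg
```

Empty output means every RNAME in the header has a line in the configuration file; it says nothing about whether the mapped accession is the correct one.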
-
pre-flight your configuration
$ bam-load --only-verify -o output -k input.bam.cfg input.bam
This command will not generate an SRA file, but will test that bam-load can find the reference files through the configuration file. It will check their length and MD5 checksum (if present) and will warn if the length does not match.
-
ensure you have enough disk space and RAM
bam-load performs spot-assembly to unite mate-pairs. This can be a lengthy and resource-intensive process, and it runs best with ample memory. The actual amount depends upon the input file, but plan on allocating 16 GB of RAM per process.
-
run bam-load
A simple example:
$ bam-load -o output -k input.bam.cfg input.bam
-
produce a single-file SRA archive
SRA loaders produce directories of database components, each containing many files. Use the kar tool to pack the directory into a single-file SRA archive:
$ kar --create output.sra --directory output
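The commands from the steps above can be chained so that each one runs only if the previous one succeeded. This is a minimal sketch rather than an official wrapper: the function names run and convert_bam and the DRY_RUN variable are made up for illustration, and it assumes bam-load exits with a non-zero status when verification or loading fails.

```shell
# run: execute a command, or only print it when DRY_RUN is non-empty.
# (Hypothetical helper; DRY_RUN is not an sra-tools setting.)
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# convert_bam INPUT_BAM CONFIG OUTPUT_DIR: verify, load, then archive.
# Each step runs only if the previous one returned success.
# Assumption: bam-load and kar signal failure through their exit status.
convert_bam() {
  local bam=$1 cfg=$2 out=$3
  run bam-load --only-verify -o "$out" -k "$cfg" "$bam" &&
  run bam-load -o "$out" -k "$cfg" "$bam" &&
  run kar --create "$out.sra" --directory "$out"
}

# Typical use:
#   DRY_RUN=1 convert_bam input.bam input.bam.cfg output   # print only
#   convert_bam input.bam input.bam.cfg output             # really run
```

Running it once with DRY_RUN=1 shows the exact command lines before any long-running load is started.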
- use a machine with more RAM available
- use --tmpfs to allocate scratch space on an SSD
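Before a long load, the RAM and scratch-space advice above can be checked with a few lines of shell. This is a sketch under stated assumptions: the function names are hypothetical, mem_total_gb expects Linux /proc/meminfo-style input, and the 16 used in the example is simply the per-process figure suggested earlier, not a hard limit enforced by bam-load.

```shell
# mem_total_gb: read /proc/meminfo-style text on stdin and print MemTotal
# in whole GiB (the kernel reports the value in kB). Linux-specific.
mem_total_gb() {
  awk '$1 == "MemTotal:" { print int($2 / 1048576) }'
}

# enough_ram REQUIRED_GB: read meminfo text on stdin and succeed if total
# RAM is at least REQUIRED_GB GiB.
enough_ram() {
  local have
  have=$(mem_total_gb)
  [ "${have:-0}" -ge "$1" ]
}

# free_gb DIR: print the free space, in whole GiB, of the file system
# that holds DIR (for example the scratch directory).
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1048576) }'
}

# Typical use:
#   enough_ram 16 < /proc/meminfo || echo "less than 16 GiB of RAM" >&2
#   free_gb /path/to/scratch
```

Note that a machine sold as having 16 GB usually reports slightly less than 16 GiB in MemTotal, so a lower threshold may be more practical for the check.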