-
Notifications
You must be signed in to change notification settings - Fork 13
GBWT Merging
General...
If the merging encounters an unrecoverable error, it terminates the program with std::exit(EXIT_FAILURE)
.
See also: Direct construction
The basic merging algorithm extracts the sequences from a compressed GBWT and inserts them into a dynamic GBWT. It is around 2 times slower than direct construction. This algorithm is single-threaded.
This algorithm is mostly useful for merging small GBWTs when the original sequences are no longer available.
When the node ids do not overlap between the GBWTs, the record for a particular node will be identical in the merged GBWT and the source GBWT containing that node. We can then merge the GBWTs quickly by interleaving the records. This algorithm is single-threaded.
The fast algorithm is inteded for merging the GBWTs for different chromosomes of the same genome. By building the chromosome-specific indexes in parallel and merging them into a whole-genome GBWT, we can significantly reduce the wall-clock time required for index construction.
This is an adaptation of the BWT-merge algorithm for GBWT. The insertion algorithm searches for the new sequences in the current index, determines the lexicographic rank of each prefix of each sequence in the index, and updates the index immediately. In contrast, the parallel algorithm stores the lexicographic ranks, sorts them to form the rank array, and uses the rank array to interleave the GBWTs.
In the search phase, a number of search threads extract the new sequences from a dynamic GBWT and search for them in the current index (also a dynamic GBWT). The sequences are assigned dynamically to the search threads. If the sequences are short or their length varies, it may be useful to increase the chunk size to assign multiple sequences at once to each thread. Each thread stores the positions (lexicographic ranks) it encounters in its position buffer. When the position buffer gets full, it is sorted, gap-encoded and merged with the similarly compressed thread buffer. When the thread buffer gets full, the search thread tries to insert it into the global merge buffers. There are M merge buffers, with buffer i containing 2^i merged thread buffers. If the merge buffers are also full, the search thread merges them and writes the 2^M merged thread buffers into a temporary file (see global settings).
The merge phase takes the rank array stored in the temporary files, merges it, and uses it to interleave the two dynamic GBWTs. There is a separate reader thread for reading, decompressing, and buffering each file. A merge thread reads the streams from each reader thread, merges them, and buffers the results. The main read reads the streamed rank array from the merge thread and uses it to interleave the GBWTs.
This algorithm is intended for merging GBWTs over the same chromosome. Separate jobs build GBWTs for batches of e.g. 2500 samples, and the resulting partial indexes are merged using the parallel algorithm. Because the direct construction algorithm is essentially sequential, the parallel merging algorithm can speed up index construction by a factor of 2 to 3.
Existing indexes can be merged with merge_gbwt
.
merge_gbwt [options] -o output input1 input2 [input3 ...]
The program reads input1.gbwt
, inserts the sequences from input2.gbwt
(and further inputs) to it, and writes the merged index to output.gbwt
. For best performance, input1
should be the largest input.
-
-o X
: Write the output toX.gbwt
. Required.
Example: merge_gbwt -o merged large small1 small2
inserts the sequences from small1.gbwt
and small2.gbwt
to large.gbwt
and writes the merged index to merged.gbwt
.
-
-f
: Fast algorithm. -
-i
: Insertion algorithm (default). -
-p
: Parallel algorithm.
-
-b N
: Use batch size ofN
sequences. If the batch size is 0, all sequences are inserted as a single batch. Default: 2000. -
-s N
: Use sample intervalN
in the inserted sequences. Set to 0 to sample only the end of each sequence. Default: 1024.
-
-C N
: Parallelize the search in chunks ofN
sequences per thread. Default: 1. -
-M N
: UseN
global merge buffers (merge 2^N
thread buffers before writing to disk). Default: 6. -
-P N
: Set the size of thread-specific (uncompressed) position buffers toN
megabytes. Default: 64. -
-S N
: UseN
search threads. Default: number of logical CPU cores. -
-t X
: Use directoryX
for temporary files. Default:.
. -
-T N
: Set the size of (compressed) thread buffer toN
megabytes. Default: 256.
This algorithm is used through a member function of DynamicGBWT
.
void merge(const GBWT& source, size_type batch_size = MERGE_BATCH_SIZE, size_type sample_interval = SAMPLE_INTERVAL)
-
source
: Input GBWT. -
batch_size
: Batch size in number of sequences. Use 0 to insert all sequences as a single batch. Default: 2000. -
sample_interval
: Sample interval for the inserted sequences.
Example:
DynamicGBWT input1;
sdsl::load_from_file(input1, input1_name);
GBWT input2;
sdsl::load_from_file(input2, input2_name);
input1.insert(input2);
This reads the index from file input1_name
as a dynamic GBWT input1
and the index from file input2_name
as a compressed GBWT input2
. Then it inserts the sequences from input2
into input1
.
The fast algorithm is a constructor of GBWT
.
GBWT(const std::vector<GBWT>& sources)
-
sources
: Input GBWTs in the order the sequences will be inserted.
Example:
std::vector<GBWT> inputs(argc - 1);
for(int i = 1; i < argc; i++)
{
std::string input_name = argv[i];
sdsl::load_from_file(inputs[i - 1], input_name + GBWT::EXTENSION);
}
GBWT merged(inputs);
This reads the GBWT indexes for the base names specified in the command line arguments and merges them into a single index.
Parallel...
Jouni Sirén: Burrows-Wheeler transform for terabases. Proc. DCC 2016, IEEE, pp. 211-220, Snowbird, Utah, USA, March 29 - April 1, 2016. DOI: 10.1109/DCC.2016.17