GBWT Merging

General

General...

If the merging encounters an unrecoverable error, it terminates the program with std::exit(EXIT_FAILURE).

Insertion algorithm

The basic merging algorithm extracts the sequences from a compressed GBWT and inserts them into a dynamic GBWT. It is around 2 times slower than direct construction. This algorithm is single-threaded.

This algorithm is mostly useful for merging small GBWTs when the original sequences are no longer available.

Fast algorithm

When the node ids do not overlap between the GBWTs, the record for a particular node will be identical in the merged GBWT and the source GBWT containing that node. We can then merge the GBWTs quickly by interleaving the records. This algorithm is single-threaded.

The fast algorithm is inteded for merging the GBWTs for different chromosomes of the same genome. By building the chromosome-specific indexes in parallel and merging them into a whole-genome GBWT, we can significantly reduce the wall-clock time required for index construction.

Parallel algorithm

This is an adaptation of the BWT-merge algorithm for GBWT. The insertion algorithm searches for the new sequences in the current index, determines the lexicographic rank of each prefix of each sequence in the index, and updates the index immediately. In contrast, the parallel algorithm stores the lexicographic ranks, sorts them to form the rank array, and uses the rank array to interleave the GBWTs.

In the search phase, a number of search threads extract the new sequences from a dynamic GBWT and search for them in the current index (also a dynamic GBWT). The sequences are assigned dynamically to the search threads. If the sequences are short or their length varies, it may be useful to increase the chunk size to assign multiple sequences at once to each thread. Each thread stores the positions (lexicographic ranks) it encounters in its position buffer. When the position buffer gets full, it is sorted, gap-encoded and merged with the similarly compressed thread buffer. When the thread buffer gets full, the search thread tries to insert it into the global merge buffers. There are M merge buffers, with buffer i containing 2^i merged thread buffers. If the merge buffers are also full, the search thread merges them and writes the 2^M merged thread buffers into a temporary file (see global settings).

The merge phase takes the rank array stored in the temporary files, merges it, and uses it to interleave the two dynamic GBWTs. There is a separate reader thread for reading, decompressing, and buffering each file. A merge thread reads the streams from each reader thread, merges them, and buffers the results. The main read reads the streamed rank array from the merge thread and uses it to interleave the GBWTs.

This algorithm is intended for merging GBWTs over the same chromosome. Separate jobs build GBWTs for batches of e.g. 2500 samples, and the resulting partial indexes are merged using the parallel algorithm. Because the direct construction algorithm is essentially sequential, the parallel merging algorithm can speed up index construction by a factor of 2 to 3.

Merging tool

Existing indexes can be merged with merge_gbwt.

merge_gbwt [options] -o output input1 input2 [input3 ...]

The program reads input1.gbwt, inserts the sequences from input2.gbwt (and further inputs) to it, and writes the merged index to output.gbwt. For best performance, input1 should be the largest input.

-o X: Write the output to X.gbwt. Required.

Example: merge_gbwt -o merged large small1 small2 inserts the sequences from small1.gbwt and small2.gbwt to large.gbwt and writes the merged index to merged.gbwt.

Algorithm choice

-f: Fast algorithm.
-i: Insertion algorithm (default).
-p: Parallel algorithm.

Insertion algorithm

-b N: Use batch size of N sequences. If the batch size is 0, all sequences are inserted as a single batch. Default: 2000.
-s N: Use sample interval N in the inserted sequences. Set to 0 to sample only the end of each sequence. Default: 1024.

Parallel algorithm

-C N: Parallelize the search in chunks of N sequences per thread. Default: 1.
-M N: Use N global merge buffers (merge 2^N thread buffers before writing to disk). Default: 6.
-P N: Set the size of thread-specific (uncompressed) position buffers to N megabytes. Default: 64.
-S N: Use N search threads. Default: number of logical CPU cores.
-t X: Use directory X for temporary files. Default: ..
-T N: Set the size of (compressed) thread buffer to N megabytes. Default: 256.

Interface

Insertion algorithm

This algorithm is used through a member function of DynamicGBWT.

void merge(const GBWT& source, size_type batch_size = MERGE_BATCH_SIZE, size_type sample_interval = SAMPLE_INTERVAL)

source: Input GBWT.
batch_size: Batch size in number of sequences. Use 0 to insert all sequences as a single batch. Default: 2000.
sample_interval: Sample interval for the inserted sequences.

Example:

DynamicGBWT input1;
sdsl::load_from_file(input1, input1_name);
GBWT input2;
sdsl::load_from_file(input2, input2_name);
input1.insert(input2);

This reads the index from file input1_name as a dynamic GBWT input1 and the index from file input2_name as a compressed GBWT input2. Then it inserts the sequences from input2 into input1.

Fast algorithm

The fast algorithm is a constructor of GBWT.

GBWT(const std::vector<GBWT>& sources)

sources: Input GBWTs in the order the sequences will be inserted.

Example:

std::vector<GBWT> inputs(argc - 1);
for(int i = 1; i < argc; i++)
{
  std::string input_name = argv[i];
  sdsl::load_from_file(inputs[i - 1], input_name + GBWT::EXTENSION);
}
GBWT merged(inputs);

This reads the GBWT indexes for the base names specified in the command line arguments and merges them into a single index.

Parallel algorithm

Parallel...

References

Jouni Sirén: Burrows-Wheeler transform for terabases. Proc. DCC 2016, IEEE, pp. 211-220, Snowbird, Utah, USA, March 29 - April 1, 2016. DOI: 10.1109/DCC.2016.17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GBWT Merging

General

Insertion algorithm

Fast algorithm

Parallel algorithm

Merging tool

Algorithm choice

Insertion algorithm

Parallel algorithm

Interface

Insertion algorithm

Fast algorithm

Parallel algorithm

References

Clone this wiki locally