
GBWT Merging

Jouni Siren edited this page May 8, 2021 · 10 revisions

General

GBWT indexes can be merged using various algorithms. Merging is primarily useful when dealing with large datasets. Because the direct construction algorithm is sequential, we can save wall clock time by building multiple indexes in parallel and merging them into a single index.

In the following, we merge the current index with the input index. The input index is assumed to be the smaller one. If the merging encounters an unrecoverable error, it terminates the program with std::exit(EXIT_FAILURE).

Merging indexes also merges the metadata but discards the tags.

See also: Direct construction

Insertion algorithm

The basic merging algorithm extracts the sequences from the compressed input GBWT and inserts them into the current dynamic GBWT. It is around 2 times slower than direct construction. This algorithm is single-threaded.

This algorithm is mostly useful for merging small GBWTs when the original sequences are no longer available.

Fast algorithm

When the node ids do not overlap between the GBWTs, the record for a particular node will be identical in the merged GBWT and the source GBWT containing that node. We can then merge the GBWTs quickly by interleaving the records. This algorithm is single-threaded.

The fast algorithm is intended for merging the GBWTs for different chromosomes of the same genome. By building the chromosome-specific indexes in parallel and merging them into a whole-genome GBWT, we can significantly reduce the wall-clock time required for index construction.
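As a toy sketch of the idea (not the gbwt API; the types and names below are invented for illustration), interleaving records whose node id ranges do not overlap is an ordinary sorted merge that leaves each record untouched:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration: each record is keyed by a node id. When the id
// ranges of two indexes do not overlap, merging is just an order-preserving
// interleaving of the records; no record content changes.
using Record = std::pair<int, std::string>; // (node id, record payload)

std::vector<Record> interleave(const std::vector<Record>& a,
                               const std::vector<Record>& b)
{
    std::vector<Record> merged;
    merged.reserve(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(),
               std::back_inserter(merged),
               [](const Record& x, const Record& y) { return x.first < y.first; });
    return merged;
}
```

Because each record is copied verbatim, the cost of the merge is linear in the total size of the inputs, which is why this algorithm is so much faster than re-inserting the sequences.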

Parallel algorithm

This is an adaptation of the BWT-merge algorithm for GBWT. The insertion algorithm searches for the input sequences in the current index, determines the lexicographic rank of each prefix of each sequence in the index, and updates the index immediately. In contrast, the parallel algorithm stores the lexicographic ranks, sorts them to form the rank array, and uses the rank array to interleave the GBWTs.

In the search phase, a number of search threads extract the new sequences from the dynamic input GBWT and search for them in the current dynamic GBWT. The sequences are assigned dynamically to the search threads. If the sequences are short or their length varies, it may be useful to increase the chunk size to assign multiple sequences at once to each thread. Each thread stores the positions (lexicographic ranks) it encounters in its position buffer. When the position buffer gets full, it is sorted, gap-encoded and merged with the similarly compressed thread buffer. When the thread buffer gets full, the search thread tries to insert it into the global merge buffers. There are M merge buffers, with buffer i containing 2^i merged thread buffers. If the merge buffers are also full, the search thread merges them and writes the 2^M merged thread buffers into a temporary file (see global settings).
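The merge-buffer hierarchy behaves like binary addition: inserting a full thread buffer merges it with buffer 0, and a carry propagates upward until an empty buffer is found. The following self-contained sketch (invented names, not the gbwt API) illustrates that mechanism with plain sorted vectors standing in for compressed buffers:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Illustrative sketch: M global merge buffers, where buffer i holds the
// contents of 2^i merged thread buffers. Inserting a full thread buffer
// works like binary increment: merge with buffer 0, carry into buffer 1,
// and so on. When all M buffers are full, the merged result is spilled
// (here to a vector; in the real algorithm, to a temporary file).
struct MergeBuffers
{
    std::vector<std::vector<std::uint64_t>> buffers; // buffers[i]: 2^i merged runs
    std::vector<std::uint64_t> spilled;              // stand-in for a temporary file

    explicit MergeBuffers(std::size_t m) : buffers(m) {}

    static std::vector<std::uint64_t> merge(const std::vector<std::uint64_t>& a,
                                            const std::vector<std::uint64_t>& b)
    {
        std::vector<std::uint64_t> out;
        out.reserve(a.size() + b.size());
        std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
        return out;
    }

    void insert(std::vector<std::uint64_t> run)
    {
        std::sort(run.begin(), run.end()); // sort the position buffer
        for(auto& buffer : buffers)
        {
            if(buffer.empty()) { buffer = std::move(run); return; }
            run = merge(buffer, run); // merge and carry upward
            buffer.clear();
        }
        // All M buffers were full: write the 2^M merged buffers "to disk".
        spilled = merge(spilled, run);
    }
};
```

Each rank is merged O(M) times before reaching disk, so a larger M trades memory for fewer, larger temporary files.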

The merge phase takes the rank array stored in the temporary files, merges it, and uses it to interleave the two dynamic GBWTs. There is a separate reader thread for reading, decompressing, and buffering each file. A merge thread reads the streams from each reader thread, merges them, and buffers the results. The main thread reads the streamed rank array from the merge thread and uses it to interleave the GBWTs. The GBWT can be partitioned between multiple merge jobs, where each job has a separate rank array for interleaving the corresponding partition.
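The interleaving step itself can be sketched as a single linear pass (illustration only, with invented names): if `ra[j]` is the number of positions of the current index that precede position `j` of the input in the merged order, the merged sequence follows directly:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sketch: interleave two sequences using a sorted rank array.
// ra[j] tells how many positions of `a` precede position j of `b` in the
// merged order; ra is nondecreasing.
std::vector<char> interleave(const std::vector<char>& a,
                             const std::vector<char>& b,
                             const std::vector<std::size_t>& ra)
{
    std::vector<char> out;
    out.reserve(a.size() + b.size());
    std::size_t ai = 0;
    for(std::size_t j = 0; j < b.size(); j++)
    {
        while(ai < ra[j]) { out.push_back(a[ai++]); } // copy from the current index
        out.push_back(b[j]);                          // then the input position
    }
    while(ai < a.size()) { out.push_back(a[ai++]); }  // remaining tail of `a`
    return out;
}
```

In the real algorithm the items are GBWT positions rather than characters, and the rank array arrives as a compressed stream, but the interleaving logic is the same linear scan.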

This algorithm is intended for merging GBWTs over the same chromosome. Separate jobs build GBWTs for batches of e.g. 2500 samples in parallel, and the resulting partial indexes are merged using the parallel algorithm. Because the direct construction algorithm is essentially sequential, the parallel merging algorithm can speed up index construction by a factor of 2 to 3.

The default parameters assume a system with 32 CPU cores, at least 128 GB of memory, and a fast SSD for the temporary files. With J merge jobs, the rank array is stored in temporary files of 12/J GB each. Increasing (decreasing) the number of merge buffers by one doubles (halves) the size of the individual files, while memory usage increases significantly (decreases moderately). Thread buffer size has a linear effect on both memory usage and the size of the individual files. The number of search threads has an almost linear effect on memory usage.

Merging tool

Existing indexes can be merged with merge_gbwt.

merge_gbwt [options] -o output input1 input2 [input3 ...]

The program reads input1.gbwt, inserts the sequences from input2.gbwt (and any further inputs) into it, and writes the merged index to output.gbwt. For best performance, input1 should be the largest input.

  • -o X: Write the output to X.gbwt. Required.
  • -O: Output SDSL format instead of simple-sds format.

Example: merge_gbwt -o merged large small1 small2 inserts the sequences from small1.gbwt and small2.gbwt into large.gbwt and writes the merged index to merged.gbwt.

Algorithm choice

  • -f: Fast algorithm.
  • -i: Insertion algorithm (default).
  • -p: Parallel algorithm.

Insertion algorithm

  • -b N: Use batch size of N sequences. If the batch size is 0, all sequences are inserted as a single batch. Default: 2000.
  • -s N: Use sample interval N in the inserted sequences. Set to 0 to sample only the end of each sequence. Default: 1024.

Parallel algorithm

  • -C N: Parallelize the search in chunks of N sequences per thread. Default: 1.
  • -J N: Run N parallel merge jobs. Default: 4.
  • -M N: Use N global merge buffers (merge 2^N thread buffers before writing to disk). Default: 6.
  • -P N: Set the size of thread-specific (uncompressed) position buffers to N megabytes. Default: 64.
  • -S N: Use N search threads. Default: number of logical CPU cores.
  • -t X: Use directory X for temporary files. Default: the current directory (.).
  • -T N: Set the size of (compressed) thread buffer to N megabytes. Default: 256.
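For instance, a parallel merge of two batch indexes on a 16-core machine might look like the following (the file names are hypothetical; the flags are those documented above):

```shell
# Merge batch1.gbwt and batch2.gbwt with the parallel algorithm, using
# 16 search threads, chunks of 8 sequences per thread, and a dedicated
# temporary directory on a fast SSD.
merge_gbwt -p -S 16 -C 8 -t /mnt/ssd/tmp -o merged batch1 batch2
```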

Interface

Insertion algorithm

This algorithm is used through a member function of DynamicGBWT.

void merge(const GBWT& source, size_type batch_size = MERGE_BATCH_SIZE, size_type sample_interval = SAMPLE_INTERVAL)
  • source: Input GBWT.
  • batch_size: Batch size in number of sequences. Use 0 to insert all sequences as a single batch. Default: 2000.
  • sample_interval: Sample interval for the inserted sequences. Default: 1024.

Example:

DynamicGBWT input1;
sdsl::load_from_file(input1, input1_name);
GBWT input2;
sdsl::load_from_file(input2, input2_name);
input1.merge(input2);

This reads the index from file input1_name as a dynamic GBWT input1 and the index from file input2_name as a compressed GBWT input2. Then it inserts the sequences from input2 into input1.

Fast algorithm

The fast algorithm is a constructor of GBWT.

GBWT(const std::vector<GBWT>& sources)
  • sources: Input GBWTs in the order the sequences will be inserted.

Example:

std::vector<GBWT> inputs(argc - 1);
for(int i = 1; i < argc; i++)
{
  std::string input_name = argv[i];
  sdsl::load_from_file(inputs[i - 1], input_name + GBWT::EXTENSION);
}
GBWT merged(inputs);

This reads the GBWT indexes for the base names specified in the command line arguments and merges them into a single index.

Parallel algorithm

The parallel algorithm is a member function of DynamicGBWT.

void merge(const DynamicGBWT& source, const MergeParameters& parameters)
  • source: Input GBWT.
  • parameters: Merging parameters.

The merging parameters are stored in struct MergeParameters, which is defined in support.h. Its default constructor creates a structure with the default parameters. The member functions silently adjust the parameters if they are outside the acceptable range.

  • void setPosBufferSize(size_type megabytes): Set the position buffer size to megabytes MB. Default 64, minimum 1, maximum 16384.
  • void setThreadBufferSize(size_type megabytes): Set the thread buffer size to megabytes MB. Default 256, minimum 1, maximum 16384.
  • void setMergeBuffers(size_type n): Set the number of merge buffers to n. Default 6, minimum 1, maximum 16.
  • void setChunkSize(size_type n): Set the chunk size used in the search phase to n. Default 1, minimum 1.
  • void setMergeJobs(size_type n): Set the number of merge jobs in the merge phase to n. Default 4, minimum 1, maximum 16.

Example:

MergeParameters parameters;
parameters.setChunkSize(16);
DynamicGBWT input1;
sdsl::load_from_file(input1, input1_name);
DynamicGBWT input2;
sdsl::load_from_file(input2, input2_name);
input1.merge(input2, parameters);

This reads the index from file input1_name to input1 and the index from file input2_name to input2. Then it merges the indexes using chunk size 16 but otherwise default parameters.

References

Jouni Sirén: Burrows-Wheeler transform for terabases. Proc. DCC 2016, IEEE, pp. 211-220, Snowbird, Utah, USA, March 29 - April 1, 2016. DOI: 10.1109/DCC.2016.17