GBWT Merging
GBWT indexes can be merged using various algorithms. Merging is primarily useful when dealing with large datasets. Because the direct construction algorithm is sequential, we can save wall clock time by building multiple indexes in parallel and merging them into a single index.
In the following, we merge the current index with the input index. The input index is assumed to be the smaller one. If the merging encounters an unrecoverable error, it terminates the program with `std::exit(EXIT_FAILURE)`.
Merging indexes also merges the metadata but discards the tags.
See also: Direct construction
The basic merging algorithm extracts the sequences from the compressed input GBWT and inserts them into the current dynamic GBWT. It is around 2 times slower than direct construction. This algorithm is single-threaded.
This algorithm is mostly useful for merging small GBWTs when the original sequences are no longer available.
When the node ids do not overlap between the GBWTs, the record for a particular node will be identical in the merged GBWT and the source GBWT containing that node. We can then merge the GBWTs quickly by interleaving the records. This algorithm is single-threaded.
The fast algorithm is intended for merging the GBWTs for different chromosomes of the same genome. By building the chromosome-specific indexes in parallel and merging them into a whole-genome GBWT, we can significantly reduce the wall-clock time required for index construction.
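To illustrate why interleaving records is enough, here is a minimal sketch (a toy model, not the gbwt API) that represents each index as a map from node ids to opaque records. When the node id ranges are disjoint, merging is just a union that copies every record unchanged:

```cpp
#include <map>
#include <string>

// Toy model of a GBWT: a map from node id to its (opaque) record.
using ToyGBWT = std::map<int, std::string>;

// With disjoint node id ranges, the merged index is the union of the
// records; no record needs to be rebuilt.
ToyGBWT interleave(const ToyGBWT& a, const ToyGBWT& b) {
    ToyGBWT merged = a;
    merged.insert(b.begin(), b.end());  // disjoint keys: nothing is overwritten
    return merged;
}
```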
This is an adaptation of the BWT-merge algorithm for GBWT. The insertion algorithm searches for the input sequences in the current index, determines the lexicographic rank of each prefix of each sequence in the index, and updates the index immediately. In contrast, the parallel algorithm stores the lexicographic ranks, sorts them to form the rank array, and uses the rank array to interleave the GBWTs.
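The rank-array idea can be sketched with a toy example (plain integers standing in for GBWT positions, and an explicit in-memory rank array instead of the compressed temporary files): each entry of the sorted rank array says how many elements of the current index precede the corresponding element of the input, which is all that is needed to weave the two together.

```cpp
#include <cstddef>
#include <vector>

// Toy rank-array interleaving: weave the elements of b into a, where
// ranks[i] is the number of elements of a that precede b[i]. The ranks
// must be sorted, just like the merged rank array in the real algorithm.
std::vector<int> interleave(const std::vector<int>& a,
                            const std::vector<int>& b,
                            const std::vector<std::size_t>& ranks) {
    std::vector<int> result;
    std::size_t ai = 0;
    for (std::size_t bi = 0; bi < b.size(); bi++) {
        while (ai < ranks[bi]) result.push_back(a[ai++]);  // copy a up to the rank
        result.push_back(b[bi]);                           // then place b's element
    }
    while (ai < a.size()) result.push_back(a[ai++]);       // copy the tail of a
    return result;
}
```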
In the search phase, a number of search threads extract the new sequences from the dynamic input GBWT and search for them in the current dynamic GBWT. The sequences are assigned dynamically to the search threads. If the sequences are short or their length varies, it may be useful to increase the chunk size to assign multiple sequences at once to each thread. Each thread stores the positions (lexicographic ranks) it encounters in its position buffer. When the position buffer gets full, it is sorted, gap-encoded and merged with the similarly compressed thread buffer. When the thread buffer gets full, the search thread tries to insert it into the global merge buffers. There are M merge buffers, with buffer i containing 2^i merged thread buffers. If the merge buffers are also full, the search thread merges them and writes the 2^M merged thread buffers into a temporary file (see global settings).
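The merge-buffer hierarchy behaves like binary addition: slot i holds a run built from 2^i thread buffers, and inserting a new buffer merges upward until a free slot is found. Below is a minimal single-threaded sketch of this scheme (an illustration, not the gbwt implementation), with sorted integer vectors standing in for compressed thread buffers and a return value standing in for the temporary file:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <optional>
#include <vector>

struct MergeBuffers {
    // slots[i] holds a sorted run built from 2^i thread buffers, or is empty.
    std::vector<std::vector<int>> slots;
    explicit MergeBuffers(std::size_t m) : slots(m) {}

    static std::vector<int> merge(std::vector<int> a, const std::vector<int>& b) {
        std::vector<int> out;
        std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
        return out;
    }

    // Insert a sorted thread buffer. Returns a fully merged run of 2^M
    // buffers when all slots were full (the real code writes it to disk).
    std::optional<std::vector<int>> insert(std::vector<int> buffer) {
        for (auto& slot : slots) {
            if (slot.empty()) { slot = std::move(buffer); return std::nullopt; }
            buffer = merge(std::move(buffer), slot);  // carry: merge and go up
            slot.clear();
        }
        return buffer;  // overflow: flush to the "temporary file"
    }
};
```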
The merge phase takes the rank array stored in the temporary files, merges it, and uses it to interleave the two dynamic GBWTs. There is a separate reader thread for reading, decompressing, and buffering each file. A merge thread reads the streams from the reader threads, merges them, and buffers the results. The main thread reads the streamed rank array from the merge thread and uses it to interleave the GBWTs. The GBWT can be partitioned between multiple merge jobs, where each job has a separate rank array for interleaving the corresponding partition.
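The multiway merge at the core of the merge phase can be sketched with a min-heap in a single thread (an illustrative model; the in-memory vectors below stand in for the decompressed streams produced by the reader threads):

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// k-way merge of sorted streams using a min-heap keyed by (value, stream).
std::vector<int> merge_streams(const std::vector<std::vector<int>>& streams) {
    using Item = std::pair<int, std::size_t>;  // (value, stream index)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    std::vector<std::size_t> pos(streams.size(), 0);
    for (std::size_t i = 0; i < streams.size(); i++) {
        if (!streams[i].empty()) heap.emplace(streams[i][0], i);
    }
    std::vector<int> out;
    while (!heap.empty()) {
        auto [value, i] = heap.top(); heap.pop();
        out.push_back(value);
        if (++pos[i] < streams[i].size()) heap.emplace(streams[i][pos[i]], i);
    }
    return out;
}
```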
This algorithm is intended for merging GBWTs over the same chromosome. Separate jobs build GBWTs for batches of e.g. 2500 samples in parallel, and the resulting partial indexes are merged using the parallel algorithm. Because the direct construction algorithm is essentially sequential, the parallel merging algorithm can speed up index construction by a factor of 2 to 3.
The default parameters assume a system with 32 CPU cores, at least 128 GB of memory, and a fast SSD for the temporary files. With J merge jobs, the rank array is stored in temporary files of 12/J GB each. Increasing (decreasing) the number of merge buffers by one doubles (halves) the size of the individual files, while the memory usage increases significantly (decreases moderately). Thread buffer size has a linear effect on memory usage and on the size of the individual files. The number of search threads has an almost linear effect on memory usage.
Existing indexes can be merged with `merge_gbwt`.

```
merge_gbwt [options] -o output input1 input2 [input3 ...]
```
The program reads `input1.gbwt`, inserts the sequences from `input2.gbwt` (and further inputs) into it, and writes the merged index to `output.gbwt`. For best performance, `input1` should be the largest input.
- `-o X`: Write the output to `X.gbwt`. Required.
- `-O`: Output SDSL format instead of simple-sds format.
Example: `merge_gbwt -o merged large small1 small2` inserts the sequences from `small1.gbwt` and `small2.gbwt` into `large.gbwt` and writes the merged index to `merged.gbwt`.
- `-f`: Fast algorithm.
- `-i`: Insertion algorithm (default).
- `-p`: Parallel algorithm.
- `-b N`: Use batch size of `N` sequences. If the batch size is 0, all sequences are inserted as a single batch. Default: 2000.
- `-s N`: Use sample interval `N` in the inserted sequences. Set to 0 to sample only the end of each sequence. Default: 1024.
- `-C N`: Parallelize the search in chunks of `N` sequences per thread. Default: 1.
- `-J N`: Run `N` parallel merge jobs. Default: 4.
- `-M N`: Use `N` global merge buffers (merge `2^N` thread buffers before writing to disk). Default: 6.
- `-P N`: Set the size of the thread-specific (uncompressed) position buffers to `N` megabytes. Default: 64.
- `-S N`: Use `N` search threads. Default: number of logical CPU cores.
- `-t X`: Use directory `X` for temporary files. Default: `.` (the current directory).
- `-T N`: Set the size of the (compressed) thread buffers to `N` megabytes. Default: 256.
This algorithm is used through a member function of `DynamicGBWT`.

```cpp
void merge(const GBWT& source, size_type batch_size = MERGE_BATCH_SIZE, size_type sample_interval = SAMPLE_INTERVAL)
```
- `source`: Input GBWT.
- `batch_size`: Batch size in number of sequences. Use 0 to insert all sequences as a single batch. Default: 2000.
- `sample_interval`: Sample interval for the inserted sequences.
Example:

```cpp
DynamicGBWT input1;
sdsl::load_from_file(input1, input1_name);
GBWT input2;
sdsl::load_from_file(input2, input2_name);
input1.merge(input2);
```
This reads the index from file `input1_name` as a dynamic GBWT `input1` and the index from file `input2_name` as a compressed GBWT `input2`. Then it inserts the sequences from `input2` into `input1`.
The fast algorithm is a constructor of `GBWT`.

```cpp
GBWT(const std::vector<GBWT>& sources)
```
- `sources`: Input GBWTs in the order the sequences will be inserted.
Example:

```cpp
std::vector<GBWT> inputs(argc - 1);
for(int i = 1; i < argc; i++)
{
  std::string input_name = argv[i];
  sdsl::load_from_file(inputs[i - 1], input_name + GBWT::EXTENSION);
}
GBWT merged(inputs);
```
This reads the GBWT indexes for the base names specified in the command line arguments and merges them into a single index.
The parallel algorithm is a member function of `DynamicGBWT`.

```cpp
void merge(const DynamicGBWT& source, const MergeParameters& parameters)
```
- `source`: Input GBWT.
- `parameters`: Merging parameters.
The merging parameters are stored in struct `MergeParameters`, which is defined in `support.h`. Its default constructor creates a structure with the default parameters. The member functions silently adjust the parameters if they are outside the acceptable range.
- `void setPosBufferSize(size_type megabytes)`: Set the position buffer size to `megabytes` MB. Default 64, minimum 1, maximum 16384.
- `void setThreadBufferSize(size_type megabytes)`: Set the thread buffer size to `megabytes` MB. Default 256, minimum 1, maximum 16384.
- `void setMergeBuffers(size_type n)`: Set the number of merge buffers to `n`. Default 6, minimum 1, maximum 16.
- `void setChunkSize(size_type n)`: Set the chunk size used in the search phase to `n`. Default 1, minimum 1.
- `void setMergeJobs(size_type n)`: Set the number of merge jobs in the merge phase to `n`. Default 4, minimum 1, maximum 16.
Example:

```cpp
MergeParameters parameters;
parameters.setChunkSize(16);
DynamicGBWT input1;
sdsl::load_from_file(input1, input1_name);
DynamicGBWT input2;
sdsl::load_from_file(input2, input2_name);
input1.merge(input2, parameters);
```
This reads the index from file `input1_name` into `input1` and the index from file `input2_name` into `input2`. Then it merges the indexes using chunk size 16 but otherwise default parameters.
Jouni Sirén: Burrows-Wheeler transform for terabases. Proc. DCC 2016, IEEE, pp. 211-220, Snowbird, Utah, USA, March 29 - April 1, 2016. DOI: 10.1109/DCC.2016.17