-
Notifications
You must be signed in to change notification settings - Fork 13
Data Model
Graph BWT (GBWT) is a substring index for collections of similar paths in a graph. It is implemented as a multi-string BWT of the node sequences. Hence GBWT is a variant of the graph extension (gPBWT) of the positional BWT (PBWT).
We build BWT for the reverse sequences, or equivalently sort the reverse prefixes of the paths in lexicographic order. As a result, LF-mapping moves forward on the paths.
Note: paths and sequences are synonymous in GBWT. In general, we use "path" in high-level discussion and "sequence" when discussing implementation details. In some contexts, we use "path identifier" for the unoriented identifiers in metadata and "sequence identifier" for the identifiers of oriented paths stored in a bidirectional GBWT.
There is a separate page for further discussion on node and path identifiers.
- We store arbitrary paths in a general graph.
- The collection of paths is highly repetitive, making the BWT compressible with run-length encoding.
- There are less than 2^32 nodes in the graph and less than 2^32 occurrences of each node in the collection.
- The set of node identifiers is a dense subset of unsigned 32-bit integers in some range [a,b].
- Each path is terminated by endmarker node 0 (
ENDMARKER
), which is not a real node of the graph. - Because the GBWT is a multi-string BWT, LF-mapping does not work correctly with the endmarker.
- Path identifiers are integers starting from 0 denoting the order of insertion.
The GBWT assumes a node-centric graph where paths are sequences of node visits. Empty paths are supported but discouraged, because they are unintuitive. Unlike in an edge-centric graph where paths are sequences of node transitions, an empty path is not located anywhere in the graph. As a result, algorithms based on, for example, partitioning a graph into components may fail to handle empty paths properly.
typedef sdsl::int_vector<0> text_type;
typedef sdsl::int_vector_buffer<0> text_buffer_type;
typedef std::vector<short_type> vector_type;
The input is a concatenation of 0-terminated integer sequences. It is stored as an integer vector: either in memory as int_vector<0>
or vector<short_type>
, or on disk as int_vector_buffer<0>
. Until v0.4, std::vector<node_type>
was used instead of vector_type
.
Note: Because vector_type
uses 32-bit integers, it cannot be initialized from an initializer list of node_type
values. Use text_type
instead.
GBWT construction is incremental. We insert a batch of sequences into a dynamic FM-index, extending each sequence in the batch simultaneously by one node.
We split the BWT into records. Each record corresponds to a node v in the graph. It contains the part of the BWT corresponding to the reverse prefixes starting with node v (prefixes ending with node v). The information required by find()
and extract()
queries is stored locally in the records, while locate()
queries use a global data structure.
See File Formats for further details.
Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict Paten, and Richard Durbin: Haplotype-aware graph indexes. Bioinformatics 36(2):400-407, 2020. DOI: 10.1093/bioinformatics/btz575
Richard Durbin: Efficient haplotype matching and storage using the Positional Burrows-Wheeler Transform (PBWT). Bioinformatics 30(9):1266-1272, 2014. DOI: 10.1093/bioinformatics/btu014
Adam M. Novak, Erik Garrison, and Benedict Paten: A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms for Molecular Biology 12:18, 2017. DOI: 10.1186/s13015-017-0109-9