Skip to content

Data Model

Jouni Siren edited this page Jan 21, 2025 · 20 revisions

Introduction

Graph BWT (GBWT) is a substring index for collections of similar paths in a graph. It is implemented as a multi-string BWT of the node sequences. Hence GBWT is a variant of the graph extension (gPBWT) of the positional BWT (PBWT).

We build BWT for the reverse sequences, or equivalently sort the reverse prefixes of the paths in lexicographic order. As a result, LF-mapping moves forward on the paths.

Note: paths and sequences are synonymous in GBWT. In general, we use "path" in high-level discussion and "sequence" when discussing implementation details. In some contexts, we use "path identifier" for the unoriented identifiers in metadata and "sequence identifier" for the identifiers of oriented paths stored in a bidirectional GBWT.

There is a separate page for further discussion on node and path identifiers.

Assumptions

  • We store arbitrary paths in a general graph.
  • The collection of paths is highly repetitive, making the BWT compressible with run-length encoding.
  • There are less than 2^32 nodes in the graph and less than 2^32 occurrences of each node in the collection.
  • The set of node identifiers is a dense subset of unsigned 32-bit integers in some range [a,b].
  • Each path is terminated by endmarker node 0 (ENDMARKER), which is not a real node of the graph.
  • Because the GBWT is a multi-string BWT, LF-mapping does not work correctly with the endmarker.
  • Path identifiers are integers starting from 0 denoting the order of insertion.

Empty paths

The GBWT assumes a node-centric graph where paths are sequences of node visits. Empty paths are supported but discouraged, because they are unintuitive. Unlike in an edge-centric graph where paths are sequences of node transitions, an empty path is not located anywhere in the graph. As a result, algorithms based on, for example, partitioning a graph into components may fail to handle empty paths properly.

Construction and input format

typedef sdsl::int_vector<0>        text_type;
typedef sdsl::int_vector_buffer<0> text_buffer_type;
typedef std::vector<short_type>    vector_type;

The input is a concatenation of 0-terminated integer sequences. It is stored as an integer vector: either in memory as int_vector<0> or vector<short_type>, or on disk as int_vector_buffer<0>. Until v0.4, std::vector<node_type> was used instead of vector_type.

Note: Because vector_type uses 32-bit integers, it cannot be initialized from an initializer list of node_type values. Use text_type instead.

GBWT construction is incremental. We insert a batch of sequences into a dynamic FM-index, extending each sequence in the batch simultaneously by one node.

Encoding

We split the BWT into records. Each record corresponds to a node v in the graph. It contains the part of the BWT corresponding to the reverse prefixes starting with node v (prefixes ending with node v). The information required by find() and extract() queries is stored locally in the records, while locate() queries use a global data structure.

See File Formats for further details.

Citing GBWT

Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict Paten, and Richard Durbin: Haplotype-aware graph indexes. Bioinformatics 36(2):400-407, 2020. DOI: 10.1093/bioinformatics/btz575

Other references

Richard Durbin: Efficient haplotype matching and storage using the Positional Burrows-Wheeler Transform (PBWT). Bioinformatics 30(9):1266-1272, 2014. DOI: 10.1093/bioinformatics/btu014

Adam M. Novak, Erik Garrison, and Benedict Paten: A graph extension of the positional Burrows-Wheeler transform and its applications. Algorithms for Molecular Biology 12:18, 2017. DOI: 10.1186/s13015-017-0109-9

Clone this wiki locally