-
Notifications
You must be signed in to change notification settings - Fork 56
Basic Terminology and Design Principles
By an "index" for an RDF dataset, we mean data structures that are pre-computed in order to be able to process SPARQL queries on the data fast. The main data structures of a QLever index are as follows:
The vocabulary is a list of all distinct IRIs and literals that occur in a triple from the dataset. Each IRI or literal from the vocabulary has a distinct IDs. The IDs are contiguous or almost contiguous.
All IRIs or literals of the same "type" have the property that if you order them by their ID, you also get their "natural" order. For example, for any two IRIs x and y, ID(x) < ID(y) if and only if x < y lexicographically. Similarly, for any two literals x and y of type integer (xsd:int), ID(x) < ID(y) if and only if x < y numerically. There are currently five such types: IRIs, integers, decimals, dates, and other literals. TODO: check.
The vocabulary is split into two parts: The "internal" vocabulary, which is stored in memory (limited space, fast access), and the "external" vocabulary, which is stored on disk ("unlimited" space, slower access). The rules for which word goes into which parts can be configured. For example, it can be specified that all IRIs with a particular prefix are in the external vocabulary. Or all literals over a specified length. Or all literals of a certain type. Or all literals of a certain language.
An index permutation is a list of all triples, where each IRI and literal is represented by its ID, in a particular order. Depending on the order, the representation can and is compressed in a variety of ways.
QLever currently stores index permutations in six sort orders: PSO, POS, SPO, SOP, OSP, OPS. Here S stands for subject, P stands for predicate, and O stands for object. For example, sort order PSO means that two triples are compared using the predicate ID as the primary key, and if those are equal, the subject ID as secondary key, and if those are equal too, the object as tertiary key. Naturally, in this order, there will be long runs of triples with the same predicate. QLever stores the predicate ID only once along with the index (in the sorted order) of the first triple with that predicate.
QLever also optionally precompute a list of distinct so-called patterns. A pattern of a subject is the set of all distinct predicates such that a triple with that subject and predicate exists in the input data. Naturally, many subjects share the same pattern. For example, on FreebaseEasy (a simple general-purpose knowledge graph with 360M triples) many entities of type <Person>
have exactly the three predicates <is-a>
, <Gender>
, and <Date_of_birth>
.
These patterns are needed for QLever's fast SPARQL autocompletion. But they are also useful for a variety of common statistics queries. TODO: given an example.