QLever performance evaluation and comparison to other SPARQL engines
Here are the results of a simple performance evaluation and comparison of QLever, Virtuoso, Blazegraph, GraphDB, Stardog, Apache Jena, and Oxigraph on a moderately sized dataset (DBLP) and on Wikidata (for those engines that provide a public endpoint). More engines and more datasets will be added in the future. However, since all of the metrics below scale essentially linearly with the size of the dataset (at least for QLever), the results on this one dataset are already quite telling.
All evaluations (of all engines) were run on an AMD Ryzen 9 7950X with 16 cores, 128 GB of RAM, and 7.1 TB of NVMe SSD. This is high-quality but affordable consumer hardware (as opposed to typical server hardware), with a total cost of around 2500 €.
The dataset used was the RDF dump of DBLP, version 02.04.2024 (1.8 GB compressed, 390 million triples, 68 predicates; see the SPARQL endpoint at https://qlever.cs.uni-freiburg.de/dblp).
The following table compares loading time (in seconds), loading speed (in million triples per second), and index size (in gigabytes). The next-to-last column shows the average query time for the small benchmark detailed in the next section. The last column gives a subjective assessment of how easy or hard it was to build the index and run queries: Blazegraph requires explicit chunking to load larger datasets, GraphDB's normal load takes forever, Virtuoso is old and error-prone with unusual interfaces, and the setup for Stardog was by far the most complicated of all; see the section "Command lines ..." below.
SPARQL engine | Code | Loading time | Loading speed | Index size | Avg. query time | Ease of setup |
---|---|---|---|---|---|---|
Oxigraph | Rust | 640s | 0.6 M/s | 67 GB | 93s | very easy |
Apache Jena | Java | 2392s | 0.2 M/s | 42 GB | 69s | very easy |
Stardog | Java | 724s | 0.5 M/s | 28 GB | 17s | many hurdles |
GraphDB | Java | 1066s | 0.4 M/s | 28 GB | 16s | some hurdles |
Blazegraph | Java | 6326s | <0.1 M/s | 67 GB | 4.3s | some hurdles |
Virtuoso | C | 561s | 0.7 M/s | 13 GB | 2.2s | many hurdles |
QLever | C++ | 231s | 1.7 M/s | 8 GB | 0.7s | very easy |
The following table compares query processing times on six queries from the "Examples" of https://qlever.cs.uni-freiburg.de/dblp. The queries were selected for their variety (see the "Comment" column), not to make a particular engine look particularly good or bad. For each engine, the query times were measured end to end, including downloading the result (see the curl sketch after the table), after emptying the disk cache with sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches" and starting the respective server from scratch. For QLever, its internal cache was additionally cleared after each query (which puts QLever at a disadvantage). For the other engines, no such precautions were taken. There was no significant (IO-heavy or CPU-heavy) activity on the machine during the evaluation. The > in one of the table cells below indicates that Virtuoso, due to an internal limitation, downloaded only 1,048,576 of the roughly 7 million results for the respective query.
Query | Result shape | Oxigraph | Jena | Stardog | GraphDB | Blazegraph | Virtuoso | QLever | Comment |
---|---|---|---|---|---|---|---|---|---|
All papers published in SIGIR | 6264 x 3 | 1.6s | 0.3s | 0.52s | 0.17s | 0.47s | 0.54s | 0.02s | Two simple joins, nothing special |
Number of papers by venue | 19954 x 2 | 2.6s | 28s | 2.0s | 3.1s | 1.2s | 1.0s | 0.02s | Scan of a single predicate with GROUP BY and ORDER BY |
Author names matching REGEX | 513 x 3 | 5.6s | 4.8s | 0.61s | 0.29s | 0.27s | 0.98s | 0.05s | Joins, GROUP BY, ORDER BY, FILTER REGEX |
All papers in DBLP until 1940 | 70 x 4 | 313s | 50s | 16s | 0.04s | 5.9s | 0.08s | 0.11s | Three joins, a FILTER, and an ORDER BY |
All papers with their title | 7167122 x 2 | 132s | 54s | 44s | 20s | 18s | >9.1s | 4.2s | Simple, but must materialize large result (problematic for many SPARQL engines) |
All predicates ordered by size | 68 x 3 | 106s | 279s | 37s | 72s | 0.05s | 1.48s | 0.01s | Conceptually requires a scan over all triples, but huge optimization potential |
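All times were taken with the qlever example-queries command shown in the per-engine command lines below. As a rough illustrative sketch (not part of the actual evaluation), a single query can be timed end to end with curl in a similar fashion; the endpoint URL and query here are placeholders:

curl -s -o /dev/null -w "%{time_total} s\n" localhost:8015/query -H "Accept: application/sparql-results+json" --data-urlencode "query=SELECT * WHERE { ?s ?p ?o } LIMIT 10"   [placeholder endpoint and query]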
The following is a performance comparison of four SPARQL endpoints for Wikidata on 298 example queries from the Wikidata Query Service. Column 6 provides the average query time only for those queries that did not fail; this gives an undue advantage to engines where many queries fail, so this number should be taken with a grain of salt. Note that MillenniumDB does not appear in the DBLP evaluation above because even for that simple benchmark, with only six relatively straightforward queries, half of the queries fail.
Wikidata SPARQL endpoint | query time <= 1.0s | (1.0s, 5.0s] | > 5.0s | failed | avg. query time | median query time |
---|---|---|---|---|---|---|
Wikidata Query Service | 36% of all queries | 20% | 23% | 21% | 6.98s | 2.47s |
QLever | 78% of all queries | 11% | 9% | 2% | 1.38s | 0.24s |
Virtuoso | 54% of all queries | 15% | 20% | 11% | 4.11s | 0.74s |
MillenniumDB | 12% of all queries | 22% | 11% | 55% | 6.05s | > 50% failed |
Results were obtained using the following command line, which launches the queries one after the other and records the time and result size (or whether the query failed):
qlever example-queries --get-queries-cmd "cat wikidata-queries.tsv" --download-or-count download --accept application/sparql-results+json --sparql-endpoint <URL of SPARQL endpoint>
The numbers for the table were obtained from the respective outputs using this command line:
for RESULTS in wikidata-queries.*-results.txt; do
  printf "%-45s %4d %4d %4d %4d %7.2fs %6s\n" $RESULTS \
    $(cat $RESULTS | \grep -o " [0-9]\+\.[0-9][0-9] s" | sed 's/[ s]//g' | awk '$1 <= 1.0' | wc -l) \
    $(cat $RESULTS | \grep -o " [0-9]\+\.[0-9][0-9] s" | sed 's/[ s]//g' | awk '$1 > 1.0 && $1 <= 5.0' | wc -l) \
    $(cat $RESULTS | \grep -o " [0-9]\+\.[0-9][0-9] s" | sed 's/[ s]//g' | awk '$1 > 5.0' | wc -l) \
    $(cat $RESULTS | \grep -o " failed " | wc -l) \
    $(\grep ^AVERAGE $RESULTS | \grep -o ' [0-9]\+\.[0-9][0-9] s' | sed 's/[ s]//g') \
    $(cat $RESULTS | \grep "^ *[0-9]\+ " | sed 's/ failed / 60.00 s /g' | \grep -o " [0-9]\+\.[0-9][0-9] s" | sed 's/[ s]//g' | datamash median 1 | xargs printf "%6.2fs" | sed 's/60.00s/failed/')
done
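For each results file, this extracts the per-query times, counts how many fall into each of the three time ranges and how many queries failed, reads off the average query time from the AVERAGE line of the file, and computes the median query time (with failed queries counted as 60.00 s, which is where the "> 50% failed" entry in the table above comes from).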
For each engine, we created a folder with only the input file dblp.ttl.gz and a file queries.tsv, obtained via curl -s https://qlever.cs.uni-freiburg.de/api/examples/dblp | sed -n '3p;4p;5p;6p;10p;15p' > queries.tsv (see below for the contents). For Virtuoso, there was also the config file virtuoso.ini (with generous settings regarding memory consumption). For QLever, there was the config file Qleverfile (with standard settings).
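As a rough sketch, setting up such a folder might look as follows; the wget URL is an assumption (it is the standard location of the latest DBLP dump and may differ from the exact version of 02.04.2024 used in the evaluation):

mkdir <engine> && cd <engine>
wget https://dblp.org/rdf/dblp.ttl.gz   [assumed URL of the latest DBLP RDF dump]
curl -s https://qlever.cs.uni-freiburg.de/api/examples/dblp | sed -n '3p;4p;5p;6p;10p;15p' > queries.tsv

The command lines for the individual engines follow.

Oxigraph: build from source, load the data, then start a read-only server and run the queries: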
git clone --recursive git@github.com:oxigraph/oxigraph.git
cd oxigraph/cli && cargo build --release && export PATH=$PATH:<oxigraph dir>/target/release
oxigraph load -f dblp.ttl.gz -l .
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
oxigraph serve-read-only -l . -b localhost:8015
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:8015/query
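Apache Jena: download Jena and Fuseki 5.0.0, install OpenJDK 21, build the TDB2 index with tdb2.xloader, then serve it with Fuseki: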
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-5.0.0.zip
unzip apache-jena-fuseki-5.0.0.zip && rm -f $_
wget https://dlcdn.apache.org/jena/binaries/apache-jena-5.0.0.zip
unzip apache-jena-5.0.0.zip && rm -f $_
sudo apt update && sudo apt install -y openjdk-21-jdk
sudo update-alternatives --config java   [select JDK 21 (auto mode)]
apache-jena-5.0.0/bin/tdb2.xloader --loc data dblp.ttl.gz
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
java -jar apache-jena-fuseki-5.0.0/fuseki-server.jar --port 8015 --loc data /dblp
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:8015/dblp
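Stardog: install version 10.0.1 from the Stardog APT repository, bulk-load the data, then restart the server with security disabled for querying: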
sudo apt install gnupg
curl http://packages.stardog.com/stardog.gpg.pub | sudo apt-key add -
echo "deb http://packages.stardog.com/deb/ stable main" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get install -y stardog=10.0.1
sudo apt-get install bash-completion
source /opt/stardog/bin/stardog-completion.sh
sed -i 's/UseParallelOldGC/UseParallelGC/' /opt/stardog/bin/helpers.sh
export STARDOG_SERVER_JAVA_ARGS="-Xms20g -Xmx20g"
export STARDOG_PROPERTIES=$(pwd) && echo "memory.mode = bulk_load" > stardog.properties
stardog-admin server start
stardog-admin db create -n dblp dblp.ttl.gz
stardog-admin server stop
rm -f stardog.properties
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
stardog-admin server start --disable-security
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:5820/dblp/query
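GraphDB: download the platform-independent distribution (requires filling out a form), create a repository named dblp via the console, preload the data, then start the server: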
Fill out form on https://www.ontotext.com/products/graphdb/download/
Click on link "Platform-independent distribution" in mail and download graphdb-10.6.2-dist.zip
unzip graphdb-10.6.2-dist.zip && rm -f $_
graphdb-10.6.2/bin/console
> create graphdb [ID = dblp, rest = default]
> quit
graphdb-10.6.2/bin/importrdf preload -f -i dblp dblp.ttl.gz
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
graphdb-10.6.2/bin/graphdb
curl -s localhost:7200/repositories/dblp --data-urlencode 'query=SELECT * { ?s ?p ?o } LIMIT 1' [minimal warmup]
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:7200/repositories/dblp
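Blazegraph: download the standalone JAR, convert the input to N-Triples and split it into chunks of one million triples each (Blazegraph cannot load the full file in one go), load the chunks one by one via SPARQL UPDATE, then restart the server: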
wget https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_2_1_6_RC/blazegraph.jar
java -server -Xmx20g -jar blazegraph.jar &
docker run -it --rm -v $(pwd):/data stain/jena riot --output=NT /data/dblp.ttl.gz | split -a 3 --numeric-suffixes=1 --additional-suffix=.nt -l 1000000 --filter='gzip > $FILE.gz' - dblp-
for CHUNK in dblp-???.nt.gz; do curl -s localhost:9999/blazegraph/namespace/kb/sparql --data-binary update="LOAD <file://$(pwd)/${CHUNK}>"; done
kill %1
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
java -server -Xmx20g -jar blazegraph.jar &
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:9999/blazegraph/namespace/kb/sparql
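Virtuoso: build from source with a patch that raises the limit on the number of result rows per query, adjust virtuoso.ini, load the data via isql, then start the server: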
git clone https://github.com/openlink/virtuoso-opensource virtuoso
sudo apt install -y autoconf automake libtool flex bison gperf gawk m4 make openssl
In libsrc/Wi/sparql_io.sql, change maxrows := 1024*1024 to maxrows := 2*1024*1024 - 2 [see https://github.com/openlink/virtuoso-opensource/issues/700]
./autogen.sh && ./configure && make && sudo make install
In virtuoso.ini, change: ServerPort = 8888, NumberOfBuffers and MaxDirtyBuffers = presets for 64 GB of free RAM, DefaultHost = hostname:8890, DirsAllowed = directory with the input files, ResultSetMaxRows = 2000000, MaxQueryCostEstimationTime = 3600, MaxQueryExecutionTime = 3600, MaxQueryMem = 20G
isql-vt 8888
SQL> ld_dir('/local/data/qlever/qlever-indices/virtuoso-playground.ssd', 'dblp.ttl.gz', '');
SQL> DB.DBA.rdf_loader_run();
SQL> checkpoint;
SQL> exit;
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
/usr/bin/virtuoso-t -f &
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:8890/sparql
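QLever: install the qlever command-line tool via pip, build the index, then start the server (both controlled by the Qleverfile mentioned above):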
pip install qlever
qlever index
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
qlever start
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:7015
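Contents of queries.tsv (one line per query: the name of the query, followed by the SPARQL query):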
All papers published in SIGIR PREFIX dblp: <https://dblp.org/rdf/schema#> SELECT ?paper ?title ?year WHERE { ?paper dblp:title ?title . ?paper dblp:publishedIn "SIGIR" . ?paper dblp:yearOfPublication ?year } ORDER BY DESC(?year)
Number of papers by venue PREFIX dblp: <https://dblp.org/rdf/schema#> SELECT ?venue (COUNT(?paper) as ?count) WHERE { ?paper dblp:publishedIn ?venue } GROUP BY ?venue ORDER BY DESC(?count)
Author names matching REGEX PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dblp: <https://dblp.org/rdf/schema#> SELECT ?author ?author_label ?count WHERE { { SELECT ?author ?author_label (COUNT(?paper) as ?count) WHERE { ?paper dblp:authoredBy ?author . ?paper dblp:publishedIn "SIGIR" . ?author rdfs:label ?author_label } GROUP BY ?author ?author_label } FILTER REGEX(STR(?author_label), "M.*D.*", "i") } ORDER BY DESC(?count)
All papers in DBLP until 1940 PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dblp: <https://dblp.org/rdf/schema#> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> SELECT ?title ?author ?author_label ?year WHERE { ?paper dblp:title ?title . ?paper dblp:authoredBy ?author . ?paper dblp:yearOfPublication ?year . ?author rdfs:label ?author_label . FILTER (?year <= "1940"^^xsd:gYear) } ORDER BY ASC(?year) ASC(?title)
All papers with their title (large result) PREFIX dblp: <https://dblp.org/rdf/schema#> SELECT ?paper ?title WHERE { ?paper dblp:title ?title }
All predicates, ordered by number of subjects SELECT ?predicate (COUNT(?subject) as ?count) WHERE { ?subject ?predicate ?object } GROUP BY ?predicate ORDER BY DESC(?count)