Timing support for GRAPH and FROM NAMED clauses in qlever #1501

Stiksels · 2024-09-17T11:14:03Z

@joka921 with PR #1444 merged, there is now an N-Quads parser and it seems to work. However, the SPARQL server doesn't support the GRAPH and FROM NAMED clauses yet, so the named graphs are still queried as if they were all part of one large graph. Do you have a timeline for when these clauses will be supported?

Originally posted by @Stiksels in #1334 (comment)


hannahbast · 2024-09-17T11:45:04Z

@Stiksels You can already try out #1445 . If that is hard for you, we can build a docker image for you to test it. We have some deadline stress this week. We hope to have this in the master next week.

Stiksels · 2024-09-18T07:37:27Z

Hi @hannahbast, thanks for the reply. I would be happy to test this with a custom Docker image, but that would also be sometime next week.

sennierer · 2024-09-20T13:47:30Z

Tested the PR with an nq file with 3 million triples.
Parsing worked, but it does not yet seem to take the named graphs (NGs) into account. Putting one of the NGs in FROM <NG> returned the same number of triples as the query against the default graph. FROM NAMED, or using a variable for the graph, is reported as not implemented yet.
For testing I just checked out #1445 (server log shows: "QLever Server, compiled on Fri Sep 20 09:55:28 UTC 2024 using git hash ef5743"), created a Docker image from it, and used that with the current main of the qlever-control repo. I might have missed something.
I also noticed that while my nq file has a little over 3 million triples and the statistics show the same number, counting the triples with a SPARQL query returns slightly fewer than 3 million triples.
Still, thanks for the hard work and the great (open-source) piece of software. I am very happy that NG support is going to land in QLever; that's the last feature we need for using it in one of our new projects.
I am of course happy to test new versions of this with our data and/or retry if I did something wrong.
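
The gap between the number of statements in the nq file and the count returned by SPARQL can be investigated offline. The thread does not establish the cause; one thing worth ruling out (an assumption, not a confirmed explanation) is that the file repeats statements, or contains the same triple in more than one named graph. A minimal Python sketch, using a simplified term regex that is not a full N-Quads parser:

```python
import re

# Simplified pattern for one RDF term: IRI, blank node, or literal with an
# optional language tag or datatype. This is an approximation of N-Quads.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'
)

def count_statements(lines):
    """Return (statements, distinct quads, distinct triples ignoring the graph)."""
    quads, triples = set(), set()
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        terms = TERM.findall(line)
        if len(terms) not in (3, 4):
            raise ValueError(f"unexpected line: {line!r}")
        total += 1
        quads.add(tuple(terms))
        triples.add(tuple(terms[:3]))
    return total, len(quads), len(triples)

# Hypothetical sample data, not taken from the dataset discussed here.
sample = [
    '<http://example.org/s1> <http://example.org/p> "a b"@nl <http://example.org/g1> .',
    '<http://example.org/s1> <http://example.org/p> "a b"@nl <http://example.org/g2> .',
    '<http://example.org/s1> <http://example.org/p> "a b"@nl <http://example.org/g2> .',
    '<http://example.org/s2> <http://example.org/p> "1"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.org/g1> .',
]
print(count_statements(sample))
```

If the three numbers differ for the real file, duplicates are at least one candidate for the difference; if they are equal, the cause lies elsewhere.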

sennierer · 2024-09-20T14:04:23Z

For queries that filter rather than just count all triples, I get the following error:

Error processing query

Assertion `numBlocksTotal == details.numBlocksRead_ || !limitOffset.isUnconstrained()` failed. Please report this to the developers. In file "/app/src/index/CompressedRelation.cpp " at line 254

sennierer · 2024-09-24T13:38:02Z

With the latest commits in #1445 (a45668c), the error mentioned in the last comment is gone, and FROM clauses work with my installation and test data!

joka921 · 2024-09-25T18:54:59Z

@sennierer
Can you try the latest version of #1445? (Best rebuild your index; this is still work in progress.)
In particular, you should now also get the correct results for queries like
SELECT (COUNT(?x) as ?count) FROM <graph> {?x ?p ?o}
For some of those queries we have optimizations that so far ignored the named graphs completely; they should now yield the expected results.

In particular, what will work with #1445 is

SELECT FROM <iri>
GRAPH <iri> {...}

What will not yet work, but will follow soon, is

SELECT FROM NAMED ...
GRAPH ?variable {...}
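
As an illustration of the two forms that restrict a query to a single graph with a fixed IRI, here is a small Python sketch that builds both query strings. The helper names and the example graph IRI are hypothetical; only the SPARQL syntax itself is standard.

```python
def count_with_from(graph_iri: str) -> str:
    """COUNT query restricted to one graph via a FROM clause with a fixed IRI."""
    return f"SELECT (COUNT(?x) AS ?count) FROM <{graph_iri}> WHERE {{ ?x ?p ?o }}"

def count_with_graph(graph_iri: str) -> str:
    """COUNT query restricted to one graph via a GRAPH clause with a fixed IRI."""
    return (
        f"SELECT (COUNT(?x) AS ?count) "
        f"WHERE {{ GRAPH <{graph_iri}> {{ ?x ?p ?o }} }}"
    )

EXAMPLE_GRAPH = "http://example.org/graph1"  # hypothetical graph IRI

print(count_with_from(EXAMPLE_GRAPH))
print(count_with_graph(EXAMPLE_GRAPH))
```

The forms listed as not yet working differ in that the graph is either declared with FROM NAMED or left as a variable inside GRAPH.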

hannahbast · 2024-09-25T19:24:13Z

@Stiksels By the end of this week or the beginning of next week, this will be merged into master, and then the latest Docker image will also support FROM (not yet FROM NAMED) and GRAPH clauses with a fixed IRI (these are the most important use cases).

Stiksels · 2024-09-26T11:57:27Z

@hannahbast thanks for the update, looking forward to testing this! For our specific use case, datasets with many Named Entity Graphs, the GRAPH ?variable { ?s ?p ?o } feature will be very important too.

Is there a different PR that tracks this feature that I can follow?

sennierer · 2024-09-26T13:40:04Z

@joka921
I can confirm that with the new build (fdf4716), counting across one of the NGs with the query SELECT (COUNT(?x) as ?count) FROM <graph> {?x ?p ?o} returns exactly the same count (1,155,464) as running the same query in the Fuseki instance where we currently host the dataset. I can also confirm that the query is much (!) faster in QLever than in Fuseki (33 ms in QLever vs. 9.752 seconds in Fuseki; not directly comparable though, as QLever was running on my laptop while Fuseki is running in our cluster).

hannahbast · 2024-09-26T16:13:35Z

> @hannahbast thanks for the update, looking forward to test this! For our specific use case, datasets with many Named Entity Graphs, the GRAPH ?variable { ?s ?p ?o } feature will be very important too.
>
> Is there a different PR that tracks this feature that I can follow?

@Stiksels Can you provide a (not unnecessarily complex) example query for your use case?

Stiksels · 2024-09-27T07:50:00Z

@hannahbast were you able to download the compressed NQ dataset from #1468? This dataset contains Events (~2 million) and the Locations (~100k) where they took place. Each entity is stored in its own named graph. This would be a typical query to fetch both event and location details:

SELECT ?event ?nl_title (MIN(?startTime) AS ?minStartTime) (MAX(?endTime) AS ?maxEndTime) ?location ?nl_locationName ?postcode ?fulladdress 
WHERE {
  GRAPH ?event {
    ?event <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E7_Activity> .
    ?event <http://purl.org/dc/terms/title> ?nl_title FILTER(LANG(?nl_title) = "nl") .
    ?event <https://data.vlaanderen.be/ns/cultuurparticipatie#ruimtetijd>/<http://www.cidoc-crm.org/cidoc-crm/P161_has_spatial_projection> ?location .
    ?event <https://data.vlaanderen.be/ns/cultuurparticipatie#ruimtetijd>/<http://www.cidoc-crm.org/cidoc-crm/P160_has_temporal_projection> ?period .
    ?period <http://data.europa.eu/m8g/startTime> ?startTime .
    ?period <http://data.europa.eu/m8g/endTime> ?endTime FILTER(?endTime > (NOW()) ) .
  }
  GRAPH ?location {
    ?location <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/terms/Location> .
    ?location <http://www.w3.org/ns/locn#locatorName> ?nl_locationName FILTER(LANG(?nl_locationName)) . 
    ?location <http://www.w3.org/ns/locn#address>/<http://www.w3.org/ns/locn#fullAddress> ?fulladdress FILTER(LANG(?fulladdress) = "nl" ) .
    ?location <http://www.w3.org/ns/locn#address>/<http://www.w3.org/ns/locn#postcode> ?postcode .
  }
}
GROUP BY ?event ?nl_title ?location ?nl_locationName ?postcode ?fulladdress
ORDER BY ?postcode ?maxEndTime 
LIMIT 1000

hannahbast · 2024-09-27T19:22:14Z

@Stiksels Thanks! I have just pushed a beta version with basic support for named graphs. You can try it with docker pull adfreiburg/qlever:named-graphs-beta and editing your Qleverfile to use that image.

And can you please provide the link to your dataset again? I was too late for each of your two posts regarding this in #1468

Stiksels · 2024-09-28T09:24:50Z

@hannahbast with this docker image, the IndexBuilder is significantly slower to parse the triples (0.2M/s) vs the image with tag latest (1.1M/s):

named-graphs-beta

Command: index

echo '{ "locale": { "language": "nl", "country": "BE", "ignore-punctuation": true }, "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > uit-activiteiten.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.uit-activiteiten docker.io/adfreiburg/qlever:named-graphs-beta -c 'zcat data-nq/*.nq.gz | IndexBuilderMain -F nq - -i uit-activiteiten -s uit-activiteiten.settings.json --stxxl-memory 5G | tee uit-activiteiten.index-log.txt'

2024-09-28 09:19:29.834 - INFO: QLever IndexBuilder, compiled on Fri Sep 27 19:04:26 UTC 2024 using git hash 14e610
2024-09-28 09:19:29.838 - INFO: You specified the input format: NQ
2024-09-28 09:19:29.838 - INFO: Processing input triples from /dev/stdin ...
2024-09-28 09:19:29.840 - INFO: You specified "locale = nl_BE" and "ignore-punctuation = 1"
2024-09-28 09:19:29.840 - WARN: You are using Locale settings that differ from the default language or country.
        This should work but is untested by the QLever team. If you are running into unexpected problems,
        Please make sure to also report your used locale when filing a bug report. Also note that changing the
        locale requires to completely rebuild the index
2024-09-28 09:19:29.842 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files with a well-behaved use of newlines
2024-09-28 09:19:29.842 - INFO: You specified "num-triples-per-batch = 100,000", choose a lower value if the index builder runs out of memory
2024-09-28 09:19:29.842 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2024-09-28 09:19:29.984 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-09-28 09:20:30.469 - INFO: Triples parsed: 10,000,000 [average speed 0.2 M/s, last batch 0.2 M/s, fastest 0.2 M/s, slowest 0.2 M/s]
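
The reported average can be cross-checked from the two timestamps in the log above ("Parsing input triples ..." and "Triples parsed: 10,000,000"). A minimal Python sketch of that arithmetic; the function name is hypothetical, and the timestamps and triple count are copied from the log:

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

def parse_rate(start: str, end: str, triples: int) -> float:
    """Triples per second between two log timestamps."""
    elapsed = (
        datetime.strptime(end, LOG_TIME_FORMAT)
        - datetime.strptime(start, LOG_TIME_FORMAT)
    ).total_seconds()
    return triples / elapsed

# Start of parsing and the first progress line, taken from the log above.
rate = parse_rate("2024-09-28 09:19:29.984", "2024-09-28 09:20:30.469", 10_000_000)
print(f"{rate / 1e6:.2f} M/s")  # prints "0.17 M/s"
```

About 60.5 seconds for 10 million triples gives roughly 0.17 M/s, which is consistent with the rounded 0.2 M/s in the log line.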

Stiksels · 2024-09-28T09:31:50Z

> @Stiksels Thanks! I have just pushed a beta version with basic support for named graphs. You can try it with docker pull adfreiburg/qlever:named-graphs-beta and editing your Qleverfile to use that image.
>
> And can you please provide the link to your dataset again? I was too late for each of your two posts regarding this in #1468

I uploaded the file to Google Drive and shared it with [email protected]. This is the download link:
https://drive.google.com/file/d/1neV60Tch4bWzkhPAxE4o195QxBqF6qNo/view?usp=sharing

hannahbast · 2024-09-28T22:05:47Z

@Stiksels I have now managed to download your dataset and build a QLever index for it in 2.5 minutes. The query uses the GRAPH variable also inside the GRAPH clause (?event for the first clause, ?location for the second). This looks like a very unusual pattern: a graph IRI is usually not also the subject of a triple. Are you sure this is what you intended?

The query returns 30,589 rows, the first one is as follows. Is this correct?

?event	?nl_title	?minStartTime	?maxEndTime	?location	?nl_locationName	?postcode	?fulladdress
<https://data.publiq.be/id/event/udb/4a91540c-e3b8-4f30-b051-673362b6f6fa>	"Pleun Van Engelen - Liefs, Achilles"@nl	2025-06-06T18:00:00+00:00	2025-06-06T19:30:00+00:00	<https://data.publiq.be/id/place/udb/1764be5c-544a-4fbc-ad5d-ebceb759a654>	"Op locatie"@nl	"-"	"- -, - -, BE"@nl

Stiksels · 2024-09-30T07:46:29Z

@hannahbast thanks for the feedback. I'll look into updating the graph IRI so that it differs from the subject IRI of the event/location inside.

Do you have any explanation for why there is such a difference in index-building speed between your setup and mine? With the named-graphs-beta image, parsing is consistently at 0.2M/s. With the latest image, it is fast at 1.2M/s, but I have never been able to build an index for this dataset in less than 5 minutes...

Stiksels · 2024-10-02T14:47:04Z

With the latest Docker image, which contains #1520, the GRAPH and FROM NAMED clauses are now supported. I'll close this issue.

Stiksels closed this as completed Oct 2, 2024
