Timing support for GRAPH and FROM NAMED clauses in qlever #1501

Closed
Stiksels opened this issue Sep 17, 2024 · 17 comments

Comments

@Stiksels

@joka921 with PR #1444 merged, there is now an N-Quads parser and it seems to work. However, the SPARQL server doesn't support the GRAPH and FROM NAMED clauses yet, so the named graphs are still queried as if they were all part of one large graph. Do you have a timeline for when these clauses will be supported?

Originally posted by @Stiksels in #1334 (comment)

@hannahbast
Member

@Stiksels You can already try out #1445. If that is hard for you, we can build a Docker image for you to test it. We have some deadline stress this week, but we hope to have this in master next week.

@Stiksels
Author

Hi @hannahbast, thanks for the reply. I would be happy to test this with a custom Docker image, but that would also be sometime next week.

@sennierer

I tested the PR with an nq file with 3m triples.
Parsing worked, but it does not seem to take the NGs into account yet. Putting one of the NGs in FROM <NG> returned the same number of triples as the query against the default graph. FROM NAMED, or using a variable for the graph, is reported as not implemented yet.
For testing I just checked out #1445 (server log shows: "QLever Server, compiled on Fri Sep 20 09:55:28 UTC 2024 using git hash ef5743"), created a Docker image from it, and used that with the current main of the qlever-control repo. I might have missed something.
I also noticed that while my nq file has a little over 3m triples and the statistics show the same number, counting the triples with a SPARQL query returns a little less than 3m triples.
However, thanks for the hard work and the great (open-source) piece of software. I am very happy that NG support is going to land in QLever; that's the last feature we need to use it in one of our new projects.
Obviously I am happy to test new versions of this with our data and/or retry if I did something wrong.

@sennierer

For queries that filter rather than just count all triples, I get the following error:

Error processing query

Assertion `numBlocksTotal == details.numBlocksRead_ || !limitOffset.isUnconstrained()` failed. Please report this to the developers. In file "/app/src/index/CompressedRelation.cpp " at line 254

@sennierer

With the latest commits in #1445 (a45668c), the error mentioned in my last comment is gone, and FROM clauses work with my installation and test data!

@joka921
Member

joka921 commented Sep 25, 2024

@sennierer
Can you try the latest version of #1445? (It is best to rebuild your index; this is still work in progress.)
In particular, you should now also get the correct results for queries like
SELECT (COUNT(?x) as ?count) FROM <graph> {?x ?p ?o}
For some of those we have optimizations that so far ignored the named graphs completely; they should now yield the expected results.

In particular, what will work with #1445 is

  • SELECT ... FROM <graph>
  • GRAPH <graph> {...}

What will not yet work, but will follow soon, is

  • SELECT FROM NAMED ...
  • GRAPH ?variable {...}
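
For readers unfamiliar with the distinction between these clauses, here is a minimal Python sketch of their semantics as defined by SPARQL. It is a toy model over (subject, predicate, object, graph) tuples, not QLever code; the IRIs and the function names `from_graphs`, `graph_fixed` and `graph_variable` are made up for illustration.

```python
# Hypothetical quads: (subject, predicate, object, graph).
QUADS = [
    ("<a>", "<p>", "<b>", "<g1>"),
    ("<c>", "<p>", "<d>", "<g1>"),
    ("<a>", "<p>", "<e>", "<g2>"),
]


def from_graphs(quads, graphs):
    """FROM <g> ...: the default graph becomes the merge of the listed graphs."""
    return {(s, p, o) for s, p, o, g in quads if g in graphs}


def graph_fixed(quads, graph):
    """GRAPH <g> { ?s ?p ?o }: match only inside the one named graph."""
    return [(s, p, o) for s, p, o, g in quads if g == graph]


def graph_variable(quads, named=None):
    """GRAPH ?g { ?s ?p ?o }: range over the named graphs and bind ?g.

    `named` plays the role of FROM NAMED: if given, only those graphs
    are candidates for ?g.
    """
    return [(g, s, p, o) for s, p, o, g in quads if named is None or g in named]


print(len(from_graphs(QUADS, {"<g1>"})))         # triples visible via FROM <g1>
print(len(graph_fixed(QUADS, "<g2>")))           # triples inside GRAPH <g2>
print(len(graph_variable(QUADS, named={"<g1>"})))  # rows for GRAPH ?g with FROM NAMED <g1>
```

The first two functions correspond to the forms that work with #1445; the third corresponds to the forms that were still missing at this point in the thread.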

@hannahbast
Member

@Stiksels By the end of this week or the beginning of next week, this will be merged into master, and then the latest Docker image will also support FROM (not yet FROM NAMED) and GRAPH clauses with a fixed IRI (these are the most important use cases).

@Stiksels
Author

@hannahbast thanks for the update, looking forward to testing this! For our specific use case, datasets with many Named Entity Graphs, the GRAPH ?variable { ?s ?p ?o } feature will be very important too.

Is there a different PR that tracks this feature that I can follow?

@sennierer

@joka921
I can confirm that with the new build (fdf4716), a count across one of the NGs with the query SELECT (COUNT(?x) as ?count) FROM <graph> {?x ?p ?o} returns exactly the same count (1,155,464) as running the same query in the Fuseki instance where we currently host the dataset. I can also confirm that the query is much (!) faster in QLever than in Fuseki (33 ms in QLever vs 9.752 seconds in Fuseki; not comparable though, as QLever was running on my laptop while Fuseki is running in our cluster).
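
A cross-check like this can be scripted so that the same query string is sent to both servers. The sketch below only builds the query and the request URL for the standard SPARQL protocol `query` parameter; it does not send anything. The endpoint URL and graph IRI are placeholders, and `count_query` and `request_url` are hypothetical helper names.

```python
from urllib.parse import urlencode


def count_query(graph_iri: str) -> str:
    """Build the per-graph COUNT query used in this thread for one graph IRI."""
    return f"SELECT (COUNT(?x) AS ?count) FROM <{graph_iri}> {{ ?x ?p ?o }}"


def request_url(endpoint: str, graph_iri: str) -> str:
    """SPARQL protocol GET request: the query goes into the `query` parameter."""
    return endpoint + "?" + urlencode({"query": count_query(graph_iri)})


# Placeholder values; substitute your own endpoint and graph IRI.
ENDPOINT = "http://localhost:7001"
GRAPH = "http://example.org/graph"

print(count_query(GRAPH))
print(request_url(ENDPOINT, GRAPH))
```

Pointing `ENDPOINT` at each server in turn would make the two counts directly comparable, since both receive an identical query.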

@hannahbast
Member

> @hannahbast thanks for the update, looking forward to testing this! For our specific use case, datasets with many Named Entity Graphs, the GRAPH ?variable { ?s ?p ?o } feature will be very important too.
>
> Is there a different PR that tracks this feature that I can follow?

@Stiksels Can you provide a (not unnecessarily complex) example query for your use case?

@Stiksels
Author

@hannahbast were you able to download the compressed NQ dataset from #1468? This dataset contains Events (~2 million) and the Locations (~100k) where they took place. Each entity is stored in its own named graph. This would be a typical query to fetch both event and location details:

SELECT ?event ?nl_title (MIN(?startTime) AS ?minStartTime) (MAX(?endTime) AS ?maxEndTime) ?location ?nl_locationName ?postcode ?fulladdress 
WHERE {
  GRAPH ?event {
    ?event <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.cidoc-crm.org/cidoc-crm/E7_Activity> .
    ?event <http://purl.org/dc/terms/title> ?nl_title FILTER(LANG(?nl_title) = "nl").
    ?event <https://data.vlaanderen.be/ns/cultuurparticipatie#ruimtetijd>/<http://www.cidoc-crm.org/cidoc-crm/P161_has_spatial_projection> ?location .
    ?event <https://data.vlaanderen.be/ns/cultuurparticipatie#ruimtetijd>/<http://www.cidoc-crm.org/cidoc-crm/P160_has_temporal_projection> ?period .
    ?period <http://data.europa.eu/m8g/startTime> ?startTime .
    ?period <http://data.europa.eu/m8g/endTime> ?endTime FILTER(?endTime > (NOW()) ) .
  }
  GRAPH ?location {
    ?location <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/dc/terms/Location> .
    ?location <http://www.w3.org/ns/locn#locatorName> ?nl_locationName FILTER(LANG(?nl_locationName)) . 
    ?location <http://www.w3.org/ns/locn#address>/<http://www.w3.org/ns/locn#fullAddress> ?fulladdress FILTER(LANG(?fulladdress) = "nl" ) .
    ?location <http://www.w3.org/ns/locn#address>/<http://www.w3.org/ns/locn#postcode> ?postcode .
  }
}
GROUP BY ?event ?nl_title ?location ?nl_locationName ?postcode ?fulladdress
ORDER BY ?postcode ?maxEndTime 
LIMIT 1000

@hannahbast
Member

@Stiksels Thanks! I have just pushed a beta version with basic support for named graphs. You can try it with docker pull adfreiburg/qlever:named-graphs-beta and editing your Qleverfile to use that image.

And can you please provide the link to your dataset again? I was too late for each of your two posts regarding this in #1468.

@Stiksels
Author

@hannahbast with this Docker image, the IndexBuilder parses the triples significantly more slowly (0.2 M/s) than the image tagged latest (1.1 M/s):

named-graphs-beta

Command: index

echo '{ "locale": { "language": "nl", "country": "BE", "ignore-punctuation": true }, "ascii-prefixes-only": false, "num-triples-per-batch": 100000 }' > uit-activiteiten.settings.json
docker run --rm -u $(id -u):$(id -g) -v /etc/localtime:/etc/localtime:ro -v $(pwd):/index -w /index --init --entrypoint bash --name qlever.index.uit-activiteiten docker.io/adfreiburg/qlever:named-graphs-beta -c 'zcat data-nq/*.nq.gz | IndexBuilderMain -F nq - -i uit-activiteiten -s uit-activiteiten.settings.json --stxxl-memory 5G | tee uit-activiteiten.index-log.txt'

2024-09-28 09:19:29.834 - INFO: QLever IndexBuilder, compiled on Fri Sep 27 19:04:26 UTC 2024 using git hash 14e610
2024-09-28 09:19:29.838 - INFO: You specified the input format: NQ
2024-09-28 09:19:29.838 - INFO: Processing input triples from /dev/stdin ...
2024-09-28 09:19:29.840 - INFO: You specified "locale = nl_BE" and "ignore-punctuation = 1"
2024-09-28 09:19:29.840 - WARN: You are using Locale settings that differ from the default language or country.
        This should work but is untested by the QLever team. If you are running into unexpected problems,
        Please make sure to also report your used locale when filing a bug report. Also note that changing the
        locale requires to completely rebuild the index
2024-09-28 09:19:29.842 - INFO: You specified "parallel-parsing = true", which enables faster parsing for TTL files with a well-behaved use of newlines
2024-09-28 09:19:29.842 - INFO: You specified "num-triples-per-batch = 100,000", choose a lower value if the index builder runs out of memory
2024-09-28 09:19:29.842 - INFO: By default, integers that cannot be represented by QLever will throw an exception
2024-09-28 09:19:29.984 - INFO: Parsing input triples and creating partial vocabularies, one per batch ...
2024-09-28 09:20:30.469 - INFO: Triples parsed: 10,000,000 [average speed 0.2 M/s, last batch 0.2 M/s, fastest 0.2 M/s, slowest 0.2 M/s] 
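
The reported average can be sanity-checked against the log timestamps. The short Python sketch below takes the "Parsing input triples" line as the start and the "Triples parsed: 10,000,000" line as the end; treating those two lines as the measurement window is an assumption about how the average is computed.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the two log lines above.
start = datetime.strptime("2024-09-28 09:19:29.984", FMT)
end = datetime.strptime("2024-09-28 09:20:30.469", FMT)
triples = 10_000_000

elapsed = (end - start).total_seconds()
rate = triples / elapsed / 1e6  # million triples per second

print(f"{elapsed:.3f} s, {rate:.2f} M triples/s")
```

About 60.5 seconds for 10 million triples is roughly 0.17 M/s, consistent with the 0.2 M/s shown in the log after rounding to one decimal.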

@Stiksels
Author

> @Stiksels Thanks! I have just pushed a beta version with basic support for named graphs. You can try it with docker pull adfreiburg/qlever:named-graphs-beta and editing your Qleverfile to use that image.
>
> And can you please provide the link to your dataset again? I was too late for each of your two posts regarding this in #1468.

I uploaded the file to Google Drive and shared it with [email protected]. This is the download link:
https://drive.google.com/file/d/1neV60Tch4bWzkhPAxE4o195QxBqF6qNo/view?usp=sharing

@hannahbast
Member

@Stiksels I have now managed to download your dataset and build a QLever index for it in 2.5 minutes. The query uses the GRAPH variable also inside the GRAPH clause (?event for the first clause, ?location for the second). This looks like a very unusual pattern: a graph IRI is usually not also the subject of a triple. Are you sure this is what you intended?

The query returns 30,589 rows, the first one is as follows. Is this correct?

?event	?nl_title	?minStartTime	?maxEndTime	?location	?nl_locationName	?postcode	?fulladdress
<https://data.publiq.be/id/event/udb/4a91540c-e3b8-4f30-b051-673362b6f6fa>	"Pleun Van Engelen - Liefs, Achilles"@nl	2025-06-06T18:00:00+00:00	2025-06-06T19:30:00+00:00	<https://data.publiq.be/id/place/udb/1764be5c-544a-4fbc-ad5d-ebceb759a654>	"Op locatie"@nl	"-"	"- -, - -, BE"@nl
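
To make the pattern in question concrete, here is a small Python sketch (a toy model, not QLever code) of what it means for the graph variable to also be the subject: a triple only matches if it is stored in a graph whose IRI equals its own subject. The quads and the helper names `graph_var_is_subject` and `graph_var_separate` are made up for illustration.

```python
# Hypothetical quads: (subject, predicate, object, graph).
QUADS = [
    ("<event/1>", "<type>", "<Activity>", "<event/1>"),  # graph IRI == subject
    ("<event/2>", "<type>", "<Activity>", "<graph/A>"),  # graph IRI != subject
    ("<place/9>", "<type>", "<Location>", "<place/9>"),  # graph IRI == subject
]


def graph_var_is_subject(quads, predicate, obj):
    """Emulates GRAPH ?x { ?x <predicate> <obj> }: ?x is both graph and subject."""
    return [s for s, p, o, g in quads if p == predicate and o == obj and s == g]


def graph_var_separate(quads, predicate, obj):
    """Emulates GRAPH ?g { ?x <predicate> <obj> }: graph and subject bound independently."""
    return [(g, s) for s, p, o, g in quads if p == predicate and o == obj]


print(graph_var_is_subject(QUADS, "<type>", "<Activity>"))
print(graph_var_separate(QUADS, "<type>", "<Activity>"))
```

With the shared variable, `<event/2>` is dropped because its graph IRI differs from its subject. That is the intended behaviour only if every entity really is stored in a graph named after itself, as in the dataset described above.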

@Stiksels
Author

@hannahbast thanks for the feedback. I'll look into updating the graph IRI so that it differs from the subject IRI of the event/location inside.

Do you have any explanation for why there is such a difference in index-building speed between your setup and mine? With the named-graphs-beta image, parsing runs at a consistent 0.2 M/s. With the latest image, it is fast at 1.2 M/s, but I have never been able to build an index for this dataset in less than 5 minutes...

@Stiksels
Author

Stiksels commented Oct 2, 2024

With the latest Docker image, which contains #1520, the GRAPH and FROM NAMED clauses are now supported. I'll close this issue.

Stiksels closed this as completed on Oct 2, 2024