Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query result ordering for ungrouped models #128

Open
lu-pl opened this issue Oct 31, 2024 · 3 comments · May be fixed by #140
Open

Query result ordering for ungrouped models #128

lu-pl opened this issue Oct 31, 2024 · 3 comments · May be fixed by #140
Assignees

Comments

@lu-pl
Copy link
Contributor

lu-pl commented Oct 31, 2024

rdfproxy injects an ORDER BY solution modifier into SPARQL queries for grouped models, see #114.

The query constructor for SPARQL queries associated with ungrouped models currently does not inject an ORDER BY modifier - and I am not entirely decided if it even has to.

For a query associated with an ungrouped model

select * where {?s ?p ?o.}

the constructor for page 1 of size 2 currently constructs:

select * 
where {
    ?s ?p ?o .
}
limit 2
offset 0

I think that a purely SPARQL-based pagination mechanism (i.e. the pagination mechanism of rdfproxy) might not be able to guarantee static and reproducible page results without ORDER BY - but this also depends on the Triplestore and possible Triplestore indexing which might apply ordering across LIMIT/OFFSET queries.

Anyway, I think that queries associated with ungrouped models should have ORDER BY modifiers injected.

This raises two problems however:

  1. By which SPARQL variables to order by?

One solution to this is to simply ORDER BY all variables of the projection, utilizing the fact that the ORDER BY solution modifier takes a sequence of order comparators (SPARQL 1.1 Spec).

I.e. for the above query example the constructor should generate:

select * 
where {
    ?s ?p ?o .
}
order by ?s ?p ?o
limit 2
offset 0

The implications of this approach need discussion and careful evaluation though.

E.g. I am currently not sure if the rdflib implementation of SPARQL algebra allows to retrieve ordered variables for "SELECT *" clauses.

from rdflib.plugins.sparql import prepareQuery

def get_sparql_vars(query: str) -> str:
    prepared = prepareQuery(query)
    projection = prepared.algebra["PV"]

    return " ".join(map(lambda x: f"?{x}", projection))

get_sparql_vars("select * where {?s ?p ?o .}")

returns "?s ?o ?p".

How meaningful is the order of ORDER BY comparators??

  1. Performance

Unlike the ordering of grouped model queries, the SPARQL result set is not reduced by a subquery first.
This means that ordering ungrouped model queries means ordering over the entire graph.

Even for medium-sized graphs this can mean a notable performance penalty, especially when ordering by a sequence of comparators.

@lu-pl
Copy link
Contributor Author

lu-pl commented Nov 8, 2024

Note: Using (a variation of) the query constructor for grouped models is NOT an option!

The idea of the constructor for grouped models is to retrieve however many result rows there are for a limited number of entities (see #114) - this relies on ModelBindingsMapper to reduce those results to exactly the limited number of entities.

So using the query constructor for grouped models on an ungrouped model would certainly lead to incorrect pagination.

E.g. if one used the grouped model query constructor for a query select ?s ?p ?o where {?s ?p ?o .}, this would generate

select ?s ?p ?o
where {
    ?s ?p ?o .
    
    {
        select distinct ?s
        where {
            ?s ?p ?o .
        }
        order by ?s
	limit 10
	offset 10
    }
}

for page 2 of size 10 (if somehow ?s was specified as ordering variable).

However, since the ModelBindingsMapper code path for grouped models obviously won't be triggered for an ungrouped model definition (because it shouldn't!), this would actually relay however many result rows there are for the limited number of entities - which very likely won't coincide with the requested pagination size.

I think there is no way around ordering over the entire graph for ungrouped model definitions.
The above suggestion of injecting ORDER BY all variables of the projection (by default) is the correct way to implement this.

Backend implementers should be able to apply custom result ordering though. Whether an ORDER BY solution modifier should be permissible for incoming queries or custom ordering should be specified in the model_config needs discussion.

Also, I think ungrouped queries over large graphs are kind of an edge case for rdfproxy. This behavior could be achieved with naked SPARQL alone, all rdfproxy does in that case is to wrap the results in a Pydantic model (e.g. for FastAPI consumption).

@lu-pl
Copy link
Contributor Author

lu-pl commented Nov 11, 2024

The above suggestion of injecting ORDER BY all variables of the projection (by default) is the correct way to implement this.

This needs partial revision. Another - computationally less costly and probably better - default would be to order by the first SPARQL variable of the projection.

Given a query select * where { ?s ?p ?o . }, the rdfproxy query constructor should then generate:

select * 
where {
    ?s ?p ?o .
}
order by ?s
limit 2
offset 0

@lu-pl
Copy link
Contributor Author

lu-pl commented Nov 11, 2024

Possible implementation for extracting SPARQL projections:

def get_query_projection(query: str) -> list[str]:
    query = parseQuery(query)[1]
    try:
        projection = query["projection"]  # type: ignore
        return [attrgetter("var")(var) for var in projection]
    except KeyError:
        triples = query["where"]["part"][0]["triples"]  # type: ignore
        projection = dict.fromkeys(
            i for i in chain.from_iterable(triples) if isinstance(i, Variable)
        )
        return list(projection)

Interaction with raw pyparsing.ParseResults objects is kind of wonky though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant