Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement correct pagination behavior for grouped models #114

Closed
lu-pl opened this issue Oct 25, 2024 · 0 comments · Fixed by #125
Closed

Implement correct pagination behavior for grouped models #114

lu-pl opened this issue Oct 25, 2024 · 0 comments · Fixed by #125
Assignees
Labels
enhancement New feature or request

Comments

@lu-pl
Copy link
Contributor

lu-pl commented Oct 25, 2024

Pagination behavior for grouped models currently does not work correctly.

The reason for that is simple: I somehow never got around to implement it. All rdfproxy queries are currently generated using the query constructor for ungrouped models. 😬

Current behavior

Given the following example model definition

query = """
select ?parent ?child ?name
where {
    values (?parent ?child ?name) {
        ('x' 'c' 'foo')
        ('y' 'd' UNDEF)
        ('y' 'e' UNDEF)
        ('z' UNDEF UNDEF)
    }
}
"""

class Child(BaseModel):
    name: str | None = None

class Parent(BaseModel):
    model_config = ConfigDict(group_by="parent")

    parent: str
    children: list[Child]

adapter = SPARQLModelAdapter(
    target="https://query.wikidata.org/bigdata/namespace/wdq/sparql",
    query=query,
    model=Parent,
)

app = FastAPI()

@app.get("/")
def base_route(page: int = 1, size: int = 100) -> Page[Parent]:
    return adapter.query(page=page, size=size)

paginating for page 1 of size 2 (http://localhost:8000/?page=1&size=2) will yield

{

      "items": [
            {
                  "parent": "x",
                  "children": [
                        {
                              "name": "foo"
                        }
                  ]
            },
            {
                  "parent": "y",
                  "children": [ ]
            }
      ],
      "page": 1,
      "size": 2,
      "total": 3,
      "pages": 2

}

paginating for page 2 of size 2 (http://localhost:8000/?page=2&size=2) will yield

{

      "items": [
            {
                  "parent": "y",
                  "children": [ ]
            },
            {
                  "parent": "z",
                  "children": [ ]
            }
      ],
      "page": 2,
      "size": 2,
      "total": 3,
      "pages": 2
}

-- both of which is incorrect.

Expected behavior

The expected result for paginating for page 1 of size 2 (http://localhost:8000/?page=1&size=2) is

{

      "items": [
            {
                  "parent": "x",
                  "children": [
                        {
                              "name": "foo"
                        }
                  ]
            },
            {
                  "parent": "y",
                  "children": [ ]
            },
      ],
      "page": 1,
      "size": 2,
      "total": 3,
      "pages": 2

}

and for paginating for page 2 of size 2 (http://localhost:8000/?page=2&size=2):

{

      "items": [
            {
                  "parent": "z",
                  "children": [ ]
            }
      ],
      "page": 2,
      "size": 2,
      "total": 3,
      "pages": 2

}

Solution proposal

Paginating grouped models correctly, requires a specific query constructor that constructs and injects a subquery (based on the original query) into the original query.

This touches upon a very fundamental problem rdfproxy aims to remedy: SPARQL result sets are inherently flat, i.e. for example in the README example. 6 SPARQL result rows are returned for 3 entities.

So the query from which grouped rdfproxy models obtain their input must return however many result rows there are for a limited number of entities.

E.g the following toy query

select ?parent ?child ?name
where {
    values (?parent ?child ?name) {
        ('x' 'c' 'foo')
        ('y' 'd' UNDEF)
        ('y' 'e' UNDEF)
        ('z' UNDEF UNDEF)
    }
}

returns 4 rows for 3 entities (given that an entity is defined by parent). If this is naively limited/offset with size 2, then the SPARQLBindingsMapper will only receive the first two result rows, ('x' 'c' 'foo') and ('y' 'd' UNDEF), so the other component of the y entity, ('y' 'e' UNDEF), simply won't be in the result set of the modified query, leading to the faulty behavior.

In order to retrieve however many result rows there are for a limited number of entities, a limited set of entities must be projected from a subquery into the scope of an outer query.

E.g. based on the above query, the rdfproxy query constructor needs to dynamically generate the following query:

select ?parent ?child ?name
where {
    values (?parent ?child ?name) {
        ('x' 'c' 'foo')
        ('y' 'd' UNDEF)
        ('y' 'e' UNDEF)
        ('z' UNDEF UNDEF)
    }

  {
    select distinct ?parent
    where {
      values (?parent ?child ?name) {
        ('x' 'c' 'foo')
        ('y' 'd' UNDEF)
        ('y' 'e' UNDEF)
        ('z' UNDEF UNDEF)
      }
    }
    order by ?parent
    limit 2
    offset 0
  }
}

Note

Although the above described feature is certainly doable, I feel like dynamic query generation is the most difficult and brittle part of rdfproxy.

However, I see no way around dynamic query generation, so that is alright.

Especially query generators need extensive testing though and it might be necessary to use more powerful templating facilities (Jinja) for dynamic query constructors in the future.

@lu-pl lu-pl self-assigned this Oct 25, 2024
@lu-pl lu-pl added bug Something isn't working enhancement New feature or request and removed bug Something isn't working labels Oct 29, 2024
lu-pl added a commit that referenced this issue Oct 29, 2024
lu-pl added a commit that referenced this issue Oct 29, 2024
lu-pl added a commit that referenced this issue Oct 30, 2024
lu-pl added a commit that referenced this issue Oct 30, 2024
lu-pl added a commit that referenced this issue Nov 7, 2024
The get_items_query_constructor should be implemented using match/case; in
the very likely case that additional query constructors will be
needed (e.g. for ordering/filtering etc.), the pattern
matching mechanism for model_config provides a pluggable design which allows for easy extension.

Closes #114.
lu-pl added a commit that referenced this issue Nov 8, 2024
lu-pl added a commit that referenced this issue Nov 8, 2024
lu-pl added a commit that referenced this issue Nov 8, 2024
lu-pl added a commit that referenced this issue Nov 8, 2024
lu-pl added a commit that referenced this issue Nov 8, 2024
lu-pl added a commit that referenced this issue Nov 8, 2024
@lu-pl lu-pl closed this as completed in #125 Nov 8, 2024
@lu-pl lu-pl closed this as completed in 386d990 Nov 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant