
Gravsearch query distinct property values #1319

Open
gfoo opened this issue May 14, 2019 · 31 comments
Assignees
Labels
Gravsearch question further information is requested

Comments

@gfoo

gfoo commented May 14, 2019

Is there a more effective way to get the list of distinct values of a property (hasJournalTitle) than paging through all the results?
Currently I do this in the background on startup and it works fine; the list is used in a combo box (progressively filled) of an advanced search UI. It takes about 1 min to cover about 200 pages, and there are many duplicates. Maybe there is a way to reduce the number of pages?

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ll: <http://0.0.0.0:3333/ontology/0113/lumieres-lausanne/simple/v2#>
CONSTRUCT {
  ?BibliographicNotice knora-api:isMainResource true .
  ?BibliographicNotice ll:hasJournalTitle ?hasJournalTitle .
} WHERE {
  ?BibliographicNotice a knora-api:Resource .
  ?BibliographicNotice a ll:BibliographicNotice .  
  ?BibliographicNotice ll:hasJournalTitle ?hasJournalTitle .
}
ORDER BY ASC(?hasJournalTitle)
@gfoo gfoo added question further information is requested Gravsearch labels May 14, 2019
@gfoo gfoo changed the title Gravsearch query distinct properties value Gravsearch query distinct property values May 14, 2019
@benjamingeer

benjamingeer commented May 14, 2019

One possibility would be to normalise your data model. You could have a Journal resource class, with one instance for each distinct title. Instead of the hasJournalTitle property, you would have a hasJournal property, pointing to the IRI of a Journal. Then you could get all the distinct journal titles by querying all the Journal instances.
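For illustration, the normalised model could be queried with something like the following. This is a sketch only: the `ll:Journal` class and `ll:hasTitle` property are hypothetical and do not exist in the current ontology.

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ll: <http://0.0.0.0:3333/ontology/0113/lumieres-lausanne/simple/v2#>

# Hypothetical: assumes the ontology defines ll:Journal and ll:hasTitle.
CONSTRUCT {
  ?journal knora-api:isMainResource true .
  ?journal ll:hasTitle ?title .
} WHERE {
  ?journal a knora-api:Resource .
  ?journal a ll:Journal .
  ?journal ll:hasTitle ?title .
}
ORDER BY ASC(?title)
```

With one Journal instance per distinct title, a few pages of results would cover all of them.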

@gfoo
Author

gfoo commented May 14, 2019

Yep, we thought about that, or about using a list, but before adding one more resource class I wanted to know whether I was missing a solution with our current model, because I've seen that the DISTINCT keyword can be used with CONSTRUCT: https://stackoverflow.com/questions/55266711/sparql-construct-with-distinct

@gfoo
Author

gfoo commented May 14, 2019

Anyway, my current solution works fine; there is a very low probability that a user wants the full list before the app has finished loading all the data.

@benjamingeer

In SPARQL, DISTINCT can be used in a SELECT subquery nested in a CONSTRUCT query, but Gravsearch doesn't support subqueries.
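For reference, the pattern that works in plain SPARQL (but not in Gravsearch) looks like this; the blank node in the CONSTRUCT template yields one triple per distinct title:

```sparql
PREFIX ll: <http://0.0.0.0:3333/ontology/0113/lumieres-lausanne/simple/v2#>

# Plain SPARQL only, NOT valid Gravsearch: DISTINCT in a nested SELECT.
CONSTRUCT {
  [] ll:hasJournalTitle ?hasJournalTitle .
} WHERE {
  SELECT DISTINCT ?hasJournalTitle WHERE {
    ?notice a ll:BibliographicNotice .
    ?notice ll:hasJournalTitle ?hasJournalTitle .
  }
}
```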

Doing a query that takes a minute per user session seems like it's going to put an unnecessary load on the triplestore.

There could also be other advantages to normalising the data model; you might want to store other information about these journals.

@mrivoal

mrivoal commented May 14, 2019

Hi Ben.

The current data model is the result of migrating a not-so-great SQL data model in the first place. Given the amount of data, the burden of the import, the amount of time we have already spent on this project, and the way the current users see their database, normalising the data model is not the best option, even if, model-wise, it makes sense, of course.

@gfoo
Author

gfoo commented May 14, 2019

Doing a query that takes a minute per user session seems like it's going to put an unnecessary load on the triplestore.

No, 200 queries (pages). I could also add some delay between queries to let the triplestore catch its breath :)

@benjamingeer

I'm just saying that those 200 queries have to be run at the start of every user session, which could consume a lot of resources if there are a lot of users (e.g. a classroom full of students all opening the same page at the same time). Of course, I understand that you might not have time to do this now, but I think it would make sense to do it.

@benjamingeer

the way the current users see their database

If you overload the triplestore with queries, they might not see anything at all...

@mrivoal

mrivoal commented May 14, 2019

Amongst the parameters that I bear in mind when designing a data model, I have to say that I never thought about not overloading the triplestore.

The data model has to mean something eventually for the user, conceptually speaking, it also has to mean something in a generic GUI, and users have to understand it well enough to make some queries. And if they don't want to attach any metadata to a journal, then I think there is no point introducing a new class which will make the data model even more (and unnecessarily) complicated.

I think we should look for another workaround here.

@benjamingeer

Amongst the parameters that I bear in mind when designing a data model, I have to say that I never thought about not overloading the triplestore.

This is absolutely something you must bear in mind all the time. If a data model requires queries that will be too slow, or will consume excessive resources, it has to be designed differently. You're going to be sharing the DaSCH triplestore with all other projects in the DaSCH. We in the DaSCH have to make sure that no project is going to consume excessive resources, which could cause problems for everyone else.

I feel like I keep repeating this a lot: triplestore CPU cycles are a scarce resource. They have to be used very frugally.

The data model has to mean something eventually for the user

If a user can understand the concept of a journal title, it seems to me that they can certainly understand the concept of a journal.

@benjamingeer

Just to put this in context: in the past 5 years, the Knora developers have spent a lot of time working on improving the performance of SPARQL queries, and on implementing things in Knora in ways that will allow it to scale (e.g. by caching query results). That implementation work is one part of the solution. Another part is the money spent on triplestore licences, and perhaps eventually a triplestore cluster. But projects have to be involved in thinking about how to make their code scalable, too. All these things together are necessary to make the DaSCH infrastructure perform well for everyone.

We have to conserve resources wherever we can, especially because we're building a system that responds in real time. Your bank doesn't process your transactions in real time; it processes everything in a batch job late at night, because that's more efficient. We will probably have to do that for some kinds of operations, too (e.g. bulk imports). But we need a system that can do a lot of interesting work in real time, and that means we have to be very careful about designing requests so they can scale.

Sometimes that will mean eliminating redundancy in a data model. If we can replace 200 queries with 20 queries, that seems like a big win to me.

@mrivoal

mrivoal commented May 14, 2019

I am still not convinced that technical limitations should drive and restrain the data model of a research project...

Besides, there is often a gap between the ideal data model that you may have in mind for a project and its implementation because, eventually, the researcher is the one who decides how his data will be shaped (or he should at least share your views on the model...). I see myself as enforcing good practices insofar as these good practices don't interfere with the research process, but it is a negotiation. If good practices interfere too much, researchers choose other software. I understand your point of view, but you should also understand mine: data models need to get the approval of researchers, or they will walk away.

But projects have to be involved in thinking about how to make their code scalable, too.

Yes, but you should know that research teams' understanding of this matter can be limited. And forcing a research project to use a well-structured bibliography model is no small thing, especially when the project is already a complicated one.

@benjamingeer

benjamingeer commented May 14, 2019

I am still not convinced that technical limitations should drive and restrain the data model of a research project...

Ask any experienced software developer who has worked on database design. Database design is always done with performance considerations in mind. Do a Google search for "database design performance" and you will find a ton of material on this. It doesn't matter whether your database contains research data, financial data, or biomedical data: if your data model doesn't allow your queries to run efficiently, your project will fail.

it is a negotiation

Absolutely. But researchers need to understand that there are technical constraints with all software systems, and that they are sharing infrastructure with many other people. I encourage you and the researchers to think of this as something like an environmental issue. No one can be allowed to pollute the environment so much that it becomes uninhabitable for everyone else.

If you make too many requests per second to Facebook's API, Facebook will block your requests. This is called "rate limiting". Knora doesn't currently implement rate limiting, but perhaps we will have to. And we are not Facebook, with practically infinite resources to spend. We are doing the best we can with the resources we have.

If good practices interfere too much

You have not explained to me how introducing the concept of "Journal" along with "Journal title" would interfere with anything.

forcing a research project to use a well-structured bibliography model is no small thing

I'm not asking for that. I'm just asking you to avoid making 200 queries that return mostly redundant data, which is just basic good software development practice. I don't think that's asking for the moon.

@gfoo
Author

gfoo commented May 14, 2019

there are technical constraints with all software systems,

For sure; even with a relational database we would probably have to add an index and/or a cache to get acceptable performance.

The dev context is difficult for everyone. We don't have the resources and time to migrate the model and data from a finely tuned database to Knora and at the same time upgrade their complex web app to modern UX design concepts and frameworks. So we preferred to reproduce the same old-fashioned web site with a decent framework (Angular), even if it means finding ugly (temporary) workarounds from time to time :)

That being said, back to my original problem: does it make sense to fetch all these values? No, it's a workaround and a rather clumsy solution. Out of about 5,000 values (200 pages × 25), I keep only 256 distinct values, multi-selectable in a combo box. Not to mention the fact that nobody will read all those values! The only solution I see is to provide a search text field that fills the combo box according to what is typed (imitating the Salsah way of linking resources). But as usual, it takes a little longer to develop.

This is probably the price to pay in a modern web app: serving lots of data through a smooth and reactive user interface, all with just one developer. That is almost the moon ;-)

@musicEnfanthen
Contributor

In SPARQL, DISTINCT can be used in a SELECT subquery nested in a CONSTRUCT query, but Gravsearch doesn't support subqueries.

Is there a special reason for that? Using DISTINCT in a SELECT subquery seems to reduce the resource cost very effectively, by a factor of 20 here (going from 5,000 values (200 pages) to only 256 results (11 pages)).

@benjamingeer

Is there a special reason for that?

I would guess it’s because every CONSTRUCT result is a set of triples, and the members of a set are distinct by definition.

@benjamingeer

The only solution I see is to provide a search textfield filling the combo box according to the textfield (imitating the Salsah way to link resources). But as usual, it takes a little longer to develop.

If it helps, there's an API v2 route for this:

https://docs.knora.org/paradox/03-apis/api-v2/reading-and-searching-resources.html#search-for-a-resource-by-its-rdfs-label
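The route takes the URL-encoded search term in the path, with optional parameters to restrict the results; roughly along these lines (the parameter name shown is from memory of the docs, so check the linked page for the exact form):

```
GET /v2/searchbylabel/{searchTerm}?limitToResourceClass={resourceClassIri}
```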

This is probably the cost to pay in modern web app, provides many many data with smooth and reactive user interface, and all that with just one developer. This is almost the moon ;-)

I understand completely. I'm not saying you have to do this now, I'm just saying please do it later when you have time.

@loicjaouen
Contributor

A DISTINCT would still be really helpful. We have another project that requires it.

It has a resource with a date property, and one thousand occurrences of this resource spanning two hundred years.
To present the resources, the team wants to display the years that have occurrences of the said resource.

It doesn't seem reasonable to ask them to refactor the ontology and split DateValue by year to achieve that.

So we would either have to list the 1k occurrences through 40 requests paged by 25, make 200 count requests (one per year), or pre-compute the result (have a cron job do that regularly).

I'm juggling with ideas here because this won't be the last such case, so a recommendation is welcome :)

@benjamingeer

This sounds like an interesting use case that deserves some thought. In any case, though, I think DISTINCT isn’t what you want. In SPARQL, DISTINCT exists only for SELECT queries, not for CONSTRUCT queries, which always return a set of distinct triples. Since a Gravsearch query is a SPARQL CONSTRUCT query, whatever solution we find will have to fit within the syntax of a CONSTRUCT query.

I wonder if we could do something with VALUES, so you could supply all the years in one query.
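Sketching that idea (not valid Gravsearch today: VALUES isn't supported, and comparing a date variable with another variable isn't either; the date literals use the simple schema's knora-api:Date syntax):

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX beol: <http://api.dasch.swiss/ontology/0801/beol/simple/v2#>

# Sketch only: VALUES is not currently supported in Gravsearch.
CONSTRUCT {
  ?letter knora-api:isMainResource true .
  ?letter beol:creationDate ?date .
} WHERE {
  ?letter a beol:basicLetter .
  ?letter beol:creationDate ?date .
  VALUES ?year {
    "GREGORIAN:1700"^^knora-api:Date
    "GREGORIAN:1701"^^knora-api:Date
  }
  FILTER(?date = ?year)
}
```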

I guess by “resource” you mean “resource class”?

@loicjaouen
Contributor

I guess by “resource” you mean “resource class”?

yes, translating to a BEOL resource class, it could be:

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX beol: <http://api.dasch.swiss/ontology/0801/beol/v2#>

CONSTRUCT {
  ?letter knora-api:isMainResource true .
  ?letter beol:creationDate ?date .
} WHERE {
  ?letter a knora-api:Resource .
  ?letter a beol:basicLetter .
  ?letter beol:creationDate ?date .
}

@benjamingeer

benjamingeer commented Dec 3, 2019

After thinking about this some more, I think you have two problems here:

  1. You want years, which are not actually stored in your resources. Gravsearch can't give you this, because Gravsearch just returns what's in the resource.
  2. Knora stores dates as Julian Day Numbers, and there is no way to extract the year from the Julian Day Number in SPARQL; this requires a complex astronomical calculation, which can only be done in Knora itself.

I think the only possible solution is to add the year (represented as an integer) to the resource class.

@loicjaouen
Contributor

@benjamingeer thanks for your answer.

  2. Knora stores dates as Julian Day Numbers

so we have to request the resource, and then Knora returns a representation of the value.
That representation includes DateValue --(knora-api:dateValueHasStartYear)--> integer.
But that is only because Knora computes it on the fly; the year property is not stored as such in the triple store.
It is not quite rocket science, but already a complex astronomical calculation.

  1. Gravsearch can't give you [year], because Gravsearch just returns what's in the resource

so @tobiasschweizer told me, but thinking about it: if the year were actually stored in the triple store as such, Gravsearch queries can traverse the graph, so why would the year be one step too far in the path below?

beol:letter --(beol:creationDate)-> DateValue --(knora-api:dateValueHasStartYear)--> integer

@loicjaouen
Contributor

Concerning adding a year property: it would be redundant and therefore prone to discrepancies.
Or we could add a script that fills that field automatically.
The script could live within Knora, but I am pretty sure we don't want a stored-procedure mechanism; that would bring project-dependent code inside Knora, with extra maintenance effort.
But if the script lives outside Knora, we can't guarantee that the script will last, and thus that the data will remain consistent.

@benjamingeer

if the year were actually stored in the triple store as such, Gravsearch queries can traverse the graph, so why would the year be one step too far in the path below?

That would require us to store it inside the DateValue, which I don't think we want to do, at least not yet. That would introduce the redundancy into knora-base, and require a lot of changes to existing code, all for (so far) only one project.

Knora's dates are designed to facilitate use cases where you want to search for dates and get resources. This has proved to be a very common requirement. But so far no one else has asked to search for resources and get parts of dates (the year, the month, or the day). I'm not saying it's an unreasonable request, but I think we can't yet justify a major redesign to support it.

In any case, this by itself wouldn't solve your problem, because Gravsearch would still return the whole DateValue, containing the whole date rather than just the year. You still wouldn't get a response containing a list of years.

Concerning adding a year property: it would be redundant and therefore prone to discrepancies.

How about making it a required property, filled in automatically by your GUI when the date value is created? That would prevent redundancies as long as the project is active. If the project is no longer active, nobody will be creating new date values in it anyway, so there will also be no new discrepancies.

@benjamingeer

Actually, even putting the year in the resource isn't going to give you a list of years, because Gravsearch will still return a list of resources. I guess you could make a resource class Year, with one instance per year, and have each resource link to a Year.

@loicjaouen
Contributor

How about making it a required property, filled in automatically by your GUI when the date value is created?

for this project, for editing, we have no other GUI but Salsah.

@benjamingeer

I have an idea. If you made a GregorianYear class, with one instance per year (representing the first and last days of the year, using YEAR precision) and if we supported comparing a date variable with another date variable (instead of just with a literal), you could do something like this:

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX example: <http://example.org/ontology/1234/example/simple/v2#>

CONSTRUCT {
  ?year knora-api:isMainResource true .
  ?year example:hasYearNumber ?yearNum .
} WHERE {
  ?resource a example:MyResourceClass .
  ?year a example:GregorianYear .
  ?resource example:hasDate ?resourceDate . # a knora-api:DateValue
  ?year example:hasDate ?yearDate . # a knora-api:DateValue
  FILTER(?resourceDate = ?yearDate) # Comparison not yet supported
  ?year example:hasYearNumber ?yearNum . # an integer
}

The FILTER would match if there was any overlap between ?resourceDate and ?yearDate.
This would return a list of Year resources, each with its integer year. You could create the Year resources once, and there would be nothing to maintain for each main resource.

This would require some work in the Gravsearch transpiler, but not a major redesign. @tobiasschweizer what do you think?

@musicEnfanthen
Contributor

But so far no one else has asked to search for resources and get parts of dates (the year, the month, or the day).

Sorry to jump in, but this is kind of a non-argument. A pretty simple, and very common, use case for this kind of request would be a timeline. Asking for a resource, or even all resources, in a given dataset for a certain year, a certain month or day, or even a certain period of time (years, months, days) should be possible without "complex astronomical calculations" on the side of the end user.

In our first Webern prototype (Angular 1), I had included a timeline on a per-day basis (asking for all events that occurred in a given time span, Webern's lifetime, on the same day as the current date, a kind of "On this day in history...").
I did so by reimplementing the JDN calculation algorithm from Salsah as an Angular 1 plugin, calculating the JDN for, let's say, every 3rd of December (Webern's birthday ;) between 1883 and 1945, and then sending the list of JDNs to the API to get back the list of resources. I guess this approach creates a lot of traffic, and even more if your request is on a per-year basis (generating a lot more JDNs). So far, I have not managed to port this functionality to our new prototype (Angular 2), but it is definitely a wanted feature.
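For the record, the Gregorian-to-JDN conversion itself is the standard Fliegel–Van Flandern integer formula (all divisions truncate toward zero), which is why a client-side reimplementation stays short:

```latex
a = \operatorname{trunc}\!\left(\frac{M - 14}{12}\right), \qquad
\mathrm{JDN} = \operatorname{trunc}\!\left(\frac{1461\,(Y + 4800 + a)}{4}\right)
  + \operatorname{trunc}\!\left(\frac{367\,(M - 2 - 12a)}{12}\right)
  - \operatorname{trunc}\!\left(\frac{3\operatorname{trunc}\!\left(\frac{Y + 4900 + a}{100}\right)}{4}\right)
  + D - 32075
```

For example, 1 January 2000 (Y=2000, M=1, D=1) gives JDN 2451545.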

@benjamingeer

@musicEnfanthen

Asking for a resource, or even all resources, in a given dataset for a certain year, a certain month or day, or even a certain period of time (years, months, days) should be possible without "complex astronomical calculations" on the side of the end user.

That's already supported in Gravsearch, but it's not what @loicjaouen is asking for. He wants a list of all the years that exist in all his resources. It's the opposite of what you're talking about.

@benjamingeer

Another idea about this: we could add support for SELECT in Gravsearch queries. Then you could do the query with DISTINCT the way you want to.

@subotic subotic added this to the Backlog milestone Feb 7, 2020
@irinaschubert irinaschubert removed this from the Backlog milestone Dec 9, 2021
8 participants