
Gravsearch query distinct property values #1319

Open
gfoo opened this issue May 14, 2019 · 31 comments
Assignees
Labels
Gravsearch question further information is requested

Comments

@gfoo

gfoo commented May 14, 2019

Is there a more effective way to get the list of distinct values of a property (hasJournalTitle) than paging through all the results?
Currently I do this in the background on startup and it works fine; the list is used in a combo box (progressively filled) of an advanced search UI. It takes about 1 min to cover about 200 pages, and there are many duplicates. Maybe there is a way to reduce the number of pages?

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ll: <http://0.0.0.0:3333/ontology/0113/lumieres-lausanne/simple/v2#>
CONSTRUCT {
  ?BibliographicNotice knora-api:isMainResource true .
  ?BibliographicNotice ll:hasJournalTitle ?hasJournalTitle .
} WHERE {
  ?BibliographicNotice a knora-api:Resource .
  ?BibliographicNotice a ll:BibliographicNotice .  
  ?BibliographicNotice ll:hasJournalTitle ?hasJournalTitle .
}
ORDER BY ASC(?hasJournalTitle)
@gfoo gfoo added question further information is requested Gravsearch labels May 14, 2019
@gfoo gfoo changed the title Gravsearch query distinct properties value Gravsearch query distinct property values May 14, 2019
@benjamingeer

benjamingeer commented May 14, 2019

One possibility would be to normalise your data model. You could have a Journal resource class, with one instance for each distinct title. Instead of the hasJournalTitle property, you would have a hasJournal property, pointing to the IRI of a Journal. Then you could get all the distinct journal titles by querying all the Journal instances.
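For illustration, the normalised model could be queried with something like the following. This is a sketch only: the `ll:Journal` class and `ll:hasTitle` property are hypothetical and do not exist in the current ontology.

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ll: <http://0.0.0.0:3333/ontology/0113/lumieres-lausanne/simple/v2#>

# Hypothetical: assumes the ontology defines ll:Journal and ll:hasTitle.
CONSTRUCT {
  ?journal knora-api:isMainResource true .
  ?journal ll:hasTitle ?title .
} WHERE {
  ?journal a knora-api:Resource .
  ?journal a ll:Journal .
  ?journal ll:hasTitle ?title .
}
ORDER BY ASC(?title)
```

With one Journal instance per distinct title, a few pages of results would cover all of them.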

@gfoo
Author

gfoo commented May 14, 2019

Yep, we thought about that, or about using a list, but before adding one more resource class I wanted to know whether I was missing a solution with our current model, because I've seen that the DISTINCT keyword can be used with CONSTRUCT: https://stackoverflow.com/questions/55266711/sparql-construct-with-distinct

@gfoo
Author

gfoo commented May 14, 2019

Anyway, my current solution works fine; there is a very low probability that a user wants the full list before the app has finished loading all the data.

@benjamingeer

In SPARQL, DISTINCT can be used in a SELECT subquery nested in a CONSTRUCT query, but Gravsearch doesn't support subqueries.
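For reference, the pattern that works in plain SPARQL (but not in Gravsearch) looks like this; the blank node in the CONSTRUCT template yields one triple per distinct title:

```sparql
PREFIX ll: <http://0.0.0.0:3333/ontology/0113/lumieres-lausanne/simple/v2#>

# Plain SPARQL only, NOT valid Gravsearch: DISTINCT in a nested SELECT.
CONSTRUCT {
  [] ll:hasJournalTitle ?hasJournalTitle .
} WHERE {
  SELECT DISTINCT ?hasJournalTitle WHERE {
    ?notice a ll:BibliographicNotice .
    ?notice ll:hasJournalTitle ?hasJournalTitle .
  }
}
```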

Doing a query that takes a minute per user session seems like it's going to put an unnecessary load on the triplestore.

There could also be other advantages to normalising the data model; you might want to store other information about these journals.

@mrivoal

mrivoal commented May 14, 2019

Hi Ben.

The current data model is the result of migrating a not-so-great SQL data model in the first place. Given the amount of data, the burden of the import, the amount of time we have already spent on this project, and the way the current users see their database, normalising the data model is not the best option, even if, model-wise, it makes sense, of course.

@gfoo
Author

gfoo commented May 14, 2019

Doing a query that takes a minute per user session seems like it's going to put an unnecessary load on the triplestore.

No, 200 queries (pages). I could also add some delay between queries to let the triplestore catch its breath :)

@benjamingeer

I'm just saying that those 200 queries have to be run at the start of every user session, which could consume a lot of resources if there are a lot of users (e.g. a classroom full of students all opening the same page at the same time). Of course, I understand that you might not have time to do this now, but I think it would make sense to do it.

@benjamingeer

the way the current users see their database

If you overload the triplestore with queries, they might not see anything at all...

@mrivoal

mrivoal commented May 14, 2019

Amongst the parameters that I bear in mind when designing a data model, I have to say that I never thought about not overloading the triplestore.

The data model has to mean something eventually for the user, conceptually speaking, it also has to mean something in a generic GUI, and users have to understand it well enough to make some queries. And if they don't want to attach any metadata to a journal, then I think there is no point introducing a new class which will make the data model even more (and unnecessarily) complicated.

I think we should look for another workaround here.

@benjamingeer

Amongst the parameters that I bear in mind when designing a data model, I have to say that I never thought about not overloading the triplestore.

This is absolutely something you must bear in mind all the time. If a data model requires queries that will be too slow, or will consume excessive resources, it has to be designed differently. You're going to be sharing the DaSCH triplestore with all other projects in the DaSCH. We in the DaSCH have to make sure that no project is going to consume excessive resources, which could cause problems for everyone else.

I feel like I keep repeating this a lot: triplestore CPU cycles are a scarce resource. They have to be used very frugally.

The data model has to mean something eventually for the user

If a user can understand the concept of a journal title, it seems to me that they can certainly understand the concept of a journal.

@benjamingeer

Just to put this in context: in the past 5 years, the Knora developers have spent a lot of time working on improving the performance of SPARQL queries, and on implementing things in Knora in ways that will allow it to scale (e.g. by caching query results). That implementation work is one part of the solution. Another part is the money spent on triplestore licences, and perhaps eventually a triplestore cluster. But projects have to be involved in thinking about how to make their code scalable, too. All these things together are necessary to make the DaSCH infrastructure perform well for everyone.

We have to conserve resources wherever we can, especially because we're building a system that responds in real time. Your bank doesn't process your transactions in real time; it processes everything in a batch job late at night, because that's more efficient. We will probably have to do that for some kinds of operations, too (e.g. bulk imports). But we need a system that can do a lot of interesting work in real time, and that means we have to be very careful about designing requests so they can scale.

Sometimes that will mean eliminating redundancy in a data model. If we can replace 200 queries with 20 queries, that seems like a big win to me.

@mrivoal

mrivoal commented May 14, 2019

I am still not convinced that technical limitations should drive and restrain the data model of a research project...

Besides, there is often a gap between the ideal data model that you may have in mind for a project and its implementation because, eventually, the researcher is the one who decides how his data will be shaped (or he should at least share your views on the model...). I see myself as enforcing good practices insofar as these good practices don't interfere with the research process, but it is a negotiation. If good practices interfere too much, researchers choose other software. I understand your point of view, but you should also understand mine: data models need to get the approval of researchers, or they will walk away.

But projects have to be involved in thinking about how to make their code scalable, too.

Yes, but you should know that research teams' understanding of this matter can be limited. And forcing a research project to use a well-structured bibliography model is no small thing, especially when the project is already a complicated one.

@benjamingeer

benjamingeer commented May 14, 2019

I am still not convinced that technical limitations should drive and restrain the data model of a research project...

Ask any experienced software developer who has worked on database design. Database design is always done with performance considerations in mind. Do a Google search for "database design performance" and you will find a ton of material on this. It doesn't matter whether your database contains research data, financial data, or biomedical data: if your data model doesn't allow your queries to run efficiently, your project will fail.

it is a negotiation

Absolutely. But researchers need to understand that there are technical constraints with all software systems, and that they are sharing infrastructure with many other people. I encourage you and the researchers to think of this as something like an environmental issue. No one can be allowed to pollute the environment so much that it becomes uninhabitable for everyone else.

If you make too many requests per second to Facebook's API, Facebook will block your requests. This is called "rate limiting". Knora doesn't currently implement rate limiting, but perhaps we will have to. And we are not Facebook, with practically infinite resources to spend. We are doing the best we can with the resources we have.

If good practices interfere too much

You have not explained to me how introducing the concept of "Journal" along with "Journal title" would interfere with anything.

forcing a research project to use a well-structured bibliography model is no small thing

I'm not asking for that. I'm just asking you to avoid making 200 queries that return mostly redundant data, which is just basic good software development practice. I don't think that's asking for the moon.

@gfoo
Author

gfoo commented May 14, 2019

there are technical constraints with all software systems,

For sure; even with a relational database we would probably have to add an index and/or a cache to get acceptable performance.

The dev context is difficult for everyone. We don't have the resources and time to migrate the model and data from a finely tuned database to Knora and at the same time upgrade their complex web app to modern UX design concepts and frameworks. So we preferred to reproduce the same old-fashioned web site with a decent framework (Angular), even if it means finding ugly (temporary) workarounds from time to time :)

That being said, back to my original problem: does it make sense to fetch all these values? No, it's a workaround and a rather clumsy solution. Out of about 5,000 values (200 pages × 25), I keep only 256 distinct values, multi-selectable in a combo box. Not to mention the fact that nobody will read all those values! The only solution I see is to provide a search text field that fills the combo box according to what is typed (imitating the Salsah way of linking resources). But as usual, it takes a little longer to develop.

This is probably the price to pay in a modern web app: serving lots of data through a smooth and reactive user interface, all with just one developer. That is almost the moon ;-)

@musicEnfanthen
Contributor

In SPARQL, DISTINCT can be used in a SELECT subquery nested in a CONSTRUCT query, but Gravsearch doesn't support subqueries.

Is there a special reason for that? Using DISTINCT in a SELECT subquery seems to reduce the resource cost very effectively, by a factor of 20 here (going from 5,000 values (200 pages) to only 256 results (11 pages)).

@benjamingeer

Is there a special reason for that?

I would guess it’s because every CONSTRUCT result is a set of triples, and the members of a set are distinct by definition.

@benjamingeer

The only solution I see is to provide a search textfield filling the combo box according to the textfield (imitating the Salsah way to link resources). But as usual, it takes a little longer to develop.

If it helps, there's an API v2 route for this:

https://docs.knora.org/paradox/03-apis/api-v2/reading-and-searching-resources.html#search-for-a-resource-by-its-rdfs-label
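The route takes the URL-encoded search term in the path, with optional parameters to restrict the results; roughly along these lines (the parameter name shown is from memory of the docs, so check the linked page for the exact form):

```
GET /v2/searchbylabel/{searchTerm}?limitToResourceClass={resourceClassIri}
```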

This is probably the cost to pay in modern web app, provides many many data with smooth and reactive user interface, and all that with just one developer. This is almost the moon ;-)

I understand completely. I'm not saying you have to do this now, I'm just saying please do it later when you have time.

@loicjaouen
Contributor

A DISTINCT would still be really helpful. We have another project that requires it.

It has a resource with a date property, and one thousand occurrences of this resource spanning two hundred years.
To present the resources, the team wants to display the years that have occurrences of the said resource.

It doesn't seem reasonable to ask them to refactor the ontology and split DateValue by year to achieve that.

So we would either have to list the 1k occurrences through 40 requests paged by 25, make 200 count requests (one per year), or pre-compute the result (have a cron job do that regularly).

I'm juggling with ideas here because this won't be the last such case, so a recommendation is welcome :)

@benjamingeer

This sounds like an interesting use case that deserves some thought. In any case, though, I think DISTINCT isn’t what you want. In SPARQL, DISTINCT exists only for SELECT queries, not for CONSTRUCT queries, which always return a set of distinct triples. Since a Gravsearch query is a SPARQL CONSTRUCT query, whatever solution we find will have to fit within the syntax of a CONSTRUCT query.

I wonder if we could do something with VALUES, so you could supply all the years in one query.
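Sketching that idea (not valid Gravsearch today: VALUES isn't supported, and comparing a date variable with another variable isn't either; the date literals use the simple schema's knora-api:Date syntax):

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX beol: <http://api.dasch.swiss/ontology/0801/beol/simple/v2#>

# Sketch only: VALUES is not currently supported in Gravsearch.
CONSTRUCT {
  ?letter knora-api:isMainResource true .
  ?letter beol:creationDate ?date .
} WHERE {
  ?letter a beol:basicLetter .
  ?letter beol:creationDate ?date .
  VALUES ?year {
    "GREGORIAN:1700"^^knora-api:Date
    "GREGORIAN:1701"^^knora-api:Date
  }
  FILTER(?date = ?year)
}
```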

I guess by “resource” you mean “resource class”?

@loicjaouen
Contributor

I guess by “resource” you mean “resource class”?

yes, translating to a BEOL resource class, it could be:

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX beol: <http://api.dasch.swiss/ontology/0801/beol/v2#>

CONSTRUCT {
  ?letter knora-api:isMainResource true .
  ?letter beol:creationDate ?date .
} WHERE {
  ?letter a knora-api:Resource .
  ?letter a beol:basicLetter .
  ?letter beol:creationDate ?date .
}

@benjamingeer

benjamingeer commented Dec 3, 2019

After thinking about this some more, I think you have two problems here:

  1. You want years, which are not actually stored in your resources. Gravsearch can't give you this, because Gravsearch just returns what's in the resource.
  2. Knora stores dates as Julian Day Numbers, and there is no way to extract the year from the Julian Day Number in SPARQL; this requires a complex astronomical calculation, which can only be done in Knora itself.

I think the only possible solution is to add the year (represented as an integer) to the resource class.

@loicjaouen
Contributor

@benjamingeer thanks for your answer.

  2. Knora stores dates as Julian Day Numbers

so we have to request the resource, and then Knora returns a representation of the value.
That representation includes DateValue --(knora-api:dateValueHasStartYear)--> integer.
But that is only because Knora computes it on the fly; the year property is not stored as such in the triple store.
It is not quite rocket science, but already a complex astronomical calculation.

  1. Gravsearch can't give you [year], because Gravsearch just returns what's in the resource

so @tobiasschweizer told me, but thinking about it: if the year were actually stored in the triple store as such, Gravsearch queries can traverse the graph, so why would the year be one step too far in the path below?

beol:letter --(beol:creationDate)-> DateValue --(knora-api:dateValueHasStartYear)--> integer

@loicjaouen
Contributor

Concerning adding a year property: it would be redundant and therefore prone to discrepancies.
Or we could add a script that fills that field automatically.
The script could live within Knora, but I am pretty sure we don't want a stored-procedure mechanism; that would bring project-dependent code inside Knora, with extra maintenance effort.
But if the script lives outside Knora, we can't guarantee that the script will last, and thus that the data will remain consistent.

@benjamingeer

if the year were actually stored in the triple store as such, Gravsearch queries can traverse the graph, so why would the year be one step too far in the path below?

That would require us to store it inside the DateValue, which I don't think we want to do, at least not yet. That would introduce the redundancy into knora-base, and require a lot of changes to existing code, all for (so far) only one project.

Knora's dates are designed to facilitate use cases where you want to search for dates and get resources. This has proved to be a very common requirement. But so far no one else has asked to search for resources and get parts of dates (the year, the month, or the day). I'm not saying it's an unreasonable request, but I think we can't yet justify a major redesign to support it.

In any case, this by itself wouldn't solve your problem, because Gravsearch would still return the whole DateValue, containing the whole date rather than just the year. You still wouldn't get a response containing a list of years.

Concerning adding a year property: it would be redundant and therefore prone to discrepancies.

How about making it a required property, filled in automatically by your GUI when the date value is created? That would prevent redundancies as long as the project is active. If the project is no longer active, nobody will be creating new date values in it anyway, so there will also be no new discrepancies.

@benjamingeer

Actually, even putting the year in the resource isn't going to give you a list of years, because Gravsearch will still return a list of resources. I guess you could make a resource class Year, with one instance per year, and have each resource link to a Year.

@loicjaouen
Contributor

How about making it a required property, filled in automatically by your GUI when the date value is created?

for this project, for editing, we have no other GUI but Salsah.

@benjamingeer

I have an idea. If you made a GregorianYear class, with one instance per year (representing the first and last days of the year, using YEAR precision) and if we supported comparing a date variable with another date variable (instead of just with a literal), you could do something like this:

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX example: <http://example.org/ontology/1234/example/simple/v2#>

CONSTRUCT {
  ?year knora-api:isMainResource true .
  ?year example:hasYearNumber ?yearNum .
} WHERE {
  ?resource a example:MyResourceClass .
  ?year a example:GregorianYear .
  ?resource example:hasDate ?resourceDate . # a knora-api:DateValue
  ?year example:hasDate ?yearDate . # a knora-api:DateValue
  FILTER(?resourceDate = ?yearDate) # Comparison not yet supported
  ?year example:hasYearNumber ?yearNum . # an integer
}

The FILTER would match if there was any overlap between ?resourceDate and ?yearDate.
This would return a list of Year resources, each with its integer year. You could create the Year resources once, and there would be nothing to maintain for each main resource.

This would require some work in the Gravsearch transpiler, but not a major redesign. @tobiasschweizer what do you think?

@musicEnfanthen
Contributor

But so far no one else has asked to search for resources and get parts of dates (the year, the month, or the day).

Sorry to jump in, but this is kind of a non-argument. A pretty simple, and very common, use case for this kind of request would be a timeline. Asking for a resource, or even all resources, in a given dataset for a certain year, a certain month or day, or even a certain period of time (years, months, days) should be possible without "complex astronomical calculations" on the side of the end user.

In our first Webern prototype (Angular 1), I had included a timeline on a per-day basis (asking for all events that occurred in a given time span, Webern's lifetime, on the same day as the current date, a kind of "On this day in history...").
I did so by reimplementing the JDN calculation algorithm from Salsah as an Angular 1 plugin, calculating the JDN for, let's say, every 3rd of December (Webern's birthday ;) between 1883 and 1945, and then sending the list of JDNs to the API to get back the list of resources. I guess this approach creates a lot of traffic, and even more if your request is on a per-year basis (generating a lot more JDNs). So far, I have not managed to port this functionality to our new prototype (Angular 2), but it is definitely a wanted feature.
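For the record, the Gregorian-to-JDN conversion itself is the standard Fliegel–Van Flandern integer formula (all divisions truncate toward zero), which is why a client-side reimplementation stays short:

```latex
a = \operatorname{trunc}\!\left(\frac{M - 14}{12}\right), \qquad
\mathrm{JDN} = \operatorname{trunc}\!\left(\frac{1461\,(Y + 4800 + a)}{4}\right)
  + \operatorname{trunc}\!\left(\frac{367\,(M - 2 - 12a)}{12}\right)
  - \operatorname{trunc}\!\left(\frac{3\operatorname{trunc}\!\left(\frac{Y + 4900 + a}{100}\right)}{4}\right)
  + D - 32075
```

For example, 1 January 2000 (Y=2000, M=1, D=1) gives JDN 2451545.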

@benjamingeer

@musicEnfanthen

Asking for a resource, or even all resources, in a given dataset for a certain year, a certain month or day, or even a certain period of time (years, months, days) should be possible without "complex astronomical calculations" on the side of the end user.

That's already supported in Gravsearch, but it's not what @loicjaouen is asking for. He wants a list of all the years that exist in all his resources. It's the opposite of what you're talking about.

@benjamingeer

Another idea about this: we could add support for SELECT in Gravsearch queries. Then you could do the query with DISTINCT the way you want to.

@subotic subotic added this to the Backlog milestone Feb 7, 2020
@irinaschubert irinaschubert removed this from the Backlog milestone Dec 9, 2021
8 participants