Gravsearch query distinct property values #1319
Is there a more effective way to get the list of distinct values of a property (hasJournalTitle) than going through all the pages? Actually, I do that in the background on startup and it works fine: the list is used in a combo box (progressively filled) of an advanced search UI. It takes about 1 min to cover about 200 pages, and there are many duplicates; maybe there is a way to reduce the number of pages?
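For context, the paged retrieval described here is roughly the following Gravsearch query. The ontology prefix and class name are assumptions for illustration (only hasJournalTitle is named in the issue); in Gravsearch, OFFSET selects a page of results (25 per page by default), not a row offset.

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX ex: <http://example.org/ontology/0000/example/v2#>   # assumed project ontology

CONSTRUCT {
    ?article knora-api:isMainResource true .
    ?article ex:hasJournalTitle ?title .
} WHERE {
    ?article a knora-api:Resource .
    ?article a ex:Article .                  # assumed class name
    ?article ex:hasJournalTitle ?title .
}
OFFSET 0   # page number, incremented 0, 1, 2, ... over ~200 pages
```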
One possibility would be to normalise your data model. You could have a Journal resource class, so that each journal title is stored only once.
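A minimal sketch of that normalisation, assuming a hypothetical ex:Journal class with an ex:hasTitle property (names invented for illustration): since each journal would exist exactly once, a single paged query over ex:Journal yields the distinct titles directly.

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX ex: <http://example.org/ontology/0000/example/v2#>   # hypothetical ontology

CONSTRUCT {
    ?journal knora-api:isMainResource true .
    ?journal ex:hasTitle ?title .
} WHERE {
    ?journal a knora-api:Resource .
    ?journal a ex:Journal .              # one resource per journal
    ?journal ex:hasTitle ?title .
}
```

With the 256 distinct journals mentioned later in this thread, that would be about 11 pages instead of about 200.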
Yep, we thought about that, or using a list, but before adding one more resource class I wanted to know whether I was missing a solution with our current model, because I've seen that DISTINCT isn't supported in Gravsearch.
Anyway, my current solution works fine; there is a very low probability that a user wants the full list before the app has finished loading all the data.
In SPARQL itself, this would just be a SELECT DISTINCT query. Doing a query that takes a minute per user session seems like it's going to put an unnecessary load on the triplestore. There could also be other advantages to normalising the data model; you might want to store other information about these journals.
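For comparison, a sketch of what this looks like in plain SPARQL run directly against the triplestore, which is not something a project can normally do through the Knora API. The project property name is assumed; knora-base:valueHasString is the internal Knora property holding a text value's string.

```sparql
PREFIX knora-base: <http://www.knora.org/ontology/knora-base#>
PREFIX ex: <http://www.example.org/ontology/example#>   # assumed internal project ontology

SELECT DISTINCT ?titleString
WHERE {
    ?article ex:hasJournalTitle ?titleValue .            # ?titleValue is a TextValue
    ?titleValue knora-base:valueHasString ?titleString .
}
```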
Hi Ben. The current data model is the result of the migration of a not-so-great SQL data model in the first place. Given the amount of data, the burden of the import, the amount of time we have already spent on this project, and the way the current users see their database, normalising the data model is not the best option in all these respects. Even if, model-wise, it makes sense, of course.
No, 200 queries (pages). I could also add some delay between queries to let the triplestore catch its breath :)
I'm just saying that those 200 queries have to be run at the start of every user session, which could consume a lot of resources if there are a lot of users (e.g. a classroom full of students all opening the same page at the same time). Of course, I understand that you might not have time to do this now, but I think it would make sense to do it.

If you overload the triplestore with queries, they might not see anything at all...
Amongst the parameters I bear in mind when designing a data model, I have to say that I never thought about not overloading the triplestore. The data model has to mean something for the user, conceptually speaking; it also has to mean something in a generic GUI; and users have to understand it well enough to write some queries. And if they don't want to attach any metadata to a journal, then I think there is no point introducing a new class that will make the data model even more (and unnecessarily) complicated. I think we should look for another workaround here.
This is absolutely something you must bear in mind all the time. If a data model requires queries that will be too slow, or will consume excessive resources, it has to be designed differently. You're going to be sharing the DaSCH triplestore with all other projects in the DaSCH. We in the DaSCH have to make sure that no project is going to consume excessive resources, which could cause problems for everyone else. I feel like I keep repeating this a lot: triplestore CPU cycles are a scarce resource. They have to be used very frugally.
If a user can understand the concept of a journal title, it seems to me that they can certainly understand the concept of a journal.
Just to put this in context: in the past 5 years, the Knora developers have spent a lot of time working on improving the performance of SPARQL queries, and on implementing things in Knora in ways that will allow it to scale (e.g. by caching query results). That implementation work is one part of the solution. Another part is the money spent on triplestore licences, and perhaps eventually a triplestore cluster. But projects have to be involved in thinking about how to make their code scalable, too. All these things together are necessary to make the DaSCH infrastructure perform well for everyone. We have to conserve resources wherever we can, especially because we're building a system that responds in real time. Your bank doesn't process your transactions in real time; it processes everything in a batch job late at night, because that's more efficient. We will probably have to do that for some kinds of operations, too (e.g. bulk imports). But we need a system that can do a lot of interesting work in real time, and that means we have to be very careful about designing requests so they can scale. Sometimes that will mean eliminating redundancy in a data model. If we can replace 200 queries with 20 queries, that seems like a big win to me.
I am still not convinced that technical limitations should drive and constrain the data model of a research project... Besides, there is often a gap between the ideal data model that you may have in mind for a project and its implementation, because, eventually, the researcher is the one who decides how his data will be shaped (or he should at least share your views on the model...). I see myself as enforcing good practices insofar as those good practices don't interfere with the research process, but it is a negotiation. If good practices interfere too much, researchers choose other software. I understand your point of view, but you should also understand mine. Data models need to get the approval of researchers, or they will walk away.
Yes, but you should know that research teams' understanding of this matter can be limited. And requiring a research project to use a well-structured bibliography model is no small thing, especially when the project is already a complicated one.
Ask any experienced software developer who has worked on database design. Database design is always done with performance considerations in mind. Do a Google search for "database design performance" and you will find a ton of material on this. It doesn't matter whether your database contains research data, financial data, or biomedical data: if your data model doesn't allow your queries to run efficiently, your project will fail.
Absolutely. But researchers need to understand that there are technical constraints with all software systems, and that they are sharing infrastructure with many other people. I encourage you and the researchers to think of this as something like an environmental issue. No one can be allowed to pollute the environment so much that it becomes uninhabitable for everyone else. If you make too many requests per second to Facebook's API, Facebook will block your requests. This is called "rate limiting". Knora doesn't currently implement rate limiting, but perhaps we will have to. And we are not Facebook, with practically infinite resources to spend. We are doing the best we can with the resources we have.
You have not explained to me how introducing the concept of "Journal" along with "Journal title" would interfere with anything.
I'm not asking for that. I'm just asking you to avoid making 200 queries that return mostly redundant data, which is just basic good software development practice. I don't think that's asking for the moon.
For sure; even with a relational database we would probably have to add an index and/or a cache to get acceptable performance. The dev context is difficult for everyone. We don't have the resources and time to migrate the model and data from a finely tuned database to Knora and at the same time upgrade their complex web app to modern UX design concepts and frameworks. So we preferred to reproduce the same old-fashioned web site with a decent framework (Angular), even if that means finding ugly (temporary) workarounds from time to time :)

That being said, back to my first problem: does it make sense to fetch all these values? No, it is a workaround and quite a stupid solution. From about 5,000 values (200 pages x 25), I keep only 256 distinct values, multi-selectable in a combo box. Not to mention the fact that nobody will read all those values! The only solution I see is to provide a search text field that fills the combo box according to what has been typed (imitating the Salsah way of linking resources), as sketched below. But as usual, that takes a little longer to develop. This is probably the price to pay in a modern web app: serve lots of data with a smooth, reactive user interface, and all that with just one developer. That is almost the moon ;-)
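A sketch of that text-field idea as a single Gravsearch query per lookup (simple schema; the prefix and class name are assumed), using a regex FILTER so that only matching titles come back:

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ex: <http://example.org/ontology/0000/example/simple/v2#>   # assumed

CONSTRUCT {
    ?article knora-api:isMainResource true .
    ?article ex:hasJournalTitle ?title .
} WHERE {
    ?article a ex:Article .                  # assumed class name
    ?article ex:hasJournalTitle ?title .
    FILTER regex(?title, "^Journ", "i")      # "Journ" stands for the user's input so far
}
```

The results are still resources rather than distinct titles, so the combo box still has to deduplicate client-side, but over far fewer pages.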
Is there a special reason for that? Using DISTINCT in a CONSTRUCT query doesn't seem to be possible.
I would guess it's because every CONSTRUCT result is a set of triples, and the members of a set are distinct by definition.
If it helps, there's an API v2 route for this.
I understand completely. I'm not saying you have to do this now, I'm just saying please do it later when you have time.
Here is another use case: a project has a resource class with a date property, and one thousand instances of this resource spanning two hundred years. It doesn't sound quite reasonable to ask them to refactor the ontology and split DateValue by year in order to achieve that. So we would either have to list the 1k occurrences through 40 requests paged by 25, or make 200 count requests (one per year)... or pre-compute the result (have a cron job do that regularly). I'm juggling ideas here because it won't be the last such case, so a recommendation is welcome :)
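If the 200 count requests turn out to be the least bad option, each of them could be a Gravsearch count query (POST to the /v2/searchextended/count route, which returns only the number of matches). A sketch in the simple schema, with assumed names, and assuming that comparing a date property with a year-precision Date literal matches the dates falling within that year:

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ex: <http://example.org/ontology/0000/example/simple/v2#>   # assumed

CONSTRUCT {
    ?res knora-api:isMainResource true .
} WHERE {
    ?res a ex:MyResourceClass .
    ?res ex:hasDate ?date .
    FILTER(?date = "GREGORIAN:1800"^^knora-api:Date)   # one such query per year
}
```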
This sounds like an interesting use case that deserves some thought. In any case, though, I wonder if we could do something with Gravsearch here. I guess by "resource" you mean "resource class"?
Yes, translating to a BEOL resource class, it could be:

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX beol: <http://api.dasch.swiss/ontology/0801/beol/v2#>

CONSTRUCT {
    ?letter knora-api:isMainResource true .
    ?letter beol:creationDate ?date .
} WHERE {
    ?letter a knora-api:Resource .
    ?letter a beol:basicLetter .
    ?letter beol:creationDate ?date .
}
After thinking about this some more, I think you have two problems here: Gravsearch returns resources rather than bare values (such as years), and it has no way of returning only distinct values. I think the only possible solution is to add the year (represented as an integer) to the resource class.
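A sketch of that suggestion with a hypothetical beol:hasYear integer property (this property does not exist in BEOL; the simple schema is used so the integer can be filtered directly):

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX beol: <http://api.dasch.swiss/ontology/0801/beol/simple/v2#>

CONSTRUCT {
    ?letter knora-api:isMainResource true .
    ?letter beol:hasYear ?year .    # hypothetical integer property
} WHERE {
    ?letter a beol:basicLetter .
    ?letter beol:hasYear ?year .
    FILTER(?year >= 1700)           # integer comparisons are supported
}
```

Each result is still a letter resource, so a distinct list of years would still have to be computed client-side.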
@benjamingeer thanks for your answer.
So we have to request the resource, and then Knora returns a representation of the value.
So @tobiasschweizer told me. But thinking about it: if beol:letter --(beol:creationDate)--> DateValue --(knora-api:dateValueHasStartYear)--> integer, couldn't a Gravsearch query follow that path and return just the year?
Concerning adding a property like that: it would require us to store the year inside the triplestore. Knora's dates are designed to facilitate use cases where you want to search for dates and get resources. This has proved to be a very common requirement. But so far no one else has asked to search for resources and get parts of dates (the year, the month, or the day). I'm not saying it's an unreasonable request, but I think we can't yet justify a major redesign to support it. In any case, this by itself wouldn't solve your problem, because Gravsearch would still return the whole date value.
How about making it a required property, filled in automatically by your GUI when the date value is created? That would prevent discrepancies as long as the project is active. If the project is no longer active, nobody will be creating new date values in it anyway, so there will also be no new discrepancies.
Actually, even putting the year in the resource isn't going to give you a list of years, because Gravsearch will still return a list of resources. I guess you could make a resource class representing a year.
For this project, the only editing GUI we have is Salsah.
I have an idea. If you made a resource class representing a Gregorian year, you could write a Gravsearch query like this:

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX example: <http://example.org/ontology/1234/example/simple/v2#>

CONSTRUCT {
    ?year knora-api:isMainResource true .
    ?year example:hasYearNumber ?yearNum .
} WHERE {
    ?resource a example:MyResourceClass .
    ?year a example:GregorianYear .
    ?resource example:hasDate ?resourceDate . # a knora-api:DateValue
    ?year example:hasDate ?yearDate .         # a knora-api:DateValue
    FILTER(?resourceDate = ?yearDate)         # comparison not yet supported
    ?year example:hasYearNumber ?yearNum .    # an integer
}

The FILTER comparing the two date values is the part that isn't supported yet. This would require some work in the Gravsearch transpiler, but not a major redesign. @tobiasschweizer what do you think?
Sorry to jump in, but this is kind of a non-argument. A pretty simple, and very common, use case for this kind of request would be a timeline. Asking for a resource, or even all resources, in a given dataset for a certain year, a certain month or day, or even a certain period of time (years, months, days), should be possible without "complex astronomical calculations" on the part of the end user. In our first Webern prototype (Angular 1), I had included a timeline on a per-day basis (asking for all events that occurred in a given time span, Webern's lifetime, on the same day as the current date, a kind of "On this day in history...").
That's already supported in Gravsearch, but it's not what @loicjaouen is asking for. He wants a list of all the years that exist in all his resources. It's the opposite of what you're talking about.
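For reference, a sketch of the already-supported time-span case in the simple schema (class and property names are assumed; the dates are Webern's birth and death dates):

```sparql
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX ex: <http://example.org/ontology/0000/example/simple/v2#>   # assumed

CONSTRUCT {
    ?event knora-api:isMainResource true .
    ?event ex:hasDate ?date .
} WHERE {
    ?event a ex:Event .
    ?event ex:hasDate ?date .
    FILTER(?date >= "GREGORIAN:1883-12-03"^^knora-api:Date &&
           ?date <= "GREGORIAN:1945-09-15"^^knora-api:Date)
}
```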
Another idea about this: we could add support for DISTINCT in Gravsearch.