Large amount of queries when getting a dataset with API #9683

Closed
ErykKul opened this issue Jun 28, 2023 · 3 comments · Fixed by #9684

Comments

@ErykKul
Collaborator

ErykKul commented Jun 28, 2023

What steps does it take to reproduce the issue?
Retrieve a large dataset (containing thousands of files) through the API

  • When does this issue occur?
    It is the same for all datasets, but is more problematic for larger datasets.

  • Which page(s) does it occur on?
    API calls

  • What happens?
    When we check the query log, we see multiple queries for each file of the dataset.

  • To whom does it occur (all users, curators, superusers)?
    All users.

  • What did you expect to happen?
    One larger query that retrieves all the necessary data at once.
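The pattern described above (one query per file versus one query for everything) can be illustrated with a minimal sketch. It uses an in-memory SQLite database, and the table and column names (`datafile`, `filemetadata`, `label`, `checksum`) are hypothetical stand-ins, not the actual Dataverse schema or its persistence code.

```python
import sqlite3


def build_demo_db(n_files):
    """Create a throwaway database with one dataset holding n_files files.

    The schema is hypothetical and only serves to illustrate the pattern.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE datafile (id INTEGER PRIMARY KEY, dataset_id INTEGER, checksum TEXT);
        CREATE TABLE filemetadata (id INTEGER PRIMARY KEY, datafile_id INTEGER, label TEXT);
        """
    )
    for i in range(1, n_files + 1):
        conn.execute("INSERT INTO datafile VALUES (?, 1, ?)", (i, f"md5-{i}"))
        conn.execute("INSERT INTO filemetadata VALUES (?, ?, ?)", (i, i, f"file-{i}.txt"))
    conn.commit()
    return conn


def count_selects(conn, work):
    """Run work(conn) and return (result, number of SELECT statements executed)."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        result = work(conn)
    finally:
        conn.set_trace_callback(None)
    selects = sum(1 for s in statements if s.lstrip().upper().startswith("SELECT"))
    return result, selects


def load_per_file(conn):
    """One query for the file list, then one extra query per file."""
    files = conn.execute(
        "SELECT id, checksum FROM datafile WHERE dataset_id = 1 ORDER BY id"
    ).fetchall()
    rows = []
    for file_id, checksum in files:
        label = conn.execute(
            "SELECT label FROM filemetadata WHERE datafile_id = ?", (file_id,)
        ).fetchone()[0]
        rows.append((file_id, checksum, label))
    return rows


def load_joined(conn):
    """A single larger query that fetches the same data with a join."""
    return conn.execute(
        "SELECT f.id, f.checksum, m.label "
        "FROM datafile f JOIN filemetadata m ON m.datafile_id = f.id "
        "WHERE f.dataset_id = 1 ORDER BY f.id"
    ).fetchall()
```

With 50 files, `count_selects(conn, load_per_file)` reports 51 SELECT statements, while `count_selects(conn, load_joined)` reports 1 for the same rows. The gap grows linearly with the number of files, which is why the problem is most visible on large datasets.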

Which version of Dataverse are you using?
Develop.

Any related open or closed issues to this bug report?

A better solution would be an API call that retrieves only the dataset, plus a separate, paginated API call for retrieving the file metadata, as proposed by @Kris-LIBIS. However, many existing applications already use the retrieve-dataset API call, which includes all file metadata in one response, so making that call more efficient should be beneficial too.
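As a rough sketch of how a client might consume such a paginated file-metadata call: the `fetch_page(offset, limit)` callable below is hypothetical and stands in for whatever request the API would eventually expose. It is not an existing Dataverse endpoint or parameter set.

```python
from typing import Callable, Dict, Iterator, List


def iter_file_metadata(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 100,
) -> Iterator[Dict]:
    """Yield file-metadata records one page at a time.

    fetch_page(offset, limit) is a hypothetical callable that returns at most
    `limit` records starting at `offset`, and an empty list past the end.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            # A short page means there is nothing left to fetch.
            return
        offset += len(page)
```

A client that only needs the dataset metadata would never call this at all, and a client that does need the files would issue several small requests instead of one large one, which keeps each request well within a fixed timeout.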

@ErykKul
Collaborator Author

ErykKul commented Sep 1, 2023

An API call that retrieves only the dataset metadata, without the file metadata, would be a nice improvement for our tools that use the API. In most cases, we are only interested in the dataset metadata, not the files. The current call is too heavy for large datasets with many files; we even have to skip some datasets because of the one-minute timeout.

@jggautier
Contributor

#9763 is related, right? The solution being described there seems very relevant. Sorry if you're already aware of this and I'm just adding noise. I'm pretty interested in these improvements, too :)

@ErykKul
Collaborator Author

ErykKul commented Sep 1, 2023

I was not aware of that issue, but it is related. I reopened the PR and did the merge; it might be worth considering and testing. It should improve response times for datasets with many files when using the regular call that retrieves the dataset metadata.

3 participants