Performance: Slow response for the versions API call with large number of files or versions #9763
Thanks @tainguyenbui for the detailed writeup (with data!)! |
Moving to the project board so @scolapasta and the dev team can discuss and bring it into a future sprint |
Unfortunately, we have a similar problem in our Dataverse installation. We have a dataset with 34,618 files, all stored in S3 storage. When someone makes a request to this dataset, Dataverse is slow to respond and users are reporting the problem. Do you have any other suggestions we could apply in Dataverse? Regards, KMIT - CIMMYT |
For datasets with this number of files in Harvard Dataverse, we have included instructions that access should be done via the API instead of directly via the UI. We discourage the use of "download all" for a dataset of this size.
|
@Gerafp when you say "make a request", do you mean download? If so, you could try wget with the new-ish "dirindex" view of files: https://guides.dataverse.org/en/5.12/api/native-api.html#view-dataset-files-and-folders-as-a-directory-index |
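For illustration, a minimal Python sketch of API-based bulk download, assuming a public dataset (the host and dataset ID are placeholders; the wget/dirindex route in the link above is the lighter-weight option):

```python
import requests

BASE = "https://demo.dataverse.org"  # placeholder host
DATASET_ID = 24  # placeholder dataset database id

# List the files in the latest version via the native API
files = requests.get(
    f"{BASE}/api/datasets/{DATASET_ID}/versions/:latest/files"
).json()["data"]

for f in files:
    file_id = f["dataFile"]["id"]
    # Stream files one at a time instead of requesting a single huge "download all" zip
    with requests.get(f"{BASE}/api/access/datafile/{file_id}", stream=True) as r:
        r.raise_for_status()
        with open(f["label"], "wb") as out:
            for chunk in r.iter_content(chunk_size=1 << 20):
                out.write(chunk)
```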
Hi @pdurbin, thanks for your response. I mean when a user opens the dataset's page or when a curator needs to edit the metadata. |
Last time (3 weeks ago), we took some notes here: #8928 (comment) Huge issue. Not clear. Not ready to be estimated. Our current strategy is a workaround: avoid having 30,000 files in a single dataset. We'd like to use this issue as an umbrella issue for all problems regarding datasets with too many files. We should split off separate, smaller chunks. |
In case this is helpful: when I use the APIs to get the metadata of all versions of a dataset, the call is slow when the dataset has many versions and possibly many files (and maybe a lot of metadata). I've looked for a way to break the task into multiple API calls, one for each version of the dataset. A single API call can get the metadata of a particular version. But as far as I know, the only way to use the APIs to get the list of version numbers of a dataset is the endpoint that returns information about all versions, which is the very call I would like to avoid. So I'm wondering if a better solution might be improved support for getting information about all dataset versions by making it easier to break the task into multiple API calls. For example, if there were an endpoint that returns the list of version numbers of a given dataset, I could use that list to make multiple API calls, one for each version (sketched below). |
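To illustrate the proposed split, a sketch in Python: the `versionNumbers` endpoint below is hypothetical (it does not exist in the API today, which is the point of the suggestion), while the per-version call is the existing one; host and dataset ID are placeholders:

```python
import requests

BASE = "https://demo.dataverse.org"  # placeholder
DATASET_ID = 24  # placeholder

# Hypothetical lightweight endpoint that would return only version numbers,
# e.g. ["1.0", "1.1", "2.0"] -- this is the missing piece described above
numbers = requests.get(f"{BASE}/api/datasets/{DATASET_ID}/versionNumbers").json()["data"]

# Existing endpoint: one small request per version instead of one huge response
versions = [
    requests.get(f"{BASE}/api/datasets/{DATASET_ID}/versions/{v}").json()["data"]
    for v in numbers
]
```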
Not expanding all the information would definitely help with the payload size. I've not looked at the data structure of versions, but if we could do some lazy loading it could also speed up the query. Your solution would have been good enough for us, despite needing metadata and files once the version is selected. So, as a result, two endpoints:
However, take into consideration that it could mean breaking changes for the existing applications using the current versions endpoint (see the sketch below). |
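One way to avoid the breaking change would be an opt-in query parameter, so the default response stays exactly as it is today. A sketch only; the `excludeFiles` parameter name here is an assumption for illustration, not a documented flag:

```python
import requests

BASE = "https://demo.dataverse.org"  # placeholder
DATASET_ID = 24  # placeholder

# Default: unchanged behavior, full file metadata per version (existing clients unaffected)
full = requests.get(f"{BASE}/api/datasets/{DATASET_ID}/versions").json()["data"]

# Opt-in: a flag like this (name is illustrative) would drop per-file metadata
# and keep the payload small for datasets with many files
slim = requests.get(
    f"{BASE}/api/datasets/{DATASET_ID}/versions",
    params={"excludeFiles": "true"},
).json()["data"]
```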
@GPortas I just want to confirm quickly that you are OK with what I'm doing in this branch w.r.t. the versions API. I'm very open to any input/if any changes are needed. |
To summarize, in this branch I'm addressing the performance issues in the versions API via a combination of the following approaches:
The following real-life datasets from the IQSS production service are used in the sample tests below:
Note: the tests above were run on the dedicated IQSS test system (not in production, which is a beefier and faster system). I will add more info, and will include this in the PR as well. |
(to clarify, the results in the last update are with the extra "citation date" logic commented out from the code; I am working on addressing that) |
Hi Dataverse Team,
I just wanted to share with you something that affects datasets with a high number of versions or files.
There have been some related issues that were already solved/closed. However, I strongly believe that the problem may still occur. I will back up the issue with some interesting data.
The problem below happens when retrieving dataset versions information through the native API, hitting the endpoint: http://demo.dataverse.org/api/datasets/<dataset-id>/versions
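For reference, a minimal sketch of the call being measured (the host and dataset ID are placeholders):

```python
import time
import requests

BASE = "https://demo.dataverse.org"  # placeholder
DATASET_ID = 24  # placeholder

start = time.time()
resp = requests.get(f"{BASE}/api/datasets/{DATASET_ID}/versions")
resp.raise_for_status()
versions = resp.json()["data"]

# Each entry is a full version object including every file's metadata,
# so payload size (and response time) grows with files x versions
print(f"{len(versions)} versions, {len(resp.content) / 1e6:.1f} MB payload, "
      f"{time.time() - start:.1f}s")
```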
Given that I am a user with many files and versions in a dataset
When I retrieve all the dataset versions
Then I would like to receive a fairly "fast" response
So that the user experience is smooth
Current behavior
When the dataset has a large number of files and also a large number of versions, the response time increases dramatically. This can be seen in the table below.
One of our concerns here is that the dataset with id 767863 already takes a long time with only 4 versions, which means that once it reaches, for instance, 10 versions, it may easily take more than 20 seconds to respond and could potentially cause a timeout in Dataverse.
Additionally, Dataverse currently returns all the files and their metadata for each of the available versions as part of the response. That causes a very large response payload that may be unnecessary.
Note: The number of files also seems to affect the speed at which a version is published
Expected behavior
To have the ability to retrieve dataset versions information in an efficient way that does not massively impact the response time.
Possible solution
To return basic metadata about each of the available dataset versions, without the file information that could be the real source of the problem.
To review whether there are possible parallelization improvements.
It is likely that the user will not need all the information for every version up front. Normally, they would click on the version they are interested in, at which point we could perform another request such as http://demo.dataverse.org/api/datasets/<dataset-id>/versions/<version> (see the sketch below).
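A sketch of that lazy interaction pattern using the existing per-version endpoints (placeholder IDs; this assumes the initial version list could be returned without file metadata, as proposed above):

```python
import requests

BASE = "https://demo.dataverse.org"  # placeholder
DATASET_ID = 24  # placeholder

# Once the user picks a version from the (ideally lightweight) list,
# fetch that single version on demand via the existing endpoint
version = "2.0"  # e.g. the version the user clicked
detail = requests.get(
    f"{BASE}/api/datasets/{DATASET_ID}/versions/{version}"
).json()["data"]

# File metadata can likewise be fetched per version, only when actually needed
files = requests.get(
    f"{BASE}/api/datasets/{DATASET_ID}/versions/{version}/files"
).json()["data"]
```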
Thanks a lot!