Deliverable: Slow response for datasets with high number of files #29

Closed

mreekie opened this issue Mar 22, 2023 · 4 comments
mreekie (Collaborator) commented Mar 22, 2023

This is an umbrella issue for everything we are seeing related to slow responses for datasets with a high number of files.
What follows is a rough cut of what's been discussed so far about what would need to be done to complete this deliverable.

As a "bklog: Deliverable" This is decomposed into smaller issues.

  • Each of the smaller issues gets the label "D: Dataset: large number of files"
  • This issue, the only issue to have both labels ("bklog: Deliverable", "D: Dataset: large number of files"), will stay in the Dataverse_Global_Backlog project forever.
  • It will stay in its present column until the team feels the issue has no smaller issues that need to be broken off in order to resolve it
  • At that point, this issue stays in the Dataverse_Global_Backlog, but changes its status to: "Clear of the Backlog"

We're talking about this issue in tech hours. Here are some pain points for users:

  • Editing the title (or other piece of metadata) when there are many files (e.g., 30,000). The save is prohibitively expensive. Affects depositors. Maybe removing the cascade will help; there are two cascades. We could write tests comparing the 1000files.zip file vs. a single file: how long does it take to edit the title? (See the timing sketch after this list.)
  • Slow indexing of a dataset with 30,000 files. Affects sysadmins.
  • Only 20 files but many versions. Creating the next version is slow. Multiplying effect. Affects depositors. Is reindexing the slow part?
  • ...
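
To make the "how long to edit the title?" question measurable, here is a minimal timing sketch against the native `editMetadata` API. It assumes a local install, a hypothetical API token and dataset PID, and that the endpoint accepts the metadata fragment as a JSON request body; it is a starting point for comparison runs (1 file vs. the 1000files.zip contents), not a definitive benchmark harness.

```python
import time

import requests

BASE_URL = "http://localhost:8080"                    # hypothetical local install
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"    # hypothetical token
PID = "doi:10.5072/FK2/EXAMPLE"                       # hypothetical dataset PID

# Metadata fragment that replaces just the title field.
payload = {"fields": [{"typeName": "title", "value": "Timing test title"}]}

start = time.perf_counter()
r = requests.put(
    f"{BASE_URL}/api/datasets/:persistentId/editMetadata",
    params={"persistentId": PID, "replace": "true"},
    headers={"X-Dataverse-key": API_TOKEN},
    json=payload,
)
print(f"HTTP {r.status_code}; title edit took {time.perf_counter() - start:.2f}s")
```

Running this once against a one-file dataset and once against a dataset populated from 1000files.zip would give a first data point on how save time scales with file count.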

Other discussion:

  • Does the new zip previewer/downloader help?
  • Creating a large JSON for the tree view?
  • Let's benchmark and measure performance.
  • Check open file handles (see the sketch after this list).
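
For the file-handle check, a small sketch like the following (Linux `/proc` only) could be pointed at the app server process (e.g., Payara) before and after a large-dataset operation; which PID to pass is an assumption about your deployment.

```python
import os
import sys

# Count open file descriptors for a given PID via /proc (Linux only).
# Pass the app server's PID as the first argument; defaults to this script.
pid = sys.argv[1] if len(sys.argv) > 1 else str(os.getpid())
print(f"PID {pid}: {len(os.listdir(f'/proc/{pid}/fd'))} open file descriptors")
```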

@linsherpa thanks for chatting and opening this issue and the ticket.

I'll note that out of the box Dataverse only allows you to unzip 1000 files at a time from a zip file: https://guides.dataverse.org/en/5.11.1/installation/config.html#multipleuploadfileslimit ... That's the most official statement I could find about how many files are supported in a single dataset... not a very strong one.
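
For anyone experimenting with that limit, a sketch of changing it through the admin settings API (normally reachable only from localhost) might look like this; the value 2000 is an arbitrary example, not a recommendation.

```python
import requests

# Raise :MultipleUploadFilesLimit via the admin settings endpoint.
r = requests.put(
    "http://localhost:8080/api/admin/settings/:MultipleUploadFilesLimit",
    data="2000",
)
print(r.status_code, r.text)
```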

As you mentioned, the practical workaround is probably to double zip the files.
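
A minimal sketch of that double-zip workaround, using only the Python standard library (the inner archive name is hypothetical):

```python
import zipfile
from pathlib import Path

inner = Path("files.zip")  # hypothetical archive holding the many files

# Store (don't recompress) the already-compressed inner zip inside an outer
# zip. Dataverse unpacks the outer archive on upload, so the inner zip lands
# as a single file instead of exploding into thousands of entries.
with zipfile.ZipFile("wrapped.zip", "w", zipfile.ZIP_STORED) as outer:
    outer.write(inner, arcname=inner.name)
```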

For developers I'll mention that scripts/search/data/binary/1000files.zip has 1000 small files we can test with.
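
As a sketch of how 1000files.zip could be used in a timing test, the following uploads it through the native add-file API and reports the wall-clock time; the base URL, token, and PID are hypothetical placeholders.

```python
import time

import requests

BASE_URL = "http://localhost:8080"                    # hypothetical local install
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"    # hypothetical token
PID = "doi:10.5072/FK2/EXAMPLE"                       # hypothetical dataset PID

with open("scripts/search/data/binary/1000files.zip", "rb") as f:
    start = time.perf_counter()
    r = requests.post(
        f"{BASE_URL}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("1000files.zip", f, "application/zip")},
    )
    elapsed = time.perf_counter() - start

print(f"HTTP {r.status_code}; upload + unzip took {elapsed:.2f}s")
```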

Finally, here are some open issues related to large numbers of files in a dataset:

mreekie (Collaborator, Author) commented Mar 22, 2023

Sizing:

  • At today's sizing session we established this issue as the umbrella for the larger problem that has been prioritized for fixing.
  • All of the issues specifically called out in the description have had the label: "D: Dataset: large number of files" added to them.

Note:
In this case, this backlog deliverable ("bklog: Deliverable") has a list of associated issues that existed prior to the deliverable being established. This situation is different from when the team identifies a big thing that needs to be actively broken down, as in breaking down an iceberg.

There's an added step needed:

  • Review the existing issues.
  • Create a description of what the actual scope of fixing the underlying problem looks like
  • Describe that in place of or in addition to the very rough description that exists for this issue right now.

At that point this deliverable can then be treated like an iceberg and decomposed as a plan takes shape.

  • The original issues associated with this deliverable may or may not explicitly be part of that plan

Next Steps:

  • Create a spike to look into the underlying big picture and write it up in this issue description.

@mreekie mreekie moved this from SPRINT- NEEDS SIZING to Dataverse Team (Gustavo) in IQSS Dataverse Project Mar 27, 2023
@cmbz cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project Mar 18, 2024
qqmyers (Member) commented Apr 10, 2024

IQSS/dataverse#9683 is another related issue.

cmbz (Contributor) commented Apr 10, 2024

2024/04/10

  • Closing as complete. We will open future issues as needed in the context of the SPA.
