Deliverable: Slow response for datasets with high number of files #29

Closed

mreekie opened this issue Mar 22, 2023 · 4 comments
mreekie (Collaborator) commented Mar 22, 2023

This is an umbrella issue for everything we are seeing related to slow responses for datasets with a high number of files.
What follows is a rough cut of what's been discussed so far about what would need to be done to complete this deliverable.

As a "bklog: Deliverable" This is decomposed into smaller issues.

  • Each of the smaller issues gets the label "D: Dataset: large number of files"
  • This issue, the only issue to have both labels ("bklog: Deliverable", "D: Dataset: large number of files"), will stay in the Dataverse_Global_Backlog project forever.
  • It will stay in its present column until the team feels the issue has no smaller issues that need to be broken off in order to resolve it
  • At that point, this issue stays in the Dataverse_Global_Backlog, but changes its status to: "Clear of the Backlog"

We're talking about this issue in tech hours. Here are some pain points for users:

  • Editing the title (or other piece of metadata) when there are many files (e.g., 30,000). The save is prohibitively expensive. Affects depositors. Maybe removing the cascade will help; there are two cascades. We could write tests comparing the 1000files.zip file vs. a single file: how long does it take to edit the title? (See the timing sketch after this list.)
  • Slow indexing of a dataset with 30,000 files. Affects sysadmins.
  • Only 20 files but many versions. Creating the next version is slow. Multiplying effect. Affects depositors. Is reindexing the slow part?
  • ...
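
To make the "how long to edit the title?" question measurable, here is a minimal timing sketch against the native `editMetadata` API. It assumes a local install, a hypothetical API token and dataset PID, and that the endpoint accepts the metadata fragment as a JSON request body; it is a starting point for comparison runs (1 file vs. the 1000files.zip contents), not a definitive benchmark harness.

```python
import time

import requests

BASE_URL = "http://localhost:8080"                    # hypothetical local install
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"    # hypothetical token
PID = "doi:10.5072/FK2/EXAMPLE"                       # hypothetical dataset PID

# Metadata fragment that replaces just the title field.
payload = {"fields": [{"typeName": "title", "value": "Timing test title"}]}

start = time.perf_counter()
r = requests.put(
    f"{BASE_URL}/api/datasets/:persistentId/editMetadata",
    params={"persistentId": PID, "replace": "true"},
    headers={"X-Dataverse-key": API_TOKEN},
    json=payload,
)
print(f"HTTP {r.status_code}; title edit took {time.perf_counter() - start:.2f}s")
```

Running this once against a one-file dataset and once against a dataset populated from 1000files.zip would give a first data point on how save time scales with file count.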

Other discussion:

  • Does the new zip previewer/downloader help?
  • Creating a large JSON for the tree view?
  • Let's benchmark and measure performance.
  • Check open file handles (see the sketch after this list).
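
For the file-handle check, a small sketch like the following (Linux `/proc` only) could be pointed at the app server process (e.g., Payara) before and after a large-dataset operation; which PID to pass is an assumption about your deployment.

```python
import os
import sys

# Count open file descriptors for a given PID via /proc (Linux only).
# Pass the app server's PID as the first argument; defaults to this script.
pid = sys.argv[1] if len(sys.argv) > 1 else str(os.getpid())
print(f"PID {pid}: {len(os.listdir(f'/proc/{pid}/fd'))} open file descriptors")
```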

@linsherpa thanks for chatting and opening this issue and the ticket.

I'll note that out of the box Dataverse only allows you to unzip 1000 files at a time from a zip file: https://guides.dataverse.org/en/5.11.1/installation/config.html#multipleuploadfileslimit ... That's the most official statement I could find about how many files are supported in a single dataset... not a very strong one.
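
For anyone experimenting with that limit, a sketch of changing it through the admin settings API (normally reachable only from localhost) might look like this; the value 2000 is an arbitrary example, not a recommendation.

```python
import requests

# Raise :MultipleUploadFilesLimit via the admin settings endpoint.
r = requests.put(
    "http://localhost:8080/api/admin/settings/:MultipleUploadFilesLimit",
    data="2000",
)
print(r.status_code, r.text)
```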

As you mentioned, the practical workaround is probably to double zip the files.
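
A minimal sketch of that double-zip workaround, using only the Python standard library (the inner archive name is hypothetical):

```python
import zipfile
from pathlib import Path

inner = Path("files.zip")  # hypothetical archive holding the many files

# Store (don't recompress) the already-compressed inner zip inside an outer
# zip. Dataverse unpacks the outer archive on upload, so the inner zip lands
# as a single file instead of exploding into thousands of entries.
with zipfile.ZipFile("wrapped.zip", "w", zipfile.ZIP_STORED) as outer:
    outer.write(inner, arcname=inner.name)
```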

For developers I'll mention that scripts/search/data/binary/1000files.zip has 1000 small files we can test with.
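
As a sketch of how 1000files.zip could be used in a timing test, the following uploads it through the native add-file API and reports the wall-clock time; the base URL, token, and PID are hypothetical placeholders.

```python
import time

import requests

BASE_URL = "http://localhost:8080"                    # hypothetical local install
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"    # hypothetical token
PID = "doi:10.5072/FK2/EXAMPLE"                       # hypothetical dataset PID

with open("scripts/search/data/binary/1000files.zip", "rb") as f:
    start = time.perf_counter()
    r = requests.post(
        f"{BASE_URL}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        files={"file": ("1000files.zip", f, "application/zip")},
    )
    elapsed = time.perf_counter() - start

print(f"HTTP {r.status_code}; upload + unzip took {elapsed:.2f}s")
```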

Finally, here are some open issues related to large numbers of files in a dataset:

mreekie (Collaborator, Author) commented Mar 22, 2023

Sizing:

  • At today's sizing session we established this issue as the umbrella for the larger problem that has been prioritized for fixing.
  • All of the issues specifically called out in the description have had the label: "D: Dataset: large number of files" added to them.

Note:
In this case, this backlog deliverable ("bklog: Deliverable") has a list of associated issues that existed prior to the deliverable being established. This situation is different from when the team identifies a big thing that needs to be actively broken down, as in breaking down an iceberg.

There's an added step needed:

  • Review the existing issues.
  • Create a description of what the actual scope of fixing the underlying problem looks like
  • Describe that in place of or in addition to the very rough description that exists for this issue right now.

At that point this deliverable can then be treated like an iceberg and decomposed as a plan takes shape.

  • The original issues associated with this deliverable may or may not explicitly be part of that plan

Next Steps:

  • Create a spike to look into the underlying big picture and write it up in this issue description.

@mreekie mreekie moved this from SPRINT- NEEDS SIZING to Dataverse Team (Gustavo) in IQSS Dataverse Project Mar 27, 2023
@cmbz cmbz moved this to SPRINT- NEEDS SIZING in IQSS Dataverse Project Mar 18, 2024
qqmyers (Member) commented Apr 10, 2024

IQSS/dataverse#9683 is another related issue.

cmbz (Contributor) commented Apr 10, 2024

2024/04/10

  • Closing as complete. We will open future issues as needed in the context of the SPA.
