
Performance: Publishing dataset with large number of files via DataCite takes too long. #5283

Closed
kcondon opened this issue Nov 6, 2018 · 7 comments
Labels
D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27 Feature: Performance & Stability

Comments

@kcondon
Contributor

kcondon commented Nov 6, 2018

A user in production had 100 datasets, each with 500 files. They had problems publishing them, eventually ending up with permanently locked datasets, likely due to a server restart somewhere during the process.

This became a support issue (RT 268791) in which a script had to be written to delete the locks. While benchmarking publication of these datasets, I observed that a single dataset takes 24 minutes, 2 concurrent datasets take roughly 2x longer, and 4 datasets take 4x+ longer. Performance appears to degrade linearly with the number of PIDs.

This is a great benchmarking case because all the datasets are the same size.

Questions:

  1. Why is publishing performance for a single 500-file dataset so poor against DataCite (6x slower than EZID)?
  2. Why does publishing n datasets concurrently take n times longer?

DataCite performance was improved in a recent community PR by @qqmyers; while a big improvement, it would likely not fully address performance at this scale.

@qqmyers
Member

qqmyers commented Nov 6, 2018

@kcondon - why "would likely not address this level"? The duplicate calls were a factor of 2, and keeping a single connection rather than renegotiating SSL for each call was also significant. Are you thinking there's another factor of ~2? Or is there something else going on?

FWIW: I just tried a test publication (10.5072) with 1000 files on our small dev machine and it took ~11 minutes, almost all of it in the per-file for loop as expected. If it's linear, that's around 5 minutes for 500 files, which would be close to what you saw for EZID. (The time is all in the idServiceBean.publicizeIdentifier(df) calls, but I have not checked whether the time is in the HTTPS calls or in generating the datacite.xml file to send... if it's the latter, it might help explain the concurrent performance.)

@kcondon
Contributor Author

kcondon commented Nov 7, 2018

@qqmyers
I need to make more careful measurements to be precise, but I was first responding to the production performance issue (see the two issues) and acknowledging your contribution to performance; based on my rough recollection of a ~40% improvement, I was not sure whether it addressed all of the issues. Your performance improvement is significant and will help a lot. What worried me is that publishing time appears directly proportional to the number of concurrent publications, i.e. concurrent registration does not yield higher throughput. That means one user publishing 100 datasets would clog up the works for a user with a single dataset to publish. I was mostly trying to understand that aspect, but the 6x performance hit versus EZID also seemed fairly outsized, and I was not sure your changes addressed that part in full.

Does that clarify my thinking, or at least my response? I will do a head-to-head comparison with 500 files on both EZID and DataCite, with single and multiple datasets. It just takes time to do, and this was a side issue I encountered, as I'm sure you can relate.

@qqmyers
Member

qqmyers commented Nov 7, 2018

@kcondon - no worries. FWIW, I just did a quick test and the two calls to DataCite (postMetadata, postUrl) are taking ~380-600 ms each, averaging maybe 450 ms, with ~100-200 ms between calls (the time to process a file, create the metadata doc, etc.).
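As a rough sanity check, those per-call figures can be turned into a back-of-envelope publish-time estimate. This is a sketch only: `estimated_publish_seconds` is a hypothetical helper, and the millisecond values are the averages quoted above, assumed constant per file.

```python
# Back-of-envelope estimate of publish time for an n-file dataset,
# assuming two sequential DataCite calls per file at ~450 ms each,
# plus ~150 ms of local processing between calls (figures from the
# timings quoted above; not measured here).
def estimated_publish_seconds(n_files, call_ms=450, calls_per_file=2, gap_ms=150):
    per_file_ms = calls_per_file * call_ms + gap_ms
    return n_files * per_file_ms / 1000.0

print(estimated_publish_seconds(500) / 60)  # -> 8.75 minutes for 500 files
```

At ~1 second per file this lands in the same ballpark as the observed multi-minute publish times, supporting the view that the cost is dominated by the per-file registration calls.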

@qqmyers
Member

qqmyers commented Nov 12, 2018

@kcondon - one other thought. The Apache HTTP client libraries have default limits on the number of connections (overall and to a given host); requests beyond that limit just queue up, which might explain the slowdown with multiple concurrent publications...
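If that's the cause, a small queueing model shows the expected effect. This is a toy sketch, not Dataverse code: `wall_time_seconds` is hypothetical, and the cap of 2 reflects my understanding that Apache HttpClient's pooling connection manager defaults to 2 connections per route (worth verifying against the version Dataverse ships).

```python
# Toy model of a per-host connection cap. Each publisher issues its HTTP
# calls sequentially, so n concurrent publishers offer at most n
# simultaneous requests; a cap of `cap` connections to the DataCite host
# limits effective concurrency to min(n, cap).
def wall_time_seconds(n_publishers, files_per_dataset, secs_per_file, cap):
    total_work = n_publishers * files_per_dataset * secs_per_file
    effective_concurrency = min(n_publishers, cap)
    return total_work / effective_concurrency

one = wall_time_seconds(1, 500, 1.0, cap=2)
four = wall_time_seconds(4, 500, 1.0, cap=2)
print(four / one)  # -> 2.0: four concurrent publishers, but only 2x throughput
```

Under this model, once the number of concurrent publications exceeds the cap, wall time grows linearly with the number of publishers, which is consistent with the proportional slowdown reported above (though the observed 4x+ suggests the effective concurrency in production may be even lower than 2).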

@kcondon
Contributor Author

kcondon commented Nov 13, 2018

@qqmyers Thanks, I will look into it. It sounds promising.

@kcondon kcondon changed the title Publish Dataset: Publishing dataset with large number of files via DataCite takes too long. Performance: Publishing dataset with large number of files via DataCite takes too long. Sep 15, 2021
@mreekie mreekie moved this to ▶SPRINT- NEEDS SIZING in IQSS Dataverse Project Feb 13, 2023
@mreekie mreekie added the D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27 label Feb 13, 2023
@mreekie

mreekie commented Mar 14, 2023

Sizing:

Next steps:

@mreekie

mreekie commented Mar 14, 2023

Closing this issue - but this is referenced for testing in #9272
