
Performance: Publishing dataset with large number of files via DataCite takes too long. #5283

Closed
kcondon opened this issue Nov 6, 2018 · 7 comments
Labels
D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27 Feature: Performance & Stability

Comments

@kcondon
Contributor

kcondon commented Nov 6, 2018

A user in production had 100 datasets, each with 500 files. They had problems publishing them, eventually ending up with permanently locked datasets, likely due to a server restart somewhere during the process.

This became a support issue (RT 268791) in which a script had to be written to delete the locks. While benchmarking publication of these datasets, I observed that a single dataset takes 24 minutes, 2 concurrent datasets take roughly 2x longer, and 4 datasets take 4x+ longer. Performance appears to degrade linearly with the number of PIDs.

This is a great benchmarking case because all the datasets are the same size.

Questions:

  1. Why is publishing performance for a single 500-file dataset so poor against DataCite (6x slower than EZID)?
  2. Why does publishing n datasets concurrently take n times longer?

DataCite performance was improved in a recent community PR by @qqmyers; while a big improvement, it would likely not fully address performance at this scale.

@qqmyers
Member

qqmyers commented Nov 6, 2018

@kcondon - why "would likely not address this level"? The duplicate calls were a factor of 2, and keeping a single connection rather than renegotiating SSL for each call was also significant. Are you thinking there's another factor of ~2? Or is there something else going on?

FWIW: I just tried a test publication (10.5072) with 1000 files on our small dev machine and it took ~11 minutes, almost all of it in the per-file for loop as expected. If it's linear, that's around 5 minutes for 500 files, which would be close to what you saw for EZID. (The time is all in the idServiceBean.publicizeIdentifier(df) calls, but I have not checked whether the time is in the HTTPS calls or in generating the datacite.xml file to send... if it's the latter, it might help explain the concurrent performance.)

@kcondon
Contributor Author

kcondon commented Nov 7, 2018

@qqmyers
I need to make more careful measurements to be precise, but I was first responding to the production performance issue (see the two issues) and acknowledging your contribution to performance; based on my rough recollection of a ~40% improvement, I was not sure whether it addressed all of the issues. Your performance improvement is significant and will help a lot. What worried me is that publishing time appears directly proportional to the number of concurrent publications, i.e. concurrent registration does not yield higher throughput. That means one user publishing 100 datasets would clog up the works for a user with a single dataset to publish. I was mostly trying to understand that aspect, but the 6x performance hit versus EZID also seemed fairly outsized, and I was not sure your changes addressed that part in full.

Does that clarify my thinking, or at least my response? I will do a head-to-head comparison with 500 files on both EZID and DataCite, with single and multiple datasets. It just takes time to do, and this was a side issue I encountered, as I'm sure you can relate.

@qqmyers
Member

qqmyers commented Nov 7, 2018

@kcondon - no worries. FWIW, I just did a quick test and the two calls to DataCite (postMetadata, postUrl) are taking ~380-600 ms each, averaging maybe 450 ms, with ~100-200 ms between calls (the time to process a file, create the metadata doc, etc.).
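As a rough sanity check, those per-call figures can be turned into a back-of-envelope publish-time estimate. This is a sketch only: `estimated_publish_seconds` is a hypothetical helper, and the millisecond values are the averages quoted above, assumed constant per file.

```python
# Back-of-envelope estimate of publish time for an n-file dataset,
# assuming two sequential DataCite calls per file at ~450 ms each,
# plus ~150 ms of local processing between calls (figures from the
# timings quoted above; not measured here).
def estimated_publish_seconds(n_files, call_ms=450, calls_per_file=2, gap_ms=150):
    per_file_ms = calls_per_file * call_ms + gap_ms
    return n_files * per_file_ms / 1000.0

print(estimated_publish_seconds(500) / 60)  # -> 8.75 minutes for 500 files
```

At ~1 second per file this lands in the same ballpark as the observed multi-minute publish times, supporting the view that the cost is dominated by the per-file registration calls.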

@qqmyers
Member

qqmyers commented Nov 12, 2018

@kcondon - one other thought. The Apache HTTP client libraries have default limits on the number of connections (overall and to a given host); requests beyond that limit just queue up, which might explain the slowdown with multiple concurrent publications...
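If that's the cause, a small queueing model shows the expected effect. This is a toy sketch, not Dataverse code: `wall_time_seconds` is hypothetical, and the cap of 2 reflects my understanding that Apache HttpClient's pooling connection manager defaults to 2 connections per route (worth verifying against the version Dataverse ships).

```python
# Toy model of a per-host connection cap. Each publisher issues its HTTP
# calls sequentially, so n concurrent publishers offer at most n
# simultaneous requests; a cap of `cap` connections to the DataCite host
# limits effective concurrency to min(n, cap).
def wall_time_seconds(n_publishers, files_per_dataset, secs_per_file, cap):
    total_work = n_publishers * files_per_dataset * secs_per_file
    effective_concurrency = min(n_publishers, cap)
    return total_work / effective_concurrency

one = wall_time_seconds(1, 500, 1.0, cap=2)
four = wall_time_seconds(4, 500, 1.0, cap=2)
print(four / one)  # -> 2.0: four concurrent publishers, but only 2x throughput
```

Under this model, once the number of concurrent publications exceeds the cap, wall time grows linearly with the number of publishers, which is consistent with the proportional slowdown reported above (though the observed 4x+ suggests the effective concurrency in production may be even lower than 2).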

@kcondon
Contributor Author

kcondon commented Nov 13, 2018

@qqmyers Thanks, I will look into it. It sounds promising.

@kcondon kcondon changed the title Publish Dataset: Publishing dataset with large number of files via DataCite takes too long. Performance: Publishing dataset with large number of files via DataCite takes too long. Sep 15, 2021
@mreekie mreekie moved this to ▶SPRINT- NEEDS SIZING in IQSS Dataverse Project Feb 13, 2023
@mreekie mreekie added the D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27 label Feb 13, 2023
@mreekie

mreekie commented Mar 14, 2023

Sizing:

Next steps:

@mreekie

mreekie commented Mar 14, 2023

Closing this issue - but this is referenced for testing in #9272
