Performance: Publishing dataset with large number of files via DataCite takes too long. #5283
Comments
@kcondon - why "would likely not address this level"? The duplicate calls were a factor of 2, and keeping a single connection rather than renegotiating SSL was also significant. Are you thinking there's another factor of ~2, or is there something else going on? FWIW: I just tried a test publication (10.5072) with 1000 files on our small dev machine and it took ~11 minutes, almost all of it in the file for loop, as expected. If it's linear, that's around 5 minutes for 500, which would be close to what you saw for EZID. (The time is all in the idServiceBean.publicizeIdentifier(df) calls, but I have not checked whether the time is in the https calls or in generating the datacite.xml file to send. If it's the latter, it might help explain the concurrent performance.)
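For reference, the linear extrapolation in the comment above can be written out as a small sketch. It uses only the figures quoted there (1000 files in ~11 minutes); the function name is hypothetical, and the linearity itself is the assumption being tested, not an established fact.

```python
def extrapolate_minutes(measured_files, measured_minutes, target_files):
    """Scale a measured publish time linearly with the number of files.

    Assumes the cost is dominated by a fixed per-file cost, which is
    the hypothesis stated in the comment, not a verified property.
    """
    per_file_minutes = measured_minutes / measured_files
    return per_file_minutes * target_files


# Figures quoted in the thread: 1000 files took ~11 minutes.
per_file_seconds = 11 * 60 / 1000
estimate_500 = extrapolate_minutes(1000, 11, 500)

print(f"{per_file_seconds:.2f} s per file, ~{estimate_500:.1f} min for 500 files")
```

Under that assumption the per-file cost is roughly two thirds of a second, so 500 files would land between 5 and 6 minutes.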
@qqmyers Does that clarify my thinking, or at least my response? I will do a head-to-head comparison with 500 files on both EZID and DataCite, with single and multiple datasets. It just takes time to do that, and this was a side issue I encountered, as I'm sure you can relate.
@kcondon - no worries. FWIW, I just did a quick test and the two calls to DataCite (postMetadata, postUrl) are taking ~380-600 ms, averaging maybe 450 ms, with ~100-200 ms between calls being sent (the time to process a file, create the metadata doc, etc.).
@kcondon - one other thought. The Apache libraries have default limits on the number of connections (overall and to a given host). Trying more than that number will just queue them up, which might explain the slowdown with multiple publications...
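To make the queueing idea concrete, here is a rough back-of-the-envelope model, not a measurement of Dataverse or of any Apache client. Every number is an assumption: `call_s=0.45` reads the ~450 ms above as a per-call figure, `gap_s=0.15` is the midpoint of the 100-200 ms gap, and `max_connections=2` is a placeholder for whatever per-host limit the HTTP client actually has configured. The function name is hypothetical.

```python
def publish_wall_time_s(publications, files_each, call_s=0.45,
                        calls_per_file=2, gap_s=0.15, max_connections=2):
    """Lower-bound wall time (seconds) for concurrent publications.

    Each publication registers its files one after another, and all
    publications share a pool of at most `max_connections` connections
    to the same host. All defaults are illustrative assumptions.
    """
    # One publication on its own: calls plus per-file processing, serially.
    serial = files_each * (calls_per_file * call_s + gap_s)
    # All publications together: total connection time spread over the pool.
    pooled = publications * files_each * calls_per_file * call_s / max_connections
    return max(serial, pooled)


for n in (1, 2, 4, 8):
    minutes = publish_wall_time_s(n, 500) / 60
    print(f"{n} concurrent publication(s) of 500 files: ~{minutes:.1f} min")
```

In this model the wall time stays flat while the number of concurrent publications is at or below the connection limit and grows linearly beyond it, so where the slowdown starts would depend on the real limit in use.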
@qqmyers Thanks, I will look into it. It sounds promising.
Closing this issue, but it is referenced for testing in #9272.
A user in production had 100 datasets, each with 500 files. They had problems publishing them, eventually ending up with permanently locked datasets, likely due to a server restart somewhere during the process.
This became a support issue (RT 268791), where a script needed to be written to delete the locks. While benchmarking publishing these datasets, I observed that a single dataset takes 24 minutes, 2 datasets take roughly 2x longer, and 4 datasets take 4x+ longer. It seems performance degrades linearly with the number of PIDs.
This is a great use case because all the datasets are the same size.
Questions:
Performance for DataCite was improved in a recent community PR by @qqmyers and, while a big improvement, would likely not address this level.