File Upload: Detection of duplicate files via md5 found in an existing file is confusing to users. #2955
Comments
https://help.hmdc.harvard.edu/Ticket/Display.html?id=232612 was marked resolved with "Our system looks at the MD5 checksum rather than the file name to determine duplicates." I'm not sure what development work should be done here (or that this is actually a bug). Should the error explain how the check is performed? Passing back to @posixeleni for feedback. |
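To illustrate the behaviour described in that ticket response, here is a minimal sketch of checksum-based duplicate detection. It is not Dataverse's actual implementation; the function names `md5_hex` and `find_duplicate`, and the sample file names, are hypothetical and chosen only for illustration. It shows why a file can be rejected even though no file with the same name is visible in the dataset.

```python
import hashlib
from typing import Dict, Optional


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def find_duplicate(existing: Dict[str, bytes], new_data: bytes) -> Optional[str]:
    """Return the name of an existing file whose content matches new_data.

    The comparison uses only the checksum; file names are ignored.
    (Hypothetical helper, not part of Dataverse.)
    """
    new_sum = md5_hex(new_data)
    for name, data in existing.items():
        if md5_hex(data) == new_sum:
            return name
    return None


# A dataset that already holds one file.
dataset = {"survey_2015.csv": b"id,score\n1,42\n"}

# The user uploads the same bytes under a different name.
upload_name = "results_final.csv"
upload_data = b"id,score\n1,42\n"

match = find_duplicate(dataset, upload_data)
if match is not None:
    # Naming the matching file makes the rejection easier to understand.
    print(f"{upload_name} has the same content as existing file {match}")
```

An error message that names the matching file, as in the last line, would be one way to explain how the check is performed.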
One of our partners encountered this issue in our test environment. The error occurs when both the content (as determined by the checksum, apparently) and the timestamp match an existing file, but it was unclear to him whether uploading multiple files with the same content should still be an error when the timestamps differ. I would say that some clarity about "acceptable duplication" would be welcome. |
MD5 checksums are not considered sufficient for certain use cases, because it is too easy to create different files with the same checksum. Researchers might not be trying to do this, but you might want to know about it. If Dataverse doesn't accept 'duplicate' files for storage reasons (e.g. because files on the file system are named using their checksum and you would get a name clash), you could try to link many |
This issue is exactly what I was describing and wondering about ("why") in the Dataverse User Community google group (on March 16th):
Edit: Here's the link to the post above: https://groups.google.com/d/msg/dataverse-community/FLnm8-60sOs/gNByf3c4CAAJ |
Duplicate file detection is a feature that was developed in #357 and if it was designed in a suboptimal way it might be worth looking at that issue for the reasoning behind the feature. |
Made this a suggestion since it works as expected but is confusing to users. We may revisit how we detect and communicate this. |
See also #3571. |
I agree with @landreev's comment at #3571 (comment) that we should move toward closing this issue, so I'm parking it in Development until the pull request for #2290 is made and we can associate this issue with it. |
Much improved, closing. |
A user reported a problem uploading files: no duplicates appeared in the UI, but they got an error saying "This file already exists in this dataset. Please upload another file."
See RT https://help.hmdc.harvard.edu/Ticket/Display.html?id=232612