File Upload: Detection of duplicate files via md5 found in an existing file is confusing to users. #2955

Closed
posixeleni opened this issue Feb 16, 2016 · 9 comments

@posixeleni
Contributor

posixeleni commented Feb 16, 2016

A user reported a problem uploading files: no duplicates appeared in the UI, but they got an error saying "This file already exists in this dataset. Please upload another file."

See RT https://help.hmdc.harvard.edu/Ticket/Display.html?id=232612

@pdurbin
Member

pdurbin commented Feb 22, 2016

https://help.hmdc.harvard.edu/Ticket/Display.html?id=232612 was marked resolved with "Our system looks at the MD5 checksum rather than the file name to determine duplicates."

I'm not sure what development work should be done here (or that this is actually a bug). Should the error explain how the check is performed? Passing back to @posixeleni for feedback.
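For readers following along, here is a minimal sketch of what a checksum-based duplicate check might look like, and of an error message that says how the match was made. This is not Dataverse's actual code; the function names and the message wording are hypothetical, and files are modeled as an in-memory dict of name to bytes.

```python
import hashlib
from typing import Dict, Optional


def md5_hex(content: bytes) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(content).hexdigest()


def find_duplicate(existing: Dict[str, bytes], new_content: bytes) -> Optional[str]:
    """Return the name of an existing file whose MD5 matches new_content, if any.

    File names are ignored on purpose: only the content digest is compared.
    """
    new_digest = md5_hex(new_content)
    for name, content in existing.items():
        if md5_hex(content) == new_digest:
            return name
    return None


def upload_message(existing: Dict[str, bytes], new_name: str, new_content: bytes) -> str:
    """Build a (hypothetical) message that names the file the upload collided with."""
    match = find_duplicate(existing, new_content)
    if match is None:
        return f"'{new_name}' accepted."
    return (
        f"'{new_name}' has the same content (MD5 checksum) as the existing "
        f"file '{match}', so it was not uploaded."
    )
```

Naming the matching file in the message is one way to address the confusion in the ticket, where the user saw no file with the same name in the UI.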

@bencomp
Contributor

bencomp commented Feb 25, 2016

One of our partners encountered this issue in our test environment. The error occurs when both the content (apparently as determined by the checksum) and the timestamp match an existing file, but it was unclear to him whether multiple files with the same content should also be rejected when their timestamps differ.

I would say that some clarity about "acceptable duplication" is welcome.

@bencomp
Contributor

bencomp commented Mar 16, 2016

MD5 checksums are not considered sufficient for certain use cases, because it is too easy to create different files with the same checksum. Researchers might not be trying to do this, but you might want to know about it.
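As a rough illustration of the alternative, a digest helper can take the algorithm as a parameter so that a stronger hash such as SHA-256 can be swapped in. This is only a sketch using Python's standard `hashlib`; the helper name `stream_digest` and the default chunk size are assumptions, not anything Dataverse provides.

```python
import hashlib
import io
from typing import BinaryIO


def stream_digest(stream: BinaryIO, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    hasher = hashlib.new(algorithm)
    # iter() with a sentinel keeps reading until read() returns empty bytes.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


# Usage: the same content hashed with two different algorithms.
sha256_value = stream_digest(io.BytesIO(b"example content"))
md5_value = stream_digest(io.BytesIO(b"example content"), algorithm="md5")
```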

If Dataverse doesn't accept 'duplicate' files for storage reasons (e.g. because files on the file system are named using their checksum and you would get a name clash), you could try to link many DataFiles to a single object in storage instead.
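To make the "many DataFiles, one stored object" idea concrete, here is a small in-memory sketch. Whether Dataverse names stored files by checksum is only a guess in the comment above, so treat everything here as hypothetical: the `ContentStore` class, its two dicts, and the use of SHA-256 as the storage key are all assumptions for illustration.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Toy content-addressed store: each distinct content is kept once."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}  # digest -> content, stored once
        self.files: Dict[str, str] = {}    # logical file name -> digest

    def add(self, name: str, content: bytes) -> str:
        """Register a logical file; reuse the stored blob if the content is known."""
        digest = hashlib.sha256(content).hexdigest()
        # setdefault leaves an existing blob untouched, so there is no name clash.
        self.blobs.setdefault(digest, content)
        self.files[name] = digest
        return digest

    def read(self, name: str) -> bytes:
        """Return the content behind a logical file name."""
        return self.blobs[self.files[name]]
```

With this layout, two file names with identical content both succeed and share one blob. Note that the sketch trusts the digest completely: two different contents with the same digest would silently resolve to the first one stored, which is why the choice of hash matters.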

@shlake
Contributor

shlake commented Mar 25, 2016

This issue is exactly what I was describing and wondering about ("why") in the Dataverse User Community google group (on March 16th):

Restating my concern over duplicate file flagging (actually not just "flagging" but preventing upload).

I am thinking that I am not in the business of questioning why a researcher has "duplicate" files with different file names in their dataset. So is there any workaround for Dataverse to accept these files?

If a researcher has two files that just happen to contain the same information (the same checksum), I don't think that should stop that file from being uploaded, maybe flagged??. There may be a reason for different filenames w/ same content (such as: used in a script as part of analysis - where the title of the file is important to the script and thus would be important for transparency and understanding of the methodology).
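The "flag, don't block" behavior suggested here could be sketched roughly as follows. This is an illustration only, not how Dataverse behaves: the function name `add_with_warning` is hypothetical, and a dataset is modeled as a plain dict of file name to bytes.

```python
import hashlib
from typing import Dict, List


def add_with_warning(dataset: Dict[str, bytes], name: str, content: bytes) -> List[str]:
    """Store the file regardless, and return the names of files with identical content.

    An empty list means no other file in the dataset has the same MD5 checksum.
    """
    digest = hashlib.md5(content).hexdigest()
    same_content = [
        other
        for other, data in dataset.items()
        if other != name and hashlib.md5(data).hexdigest() == digest
    ]
    dataset[name] = content  # the upload always goes through
    return same_content
```

A caller could show the returned names as a warning next to the new file, leaving the decision about whether the duplication is intentional to the researcher.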

Thanks for listening and welcome feedback and comments.

Edit: Here's the link to the post above: https://groups.google.com/d/msg/dataverse-community/FLnm8-60sOs/gNByf3c4CAAJ

@pdurbin
Member

pdurbin commented Mar 28, 2016

Duplicate file detection is a feature that was developed in #357 and if it was designed in a suboptimal way it might be worth looking at that issue for the reasoning behind the feature.

@kcondon kcondon changed the title from "Unable to Upload Non-Duplicate Files" to "File Upload: Detection of duplicate files via md5 found in an existing file is confusing to users." Apr 19, 2016
@kcondon kcondon added the "Type: Suggestion" (an idea) label and removed the "Type: Bug" (a defect) label Apr 19, 2016
@kcondon
Contributor

kcondon commented Apr 19, 2016

Made this a suggestion since it works as expected but is confusing to users. We may revisit how we detect and communicate this.

@pdurbin
Member

pdurbin commented Jan 13, 2017

See also #3571.

@pdurbin
Member

pdurbin commented Jan 18, 2017

I agree with @landreev's comment at #3571 (comment) that we should move toward closing this issue, so I'm parking it in Development until the pull request for #2290 is made so we can associate this issue with it.

@kcondon
Contributor

kcondon commented Jan 26, 2017

Much improved, closing.

6 participants