File Upload - allow files with same MD5 (or other checksum) in a dataset #4813

Closed
jggautier opened this issue Jul 5, 2018 · 29 comments · Fixed by #6924

Comments

@jggautier
Contributor

If someone uploads a file to a dataset that already contains a file with the same content (both files have the same MD5), Dataverse shows an error and doesn't allow the "duplicate" file to be uploaded.


Issues with this feature have been discussed in another GitHub issue (#2955, closed when File Replace was released in Dataverse 4.6.1), in Dataverse's Google Group here and here, and in a recent Dataverse support ticket, where a depositor wrote that "for uploading shape files for two different polygons but the same projection, it might be nice to be able to upload both at the same time." For this researcher, a common workaround, uploading the file in different double-zipped archive files (7-Zip, tar file, etc.), won't work because the journal policy doesn't allow depositors to upload archived files.
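
To make the check described above concrete, here is a minimal Python sketch that groups files by MD5 digest and reports any group with more than one member. It is only an illustration, not Dataverse's implementation (Dataverse itself is a Java application); the function names `md5_of` and `find_same_content`, the file names, and the file contents are all hypothetical.

```python
import hashlib
from collections import defaultdict


def md5_of(content: bytes) -> str:
    """Return the hex MD5 digest of a file's bytes."""
    return hashlib.md5(content).hexdigest()


def find_same_content(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents share an MD5 digest.

    `files` maps a file name to its bytes. Only groups with two or
    more names are returned, each sorted by name.
    """
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[md5_of(content)].append(name)
    return sorted(sorted(names) for names in by_digest.values() if len(names) > 1)


# Made-up example: two .prj files for different polygons that carry
# the same projection metadata, plus one unrelated file.
example = {
    "parcels.prj": b"same projection metadata",
    "roads.prj": b"same projection metadata",
    "parcels.shp": b"geometry for parcels",
}
duplicates = find_same_content(example)
```

In this sketch the two `.prj` files end up in one group even though they are meant for different polygons, which is the situation the depositor ran into.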

@oscardssmith
Contributor

oscardssmith commented Jul 5, 2018

Is the issue here a hash collision or something else? Otherwise, wouldn't the different polygons have different hashes?

@jggautier
Contributor Author

I'm not sure. The files in question for that support ticket are .prj files, defined here as:

an optional file that contains the metadata associated with the shapefiles coordinate and projection system.

So I think the content is the same, but they're meant for different polygons. (It looks like the journal allowed the author to upload the second file in a zip file.)

@djbrooke
Contributor

  • Since we have file hierarchy now (and code deposit in the future), we should re-evaluate this.

@pdurbin
Member

pdurbin commented Jul 18, 2019

There's some related discussion in the "Add a checkbox to disable unzipping" issue at #3439 (comment) and below.

Also, as we think about pulling in files from GitHub (#2739 and #5372), we should consider that identical files are somewhat common. There could be identical .gitignore files, for example. It's also common in Python projects to have empty __init__.py files (which all have the same checksum, of course) all over: https://docs.python.org/3/tutorial/modules.html#packages . Geoconnect has a bunch of these:

[Screenshot]
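
The point about empty `__init__.py` files can be checked with a few lines of Python. This is just a sketch using the standard `hashlib` module; the file paths are hypothetical and not taken from Geoconnect.

```python
import hashlib

# MD5 depends only on the bytes, not on the file name or path,
# so every empty file yields the same digest.
empty_digest = hashlib.md5(b"").hexdigest()

# Hypothetical package layout with two empty __init__.py files.
init_files = {
    "app/__init__.py": b"",
    "app/utils/__init__.py": b"",
}
digests = {path: hashlib.md5(data).hexdigest() for path, data in init_files.items()}

# True when every file in the mapping has the same checksum.
all_identical = len(set(digests.values())) == 1
```

Any checksum-based duplicate check would therefore flag the second empty file, no matter which folder it lives in.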

@shlake
Contributor

shlake commented Aug 13, 2019

Here's another example of the need to have two files with same content, but different filenames (3D files at UVA):

I have 3 different 3D files that I need to upload. The files include dependencies, for example, the OBJ format includes a .obj (geometry), .jpg (texture) and .mtl (material file that connects the texture with the geometry). I have 2 different OBJ files I need to upload to my dataset. Each has a .jpg texture file that is the same file but named differently. This is necessary for the OBJ file to be able to connect to its specific texture file. DV won't let me upload both files because it thinks the other is the same file even though it is named differently. I need to have both files uploaded.

@pdurbin
Member

pdurbin commented Aug 13, 2019

@shlake great real world, non-code example. Thanks!

@mheppler changed the title from "As a researcher, I need to publish a dataset that contains files with the same content, which are handled differently" to "Duplicate Files - publish a dataset that contains files with the same content, which are handled differently" on Jan 16, 2020
@djbrooke changed the title from "Duplicate Files - publish a dataset that contains files with the same content, which are handled differently" to "Allow files with same MD5" on Jan 24, 2020
@djbrooke changed the title from "Allow files with same MD5" to "In a dataset, allow files with same MD5" on Jan 24, 2020
@djbrooke
Contributor

May make sense to discuss at the same time as #6574.

@scolapasta
Contributor

Technically, removing this check should be straightforward. A couple of questions:
Should we remove it completely, or only when the files are in different folders (i.e. with different paths)?

From the above, it seems like removing it completely might cover all use cases, but then we do lose something.

In its place, should there be a warning in the UI? ("Note: this file has the same md5 as another file in this dataset")

For the API, we would either not warn or need to implement functionality similar to what we do for move Dataset, where we return the warnings and require an extra parameter of "force=true". This seems more problematic with file upload, though, since we wouldn't want the user to have to re-upload.

Another alternative for either UI or API could be to have this warning be on publish.
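
To illustrate the "force=true" option and why it is awkward for uploads, here is a small Python sketch. It assumes a hypothetical in-memory `Dataset` and a hypothetical `add_file` function; neither is part of the Dataverse API, and the response shape is invented for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Maps file name -> MD5 hex digest of the stored content.
    files: dict[str, str] = field(default_factory=dict)


def add_file(dataset: Dataset, name: str, content: bytes, force: bool = False) -> dict:
    """Store a file, warning when another file has the same MD5.

    Without force=True a same-content upload is not stored, so the
    caller has to send the bytes a second time to go ahead.
    """
    digest = hashlib.md5(content).hexdigest()
    same = sorted(n for n, d in dataset.files.items() if d == digest)
    warnings = [f"{name} has the same MD5 as {other}" for other in same]
    if warnings and not force:
        return {"status": "WARNING", "stored": False, "warnings": warnings}
    dataset.files[name] = digest
    return {"status": "OK", "stored": True, "warnings": warnings}


ds = Dataset()
first = add_file(ds, "a_texture.jpg", b"pixels")
second = add_file(ds, "b_texture.jpg", b"pixels")              # warned, not stored
forced = add_file(ds, "b_texture.jpg", b"pixels", force=True)  # sent again, stored
```

The second and third calls carry the same bytes, which is the re-upload cost mentioned above. Accepting the file and returning the warning alongside a successful response, or deferring the warning to publish, would avoid that second transfer.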

@shlake
Contributor

shlake commented Feb 3, 2020

In UVa's case above, the "duplicate" is in the same directory. So I vote "no" to just check if in a different folder.

I like a "warning" message, versus a "stop - you can't do that" (where the file doesn't get uploaded).

@mheppler
Contributor

mheppler commented Feb 3, 2020

Definitely a UI warning message confirmation popup at the time of upload. Similar to how we warn users in the file replace workflow if the new file is a different type than the original, where we ask the user if they want to continue or not.

@djbrooke
Contributor

djbrooke commented Feb 3, 2020

Thanks @mheppler for offering to include a mockup here so that we can bring it into a sprint soon.

@mheppler
Contributor

mheppler commented Feb 4, 2020

File Type Different popup...

[Screenshot]

Duplicate File popup...

[Screenshot]

@mheppler mheppler removed their assignment Feb 4, 2020
@TaniaSchlatter
Member

For the duplicate file, do we know that the file is a duplicate, or might it be a file with the same name?

@mheppler
Contributor

mheppler commented Feb 4, 2020

If it's checking MD5 or other checksums (SHA, et al), it is the same file, contents and all.

@shlake
Contributor

shlake commented Feb 4, 2020

@mheppler but a file with same content (same MD5) could have a different filename. So would there need to be a different popup for that? I see two types of duplicates: one with same filename & same content AND one with different filename & same content.
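
The two kinds of duplicates described here could be told apart roughly as in the Python sketch below. The `DuplicateKind` enum and `classify` function are hypothetical names made up for this illustration, and matching on the bare file name (ignoring folders) is an assumption.

```python
import hashlib
from enum import Enum


class DuplicateKind(Enum):
    NONE = "no file with the same content"
    SAME_NAME_SAME_CONTENT = "same filename and same content"
    DIFFERENT_NAME_SAME_CONTENT = "different filename and same content"


def classify(existing: dict[str, bytes], name: str, content: bytes) -> DuplicateKind:
    """Compare an incoming file against the files already in a dataset."""
    digest = hashlib.md5(content).hexdigest()
    # Names of existing files whose bytes hash to the same MD5.
    matches = [n for n, c in existing.items() if hashlib.md5(c).hexdigest() == digest]
    if not matches:
        return DuplicateKind.NONE
    if name in matches:
        return DuplicateKind.SAME_NAME_SAME_CONTENT
    return DuplicateKind.DIFFERENT_NAME_SAME_CONTENT


in_dataset = {"model_a.jpg": b"texture bytes"}
same_name = classify(in_dataset, "model_a.jpg", b"texture bytes")
other_name = classify(in_dataset, "model_b.jpg", b"texture bytes")
new_file = classify(in_dataset, "model_b.jpg", b"other bytes")
```

A popup could then pick its wording from the returned kind, so a renamed copy of a texture file gets a different message than a literal re-upload.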

@TaniaSchlatter
Member

If the user does not want to keep the file, at the time the popup is generated, is the system deleting the file, or canceling the upload/ingest?

sekmiller added a commit that referenced this issue Jul 14, 2020
sekmiller added a commit that referenced this issue Jul 16, 2020
sekmiller added a commit that referenced this issue Jul 16, 2020
sekmiller added a commit that referenced this issue Jul 16, 2020
sekmiller added a commit that referenced this issue Jul 17, 2020
sekmiller added a commit that referenced this issue Jul 20, 2020
sekmiller added a commit that referenced this issue Jul 28, 2020
sekmiller added a commit that referenced this issue Jul 28, 2020
@TaniaSchlatter
Member

TaniaSchlatter commented Jul 29, 2020

For documentation and QA:
Document of use cases and messages
