File Upload - Allow files of the same name within different directories of a dataset #6574
Comments
Updated the title based on a quick discussion with @jggautier. |
This should be straightforward, in that our name checking should be based not on the name alone but on the path + name. @djbrooke if path + name are the same, is it still appropriate to add the -1? (In any real upload, there can't be two files with the same name in the same directory.) |
@scolapasta makes sense that we should still add the -1 in cases of the same directory. Thanks. Moving to Ready. |
In an attempt to document the current state of affairs, I created a new draft dataset on demo, TEST ZIP UPLOADS, and updated the example Code Ocean capsule which I exported as a ZIP.
I wanted to document one funny behavior. When uploading three README files with three different names, where README.md and README copy.md have duplicate MD5s, the system accepted "README copy.md" but not "README.md". Presumably this was because it accepted the file it tried to load first, and the "copy" version displays first when sorting the files by name. |
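To illustrate the guess above, here is a minimal sketch of "first checksum wins" deduplication when files are processed in name order. It is not Dataverse's actual code: the function name `accept_uploads` and the file contents are hypothetical, chosen only to show why "README copy.md" would be kept ahead of "README.md".

```python
import hashlib


def accept_uploads(files):
    """Accept files in name order, rejecting any whose MD5 was already seen.

    `files` maps a file name to its bytes. Hypothetical helper, for illustration only.
    """
    seen = set()
    accepted, rejected = [], []
    for name in sorted(files):  # "README copy.md" sorts before "README.md" (space < dot)
        digest = hashlib.md5(files[name]).hexdigest()
        if digest in seen:
            rejected.append(name)
        else:
            seen.add(digest)
            accepted.append(name)
    return accepted, rejected


# Assumed contents: the first two files are byte-identical, the third differs.
files = {
    "README.md": b"same text",
    "README copy.md": b"same text",
    "README fake.md": b"different text",
}
accepted, rejected = accept_uploads(files)
```

Under these assumptions the file that sorts first is kept and its later duplicate is dropped, which matches the behavior described, though the real cause may differ.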
FWIW: An odd corner case - zip files can contain two files with the same path/dir. They won't if they are created from zipping up a file system, but one can use a zip library to do so (and having a source repository that allows duplicate names), so there is the potential to receive a zip that does this. (When I've unzipped one of these files, it does as Dataverse does and adds a -1 to the filename.) |
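For anyone who wants to reproduce this corner case, the following sketch builds such a ZIP in memory with Python's standard `zipfile` module. The entry name `docs/README.md` and the contents are made up for the example; `zipfile` emits a "Duplicate name" warning but still writes both entries.

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    with warnings.catch_warnings():
        # zipfile warns about the duplicate name but does not refuse to write it.
        warnings.simplefilter("ignore")
        zf.writestr("docs/README.md", "first version")
        zf.writestr("docs/README.md", "second version")

# Reading the archive back shows two entries with the same path/name.
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()
```

Writing `buf.getvalue()` to a file gives a ZIP that can be used to test how an upload handles two entries with an identical path.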
@mheppler Not sure what you did or how you tested, but the "-1" suffix does get added when I try it: |
@mheppler - in your example above, it does look like the upload accepted 2 files - "README copy.md" and "README fake.md" with the md5s that are already in the dataset (you may be right about it having something to do with it only dropping the first checksum duplicate it finds - it may be buggy). |
Here is what I found when looking into the inconsistent upload behavior for two different files, both named README.md, in two different directories.
DRAG + DROP EACH FILE ... ADDS "-1"
DRAG + DROP ZIP ... NO CHANGE TO FILE NAME
Then I found a bug where deleting files during upload does not reset the duplicate file name counter.
DRAG + DROP EACH FILE ... ADDS "-1" ... DELETE FILES + REPEAT UPLOAD
Then I tried to replicate the use case from @qqmyers. Without a ZIP library, I used a workaround that isn't the exact scenario but may be close: I created two different ZIPs with the same name, each containing a different README.md file, and the system again added "-1", as Jim reported.
DRAG + DROP TWO ZIP WITH SAME PATHS ... ADDS "-1"
I'll review these results with @scolapasta and @landreev, but I think the system is handling each of these cases as we'd expected. And here is another fun inconsistent use case discovered during tech hours discussions. This perfectly sums up the inconsistencies in this workflow.
DRAG + DROP EACH FILE ... ADDS "-1" ... SAVE ... SUCCESS ... EDIT TO REMOVE "-1" ... SAVE ... SUCCESS?!
Code for checking filename should be centralized so that all use cases call it - whether through UI, native API, or Sword, and considering zip vs individual files. |
A couple of extra notes from the developers: |
From discussion today, to add some clarity: when we refer to a unique name, we specifically mean a unique "path/name" combination; users should be able to upload two files with the same file name in different directories. The logic for the API and UI should be the same (and therefore the code for this test should be centralized, as mentioned in an above comment). If the path/name is not unique, we add a -1 (or -2, etc.). In the UI, the user will see the change before save. In the API, we accept the change (as it is easily reversible) but add a warning to the response (this same warning could be added to the UI to further highlight the change, if desired). |
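A rough sketch of what that centralized check could look like, written in Python purely for illustration (it is not the Dataverse implementation). The function name `resolve_upload_name`, the placement of the suffix before the extension, and the wording of the warning are all assumptions. The suffix is derived from the set of path/name combinations currently present rather than from a running counter, so deleting files frees their names again.

```python
import posixpath


def resolve_upload_name(existing, directory, name):
    """Return (final_name, warning) for a file uploaded into `directory`.

    `existing` is the set of "path/name" strings already in the dataset.
    Uniqueness is judged on path + name, never on the name alone.
    Hypothetical helper, for illustration only.
    """
    def key(candidate):
        return posixpath.join(directory, candidate) if directory else candidate

    if key(name) not in existing:
        return name, None  # same name in a different directory is fine

    stem, ext = posixpath.splitext(name)
    n = 1
    while key(f"{stem}-{n}{ext}") in existing:  # try -1, -2, ... until free
        n += 1
    final = f"{stem}-{n}{ext}"
    return final, f"'{key(name)}' already exists; file stored as '{key(final)}'"


existing = {"README.md", "code/README.md"}

# Different directory: name is kept, no warning.
kept, kept_warning = resolve_upload_name(existing, "data", "README.md")

# Same directory: a suffix is added and a warning is produced for the response.
renamed, renamed_warning = resolve_upload_name(existing, "code", "README.md")
```

Both the UI and the API paths would call the same function; the UI would show `renamed` before save, and the API would include `renamed_warning` in its response.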
Within a dataset, we currently add "-1" to any uploaded file that has the same name as an existing file. We'll need to change this as we integrate with more reproducibility tools and other systems, where scripts, code, etc. expect file names not to change.
Needs some thought, and may make sense to discuss at the same time as #4813.
Status (by @pdurbin). When "fixed" is indicated, it means in the branch below.
6574-filenames is the branch where fixes are being pushed. Here's the "compare" view: https://github.com/IQSS/dataverse/compare/6574-filenames