Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File Upload - Allow files of the same name within different directories of a dataset #6574

Closed
djbrooke opened this issue Jan 24, 2020 · 12 comments · Fixed by #6893
Closed

File Upload - Allow files of the same name within different directories of a dataset #6574

djbrooke opened this issue Jan 24, 2020 · 12 comments · Fixed by #6893

Comments

@djbrooke
Copy link
Contributor

djbrooke commented Jan 24, 2020

Within in a dataset, we currently add "-1" to any files uploaded that have the same name. We'll need to change this as we integrate with more reproducibility tools and other systems, where scripts, code, etc/ are expecting the names of files to not change.

Needs some thought, and may make sense to discuss at the same time as #4813.


Status (by @pdurbin). When "fixed" is indicated, it means in the branch below.

  • Bugs described below by @mheppler have been reproduced, specifically:
    • From the "edit metadata for files" page, you can give files the same name. Fixed.
    • If you delete some files while uploading, the counting is not reset so you'll get "-3" when it could have been "-2", for example.
  • Bugs in the API have been found:
    • It was possible to give rename files to have the same name. Fixed.
    • It was possible to "move" files to directory with a file that has the same name, creating a duplicate. Fixed.
  • API tests have been written:
    • uploadTwoFilesWithSameNameSameDirectory
    • uploadTwoFilesWithSameNameDifferentDirectories
    • renameFileToSameName
    • moveFileToDirectoryContainingSameFileName

6574-filenames is the branch where fixes are being pushed. Here's the "compare" view: https://github.com/IQSS/dataverse/compare/6574-filenames

@djbrooke djbrooke changed the title Allow files of the same name within a dataset Allow files of the same name within different directories of a dataset Jan 24, 2020
@djbrooke
Copy link
Contributor Author

Updated the title based on a quick discussion with @jggautier.

@scolapasta
Copy link
Contributor

This should be straightforward in that our name checking should be based not just on the name solely but on the path + name.

@djbrooke if path + name are the same, is it still appropriate to add the 1? (in any real upload, there can't be too files with the same name in the same directory)

@djbrooke
Copy link
Contributor Author

djbrooke commented Feb 3, 2020

@scolapasta makes sense that we should still add the -1 in cases of the same directory. Thanks. Moving to Ready.

@mheppler mheppler changed the title Allow files of the same name within different directories of a dataset File Upload - Allow files of the same name within different directories of a dataset Feb 4, 2020
@djbrooke djbrooke self-assigned this Feb 5, 2020
@djbrooke
Copy link
Contributor Author

djbrooke commented Feb 5, 2020

  • We should try this out to see the current behavior (this may be fixed?). Thanks @mheppler for volunteering
  • We should keep in mind that we want to create empty directories in the future (link issue)
  • In the previous implementation, we would append -1 in the file name of the upload widget, we just want to not do this for files in the same dataset directory
  • There may be ripple effects in other places (permissions page, search)

@djbrooke djbrooke added the Medium label Feb 5, 2020
@djbrooke djbrooke removed their assignment Feb 5, 2020
@mheppler mheppler self-assigned this Feb 5, 2020
@mheppler
Copy link
Contributor

mheppler commented Feb 5, 2020

In an attempt to document the current state of affairs, I created a new draft dataset on demo, TEST ZIP UPLOADS, and updated the example Code Ocean capsule which I exported as a ZIP. Here is what I've seen so far. I will test a little further and review with the team to make sure I am covering all our reproducibility use cases.

For the record, spoiler alert, I haven't seen any "-1" added to the file names yet.

I wanted to document one funny behavior. When uploading three README files, with three different names, but README.md and README copy.md are duplicate MD5's, the system accepted "README copy.md" but not "README.md". Presumably this was because it accpeted the file it tried to load first, and the "copy" version display first when sorting the files by name.

Screen Shot 2020-02-05 at 3 59 02 PM

Screen Shot 2020-02-05 at 3 59 30 PM

@qqmyers
Copy link
Member

qqmyers commented Feb 5, 2020

FWIW: An odd corner case - zip files can contain two files with the same path/dir. They won't if they are created from zipping up a file system, but one can use a zip library to do so (and having a source repository that allows duplicate names), so there is the potential to receive a zip that does this. (When I've unzipped one of these files, it does as Dataverse does and adds a -1 to the filename.)

@landreev
Copy link
Contributor

landreev commented Feb 5, 2020

@mheppler Not sure what you did or how you tested, but the "-1" suffix does get added when I try it:
Screen Shot 2020-02-05 at 4 40 56 PM

@landreev
Copy link
Contributor

landreev commented Feb 5, 2020

@mheppler - in your example above, it does look like the upload accepted 2 files - "README copy.md" and "README fake.md" with the md5s that are already in the dataset (you may be right about it having something to do with it only dropping the first checksum duplicate it finds - it may be buggy).
But what happens when you try to click save? Does it allow you to then save these 2 files, or rejects them?

@mheppler
Copy link
Contributor

mheppler commented Feb 7, 2020

Here is what I found when looking into the inconsistent upload behavior for two different files both named README.md, but in two different directories.

DRAG + DROP EACH FILE ... ADDS "-1"

  • README.md
  • README-1.md

DRAG + DROP ZIP ... NO CHANGE TO FILE NAME

  • folder2/README.md
  • folder1/README.md

Then found a bug where deleting files during upload does not reset the duplicate file name counter.

DRAG + DROP EACH FILE ... ADDS "-1" ... DELETE FILES + REPEAT UPLOAD

  • README-2
  • README-3

Then I tried to replicate the use case from @qqmyers. Without the use of a ZIP library I attempted to use a workaround that isn't the exact scenario but may be close. I created two different ZIPs of the same name, containing two different README.md files, and the system again added "-1" as Jim reported.

DRAG + DROP TWO ZIP WITH SAME PATHS ... ADDS "-1"

  • folder1/README.md
  • folder1/README-1.md

I'll review these results with @scolapasta and @landreev, but I think the system is handling each of these uses as we'd expected.

And here is another fun inconsistent use case discovered during tech hours discussions. This perfectly sums up the inconsistencies in this workflow.

DRAG + DROP EACH FILE ... ADDS "-1" ... SAVE... SUCCESS... EDIT TO REMOVE "-1"... SAVE... SUCCESS?!

  • README.md
  • README.md

@scolapasta
Copy link
Contributor

Code for checking filename should be centralized so that all use cases call it - whether through UI, native API, or Sword, and considering zip vs individual files.

@scolapasta
Copy link
Contributor

A couple of extra notes from the developers:
• we need to make sure to consider both upload and edit
• should we also add a warning here that let's the user know we changed the name of the file since we found a file with the same name? @mheppler

@pdurbin pdurbin self-assigned this Apr 22, 2020
@scolapasta
Copy link
Contributor

From discussion today to add some clarity:

When we are referring to unique name, we are specifically referring to the same "path/name" combination; users should be able to upload two files with the same file name in different directories.

The logic for the API and UI should be the same (and therefore the code to determine this test centralized, as mentioned in an above comment). If not path/name is not unique, we add a -1 (or -2, etc).

In the UI the user will see the change before save.

In the API, we accept the change (as it is easily reversible) but add the warning to the response (this same warning could be added to the UI to further highlight the change, if desired).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants