File Upload - Allow files of the same name within different directories of a dataset #6574

djbrooke · 2020-01-24T16:10:02Z

Within in a dataset, we currently add "-1" to any files uploaded that have the same name. We'll need to change this as we integrate with more reproducibility tools and other systems, where scripts, code, etc/ are expecting the names of files to not change.

Needs some thought, and may make sense to discuss at the same time as #4813.

Status (by @pdurbin). When "fixed" is indicated, it means in the branch below.

Bugs described below by @mheppler have been reproduced, specifically:
- From the "edit metadata for files" page, you can give files the same name. Fixed.
- If you delete some files while uploading, the counting is not reset so you'll get "-3" when it could have been "-2", for example.
Bugs in the API have been found:
- It was possible to give rename files to have the same name. Fixed.
- It was possible to "move" files to directory with a file that has the same name, creating a duplicate. Fixed.
API tests have been written:
- uploadTwoFilesWithSameNameSameDirectory
- uploadTwoFilesWithSameNameDifferentDirectories
- renameFileToSameName
- moveFileToDirectoryContainingSameFileName

6574-filenames is the branch where fixes are being pushed. Here's the "compare" view: https://github.com/IQSS/dataverse/compare/6574-filenames

The text was updated successfully, but these errors were encountered:

djbrooke · 2020-01-24T16:30:29Z

Updated the title based on a quick discussion with @jggautier.

scolapasta · 2020-02-03T15:11:10Z

This should be straightforward in that our name checking should be based not just on the name solely but on the path + name.

@djbrooke if path + name are the same, is it still appropriate to add the 1? (in any real upload, there can't be too files with the same name in the same directory)

djbrooke · 2020-02-03T16:08:35Z

@scolapasta makes sense that we should still add the -1 in cases of the same directory. Thanks. Moving to Ready.

djbrooke · 2020-02-05T19:55:36Z

We should try this out to see the current behavior (this may be fixed?). Thanks @mheppler for volunteering
We should keep in mind that we want to create empty directories in the future (link issue)
In the previous implementation, we would append -1 in the file name of the upload widget, we just want to not do this for files in the same dataset directory
There may be ripple effects in other places (permissions page, search)

mheppler · 2020-02-05T21:01:48Z

In an attempt to document the current state of affairs, I created a new draft dataset on demo, TEST ZIP UPLOADS, and updated the example Code Ocean capsule which I exported as a ZIP. ~~Here is what I've seen so far.~~ I will test a little further and review with the team to make sure I am covering all our reproducibility use cases.

~~For the record, spoiler alert, I haven't seen any "-1" added to the file names yet.~~

I wanted to document one funny behavior. When uploading three README files, with three different names, but README.md and README copy.md are duplicate MD5's, the system accepted "README copy.md" but not "README.md". Presumably this was because it accpeted the file it tried to load first, and the "copy" version display first when sorting the files by name.

qqmyers · 2020-02-05T21:38:05Z

FWIW: An odd corner case - zip files can contain two files with the same path/dir. They won't if they are created from zipping up a file system, but one can use a zip library to do so (and having a source repository that allows duplicate names), so there is the potential to receive a zip that does this. (When I've unzipped one of these files, it does as Dataverse does and adds a -1 to the filename.)

landreev · 2020-02-05T21:42:50Z

@mheppler Not sure what you did or how you tested, but the "-1" suffix does get added when I try it:

landreev · 2020-02-05T21:50:26Z

@mheppler - in your example above, it does look like the upload accepted 2 files - "README copy.md" and "README fake.md" with the md5s that are already in the dataset (you may be right about it having something to do with it only dropping the first checksum duplicate it finds - it may be buggy).
But what happens when you try to click save? Does it allow you to then save these 2 files, or rejects them?

mheppler · 2020-02-07T17:06:33Z

Here is what I found when looking into the inconsistent upload behavior for two different files both named README.md, but in two different directories.

DRAG + DROP EACH FILE ... ADDS "-1"

README.md
README-1.md

DRAG + DROP ZIP ... NO CHANGE TO FILE NAME

folder2/README.md
folder1/README.md

Then found a bug where deleting files during upload does not reset the duplicate file name counter.

DRAG + DROP EACH FILE ... ADDS "-1" ... DELETE FILES + REPEAT UPLOAD

README-2
README-3

Then I tried to replicate the use case from @qqmyers. Without the use of a ZIP library I attempted to use a workaround that isn't the exact scenario but may be close. I created two different ZIPs of the same name, containing two different README.md files, and the system again added "-1" as Jim reported.

DRAG + DROP TWO ZIP WITH SAME PATHS ... ADDS "-1"

folder1/README.md
folder1/README-1.md

I'll review these results with @scolapasta and @landreev, but I think the system is handling each of these uses as we'd expected.

And here is another fun inconsistent use case discovered during tech hours discussions. This perfectly sums up the inconsistencies in this workflow.

DRAG + DROP EACH FILE ... ADDS "-1" ... SAVE... SUCCESS... EDIT TO REMOVE "-1"... SAVE... SUCCESS?!

README.md
README.md

scolapasta · 2020-03-05T20:45:58Z

Code for checking filename should be centralized so that all use cases call it - whether through UI, native API, or Sword, and considering zip vs individual files.

scolapasta · 2020-03-05T20:52:24Z

A couple of extra notes from the developers:
• we need to make sure to consider both upload and edit
• should we also add a warning here that let's the user know we changed the name of the file since we found a file with the same name? @mheppler

scolapasta · 2020-04-28T17:44:36Z

From discussion today to add some clarity:

When we are referring to unique name, we are specifically referring to the same "path/name" combination; users should be able to upload two files with the same file name in different directories.

The logic for the API and UI should be the same (and therefore the code to determine this test centralized, as mentioned in an above comment). If not path/name is not unique, we add a -1 (or -2, etc).

In the UI the user will see the change before save.

In the API, we accept the change (as it is easily reversible) but add the warning to the response (this same warning could be added to the UI to further highlight the change, if desired).

djbrooke mentioned this issue Jan 24, 2020

File Upload - allow files with same MD5 (or other checksum) in a dataset #4813

Closed

djbrooke changed the title ~~Allow files of the same name within a dataset~~ Allow files of the same name within different directories of a dataset Jan 24, 2020

mheppler added the Feature: File Upload & Handling label Feb 4, 2020

mheppler changed the title ~~Allow files of the same name within different directories of a dataset~~ File Upload - Allow files of the same name within different directories of a dataset Feb 4, 2020

djbrooke removed the Feature: File Upload & Handling label Feb 5, 2020

djbrooke self-assigned this Feb 5, 2020

djbrooke added the Medium label Feb 5, 2020

djbrooke removed their assignment Feb 5, 2020

mheppler self-assigned this Feb 5, 2020

djbrooke unassigned mheppler Feb 12, 2020

mheppler added the Feature: File Upload & Handling label Feb 12, 2020

mheppler mentioned this issue Feb 13, 2020

Cannot upload files without extensions #6557

Closed

pdurbin self-assigned this Apr 22, 2020

pdurbin added a commit that referenced this issue Apr 29, 2020

prevent filename conflicts via file metadata API #6574

41ab1c4

pdurbin added a commit that referenced this issue Apr 29, 2020

add test: files not renamed in diff dirs on upload #6574

15d18dd

pdurbin added a commit that referenced this issue Apr 30, 2020

centralize duplicate file check with related methods #6574

c3e9275

pdurbin added a commit that referenced this issue May 1, 2020

prevent filename duplicates from "edit metadata for files" page #6574

35d07a4

pdurbin added a commit that referenced this issue May 4, 2020

Merge branch 'develop' into 6574-filenames #6574

6ef84f8

pdurbin mentioned this issue May 4, 2020

6574 filenames #6893

Merged

pdurbin removed their assignment May 4, 2020

pdurbin added a commit that referenced this issue May 6, 2020

consolidate duplicate code #6574

8d6a5c8

pdurbin added a commit that referenced this issue May 6, 2020

include new DuplicateFilesIT in API test suite #6574

dc77ac1

pdurbin added a commit that referenced this issue May 6, 2020

Merge branch 'develop' into 6574-filenames #6574

e79280b

kcondon closed this as completed in #6893 May 7, 2020

jsheunis mentioned this issue Jun 10, 2022

Documenting experience with Dataverse v4 issues datalad/datalad-dataverse#18

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File Upload - Allow files of the same name within different directories of a dataset #6574

File Upload - Allow files of the same name within different directories of a dataset #6574

djbrooke commented Jan 24, 2020 •

edited by pdurbin

Loading

djbrooke commented Jan 24, 2020

scolapasta commented Feb 3, 2020

djbrooke commented Feb 3, 2020 •

edited

Loading

djbrooke commented Feb 5, 2020

mheppler commented Feb 5, 2020 •

edited

Loading

qqmyers commented Feb 5, 2020

landreev commented Feb 5, 2020

landreev commented Feb 5, 2020

mheppler commented Feb 7, 2020 •

edited

Loading

scolapasta commented Mar 5, 2020

scolapasta commented Mar 5, 2020

scolapasta commented Apr 28, 2020

File Upload - Allow files of the same name within different directories of a dataset #6574

File Upload - Allow files of the same name within different directories of a dataset #6574

Comments

djbrooke commented Jan 24, 2020 • edited by pdurbin Loading

djbrooke commented Jan 24, 2020

scolapasta commented Feb 3, 2020

djbrooke commented Feb 3, 2020 • edited Loading

djbrooke commented Feb 5, 2020

mheppler commented Feb 5, 2020 • edited Loading

qqmyers commented Feb 5, 2020

landreev commented Feb 5, 2020

landreev commented Feb 5, 2020

mheppler commented Feb 7, 2020 • edited Loading

scolapasta commented Mar 5, 2020

scolapasta commented Mar 5, 2020

scolapasta commented Apr 28, 2020

djbrooke commented Jan 24, 2020 •

edited by pdurbin

Loading

djbrooke commented Feb 3, 2020 •

edited

Loading

mheppler commented Feb 5, 2020 •

edited

Loading

mheppler commented Feb 7, 2020 •

edited

Loading