Display deposited (rather than ingested) copy of tabular files #7956

Open
adam3smith opened this issue Jun 20, 2021 · 21 comments
Labels
Component: JSF Involves modifying JSF (Jakarta Server Faces) code, which is being replaced with React. Feature: File Upload & Handling Type: Suggestion an idea User Role: Depositor Creates datasets, uploads data, etc.

Comments

@adam3smith
Contributor

This feature request comes out of the discussion on data curation at the 2021 DV community meeting:

Current behavior
When an ingestable tabular file is deposited (.xlsx, .sav, .dta), the default download format (and the displayed file extension) is that of the ingested .tab version of the file. The original file format is available from the File access menu, together with file-level metadata and the explorer tools.
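
For readers who want to compare the two copies outside the UI, here is a minimal sketch of how the download URLs differ. It assumes the Data Access API's `format=original` query parameter; the helper name `datafile_url` and the file id `42` are hypothetical, and the code only builds strings without contacting any server.

```python
from urllib.parse import urlencode


def datafile_url(server, file_id, original=False):
    """Build a Data Access API download URL for a data file.

    Assumption: without parameters the API serves the ingested .tab copy
    of a tabular file, and format=original requests the deposited file.
    """
    url = f"{server.rstrip('/')}/api/access/datafile/{file_id}"
    if original:
        url += "?" + urlencode({"format": "original"})
    return url


# Hypothetical file id, for illustration only.
print(datafile_url("https://demo.dataverse.org", 42))
print(datafile_url("https://demo.dataverse.org", 42, original=True))
```

The point of the request below is essentially to flip which of these two is treated as the default in the UI.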

Suggested behavior
I suggest making the deposited file format the default download format, with .tab (or .tsv, as it should be called ;)) available through the File access menu.

Rationale
There are several reasons deposited file formats are preferable:

  1. The default display of the ingested file is confusing for depositors, as @amberleahey noted during the discussion.
  2. Frequently, deposit formats are richer than the extracted .tab. E.g., Excel files may have additional rich text formatting, which makes them easier to read than their plain text counterparts.
  3. In some cases, ingest can cause data loss (e.g. for Excel files with multiple tabs, undesirable as those may be). Defaulting to the deposited format somewhat mitigates this, even though it is still problematic.

On a more theoretical level, in the terminology of the OAIS reference model, we clearly have the SIP (the deposited file) and the AIP (the archived/preservation copy) defined, and the question is which of the two is the better DIP. I would argue that it is the deposited file, which is the more commonly usable and often richer data format. That is not just the case for Excel, but also for things like .sav files, which include rich metadata that reads nicely not just into SPSS but also into tools like R with appropriate packages.

cc @sbarbosadataverse who was also part of this discussion

@pdurbin
Member

pdurbin commented Jun 21, 2021

There was a similar conversation about the "download all" button in #4000.

Since many repositories include code that expects data files to be in a particular format, it's frustrating that dataverse defaults to downloading data files as .tab.

IMHO, the default should be the original file format, with options for all the others.

This was fixed for "download all" in pull request #4979.

@amberleahey

I would support renaming this to .tsv if it is in fact not distinct (.tab vs. .tsv), but does the Dataverse software extract metadata and structure the .tab differently than a .tsv?

Regarding the display of the .tab for end users: some users do get confused and think their original data is gone or given less prominence, so the improvement @pdurbin mentioned above is great! Perhaps it's just a matter of improving the labelling of the .tab file in the file listing? Could it be moved to the bottom of the list? Could it get a label such as 'Preservation copy'?

@TaniaSchlatter
Member

"original format" is listed first now:
[Image: Screen Shot 2021-06-22 at 2 08 34 PM]

@adam3smith
Contributor Author

(the .tsv vs. .tab discussion is in #6006, but just to re-iterate, these are literally the same format. If you want Excel to open a .tab file, you change the extension to .tsv)
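
To illustrate the point about the two extensions, here is a minimal sketch showing that a tab-delimited parser only cares about the content, not the file name. The sample rows are made up, and the helper name `read_tab_delimited` is hypothetical.

```python
import csv
import io

# Made-up sample content; the same bytes could be saved as data.tab or data.tsv.
content = "id\tname\n1\talpha\n2\tbeta\n"


def read_tab_delimited(text):
    """Parse tab-delimited text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


rows = read_tab_delimited(content)
print(rows)
```

Nothing in the parsing step depends on the extension; the extension mainly affects which application the operating system picks to open the file.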

@amberleahey

@TaniaSchlatter could the .tab also appear lower in the main file listing window (or have something to differentiate it from other original files?) not just in the download file access window?

@TaniaSchlatter
Member

TaniaSchlatter commented Jun 22, 2021

@amberleahey, my quick response is that I see this as an opportunity for an automatic file tag, "Original Format" which users could filter on to change the order in the file table.

@BPeuch
Contributor

BPeuch commented Aug 9, 2021

Sounds like this issue can be closed now, doesn't it?

The distinction between original format and .tab in the dropdown menu is very handy!

@adam3smith
Contributor Author

adam3smith commented Aug 9, 2021

@BPeuch - no, the main request here has not been implemented, although I think there's agreement for it.

While, as @TaniaSchlatter notes, the original format is (now? not sure this is new) listed above the tab/archival format in the dropdown, the default download/display format continues to be .tab. See e.g. on demo.dataverse here: https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/OXQWCP/DU8NM2&version=1.0

@sbarbosadataverse

sbarbosadataverse commented Aug 9, 2021 via email

@BPeuch
Contributor

BPeuch commented Aug 9, 2021

Oh my bad @adam3smith I did not realize this format was still prominent specifically in file webpages.

@adam3smith
Contributor Author

Same on the dataset page -- .tab is still the default display&download format in the list of files as well

@BPeuch
Contributor

BPeuch commented Aug 10, 2021

That is true. I thought the problem was only about the 'Download' button but I can see the arguments for highlighting the original format. Users who cannot reuse, say, SPSS files will know to look for alternatives (such as a .tsv output) either way.

@sbarbosadataverse

sbarbosadataverse commented Aug 27, 2021 via email

@pdurbin pdurbin added Type: Suggestion an idea Feature: File Upload & Handling User Role: Depositor Creates datasets, uploads data, etc. Component: JSF Involves modifying JSF (Jakarta Server Faces) code, which is being replaced with React. labels Oct 13, 2022
@qqmyers
Member

qqmyers commented Sep 28, 2023

I've been looking into this at QDR and it looks relatively straightforward to change the display in the file table, on the file page, and in the file citations, i.e. giving the name, size, checksum, and content type of the original. And the download menu has already been changed to show the original as the first option.

The one aspect of functionality I've run into so far where this change is somewhat problematic is in allowing the filename to be edited (for a given dataset version). Currently, what you change is the name of the ingested file (i.e. the *.tab version) and if you change the file extension, you are changing the extension of the tab file. The original file gets <newname>.<original extension>. The mimetype itself is not updated (and can't otherwise be changed) if you do update the extension.
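
As a rough illustration of the naming rule just described (not the actual Dataverse code), the sketch below derives the original file's name from an edited ingested name. It assumes `<newname>` means the edited name with its last extension stripped; the function name and sample file names are hypothetical.

```python
from pathlib import PurePosixPath


def original_name_after_rename(new_ingested_name, original_extension):
    """Return the name the original file would get after the ingested file is renamed.

    Assumed rule: <newname>.<original extension>, where <newname> is the
    edited name without its last extension.
    """
    stem = PurePosixPath(new_ingested_name).stem
    return f"{stem}.{original_extension.lstrip('.')}"


# Renaming the ingested file, with or without changing its extension,
# leaves the original's extension untouched under this assumption.
print(original_name_after_rename("survey_2021.tab", "sav"))
print(original_name_after_rename("survey_2021.txt", ".sav"))
```

Under this reading, an edited extension only ever lands on the ingested copy, which is why existing files with a changed .tab extension become the awkward legacy case.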

I think this could be changed so you edit the name of the original instead, but that would involve dealing with any existing files where the tab version has had its extension changed, which would be more work.

Alternately, things could be changed so you can't edit extensions. That seems like it could work for QDR, but may not be acceptable generally.

So, to move this forward: any thoughts are welcome on whether being able to change the extension of the original and/or the tab version is needed, and on how to address legacy data if we just flip to allowing the original file's extension to be changed.

@pdurbin
Member

pdurbin commented Sep 28, 2023

What options do we have for changing the filename and extension via API? If we forbid changing the extension via UI, can it be changed via API? Is there a workaround, I mean.

(By the way, it would be kind of cool if you could change the extension and have Dataverse redetect the mimetype.)

@DS-INRAE
Member

I created #10067 to split off the specific question of editing extensions, so that it can be discussed separately and at a more granular level there 😃

@qqmyers
Member

qqmyers commented Nov 21, 2024

FWIW: QDR did implement this, making the extension non-editable.

@vkush

vkush commented Dec 5, 2024

> Same on the dataset page -- .tab is still the default display&download format in the list of files as well

Since the download behaviour has already been extended, are there any updates on changing the display format to show the originally deposited file format (e.g. .csv) instead of .tab?

@pdurbin
Member

pdurbin commented Dec 5, 2024

@vkush this issue is still open so, no, the .tab is still always shown. Pull requests are welcome! ❤️

@qqmyers
Member

qqmyers commented Dec 5, 2024

If the way QDR implemented this is acceptable, I can make a PR of that (or offer it as a starting point, if that helps). Presumably this could/should have an SPA issue as well?

@pdurbin
Member

pdurbin commented Dec 5, 2024

@qqmyers please! ❤️

Projects
Status: Implemented at QDR
Development

No branches or pull requests

9 participants