Add originalFileName field to json #2734

Closed
evelynPM opened this issue Nov 11, 2015 · 12 comments · Fixed by #6774

Comments

@evelynPM

As per discussion at https://groups.google.com/forum/#!topic/dataverse-community/zsC4yltISS0: when a bundle is retrieved through the data access api, the json metadata contain two fields relating to the original file from which the .tab format is derived: "originalFileFormat" and "originalFormatLabel". Unless the bundle is unzipped and the package contents extracted, the filename of the original file has to be inferred from these fields. This is causing a few issues when we ingest content from Dataverse into Archivematica, because Archivematica needs to know the name of the original file before the unpackaging micro-service takes place.

@mercecrosas mercecrosas modified the milestone: In Review Nov 30, 2015
@scolapasta scolapasta modified the milestone: Not Assigned to a Release Jan 28, 2016
@donsizemore
Contributor

donsizemore commented Apr 20, 2017

Odum is running into this as well. I see our Dataverse instance storing the original filename as version 1 and the ingested/renamed file as version 2:

dvndb=> select * from filemetadata where label LIKE '%ERA21980b%';
   id    | description |     label     | restricted | version | datafile_id | datasetversion_id 
---------+-------------+---------------+------------+---------+-------------+-------------------
 3765067 |             | ERA21980b.DAT | f          |       1 |     7494454 |             29342
 3765068 |             | ERA21980b.tab | f          |       2 |     7494455 |             29342

These rows are tied to the datasetversion_id, however, while the native dataset metadata endpoint returns datafile_id.

My script for Thu-Mai returns all files in a dataset in original format but with modified file extensions. The automation of an original-format bundle download per dataset would save her a lot of time.

@pdurbin
Member

pdurbin commented Apr 20, 2017

@donsizemore it's interesting that the original filename is stored in the database at all. I was under the impression that the original filename is never stored. It should be.

@akio-sone
Contributor

@pdurbin I suspect that Don misidentified "ERA21980b.DAT" as an original data file such as a Stata .dta file; it is not a statistical data file. As you said, the original filename is not stored in the DB by design.

@landreev
Contributor

So, is this still something we want to add to the JSON metadata we output for Datafiles?

The problem makes perfect sense as described by the original requester. But then it sounds like they worked around it by inferring the full original filename from the information already in the JSON. I.e., if filename = "myfile.tab", and the originalFormatLabel = "Stata Binary", you can (unambiguously) assume that the original file in the bundle will have the name "myfile.dta". This is how the filename is generated in the application; once the file is ingested, the stored file name has the ".tab" extension. For the stored original that extension is modified on the fly based on the original type saved in the database.

If we were to add this extra field to the JSON output - "originalFileName" - it would take one extra line in JsonPrinter.java:

.add("originalFileName", FileUtil.replaceExtension(fileName, FileUtil.generateOriginalExtension(df.getOriginalFileFormat())));
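
For readers unfamiliar with the two helpers named in that line, here is a minimal, self-contained sketch of the idea. It is not the Dataverse implementation: the class name `OriginalNameSketch`, the method bodies, and the format-to-extension table (including the MIME type strings) are assumptions made for illustration. Only the Stata case, where "myfile.tab" becomes "myfile.dta", comes from the comment above.

```java
import java.util.Map;

// Hypothetical sketch; the names and the format table are illustrative
// assumptions, not the actual Dataverse FileUtil code.
final class OriginalNameSketch {

    // Assumed mapping from a stored original format to a file extension.
    private static final Map<String, String> EXTENSION_BY_FORMAT = Map.of(
            "application/x-stata", "dta",
            "application/x-spss-sav", "sav",
            "text/csv", "csv");

    // Returns the extension for a known format, or an empty string if the
    // format is null or not in the table.
    static String generateOriginalExtension(String originalFileFormat) {
        if (originalFileFormat == null) {
            return "";
        }
        return EXTENSION_BY_FORMAT.getOrDefault(originalFileFormat, "");
    }

    // Swaps the text after the last dot for newExtension; appends the
    // extension when the name has no dot.
    static String replaceExtension(String fileName, String newExtension) {
        if (newExtension.isEmpty()) {
            return fileName; // unknown format: leave the name untouched
        }
        int dot = fileName.lastIndexOf('.');
        String base = (dot > 0) ? fileName.substring(0, dot) : fileName;
        return base + "." + newExtension;
    }
}
```

Under these assumptions, `replaceExtension("myfile.tab", generateOriginalExtension("application/x-stata"))` yields `myfile.dta`. Note that a name rebuilt this way can only reproduce a canonical extension; it cannot recover whatever extension or letter case the uploader originally used.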

@pdurbin
Member

pdurbin commented Jun 26, 2017

@landreev cool. Sounds like an easy fix. Thanks.

@pdurbin
Member

pdurbin commented Jul 18, 2018

A comment was just added at #4044 (comment) about this: "we do not store the extension but recreate it by examining the mime type"

@pdurbin
Member

pdurbin commented Jul 24, 2019

Related: Use/Maintain Appropriate File Formats for Preservation and Reproducibility #6006

@djbrooke
Contributor

We need to address this as part of integrations with Code Ocean, Renku, and other tools that are expecting a specific file name (with a SpEcIfIc CaSe). This is a dependency for #6085. Thanks @pdurbin for highlighting in the design meeting.

@scolapasta scolapasta removed their assignment Feb 25, 2020
@djbrooke
Contributor

djbrooke commented Feb 25, 2020

  • Estimating as Medium based on the fact that we need to handle existing files
  • We should include this in the UI (download original)

@sekmiller
Contributor

Upon further discussion it was decided that for preservation purposes we should retain the original file name and extension as uploaded, rather than rely on the utility that converts content type to extension.

@sekmiller
Contributor

We are not going to load the new originalFileName field for existing files. We will use the file extension converter described by Leonid above in the JSON printer. That way we will preserve the fact that we did not actually save the original file name on upload.
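
The decision in this comment amounts to a simple fallback: use the stored name when there is one, and otherwise derive a name from the current label. The sketch below is a hypothetical illustration of that logic, not the committed change. The class `OriginalFileNameResolver`, its parameters, and the `Function` standing in for the extension converter are all assumptions.

```java
import java.util.function.Function;

// Hypothetical sketch of the fallback described above; not the Dataverse code.
final class OriginalFileNameResolver {

    /**
     * @param storedOriginalName name saved at upload time, or null/empty for
     *                           files ingested before the field existed
     * @param currentLabel       current (ingested) file name, e.g. "myfile.tab"
     * @param deriveFromLabel    stand-in for the extension converter
     */
    static String resolve(String storedOriginalName,
                          String currentLabel,
                          Function<String, String> deriveFromLabel) {
        if (storedOriginalName != null && !storedOriginalName.isEmpty()) {
            // Newer uploads: the exact name, extension, and case are preserved.
            return storedOriginalName;
        }
        // Older files: best-effort reconstruction from the current label.
        return deriveFromLabel.apply(currentLabel);
    }
}
```

Keeping the derivation behind a function parameter means the stored value is never overwritten by a guess, which matches the stated goal of not pretending the original name was saved for older files.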

sekmiller added a commit that referenced this issue Mar 26, 2020
@sekmiller sekmiller removed their assignment Mar 26, 2020
sekmiller added a commit that referenced this issue Mar 27, 2020
sekmiller added a commit that referenced this issue Mar 27, 2020