Add originalFileName field to json #2734

evelynPM · 2015-11-11T16:47:55Z

As per discussion at https://groups.google.com/forum/#!topic/dataverse-community/zsC4yltISS0: when a bundle is retrieved through the Data Access API, the JSON metadata contains two fields relating to the original file from which the .tab format is derived: "originalFileFormat" and "originalFormatLabel". Unless the bundle is unzipped and the package contents extracted, the filename of the original file has to be inferred from these fields. This is causing a few issues when we ingest content from Dataverse into Archivematica, because Archivematica needs to know the name of the original file before the unpackaging micro-service runs.

donsizemore · 2017-04-20T11:30:50Z

Odum is running into this as well. I see our Dataverse instance storing the original filename as version 1 and the ingested/renamed file as version 2:

dvndb=> select * from filemetadata where label LIKE '%ERA21980b%';
   id    | description |     label     | restricted | version | datafile_id | datasetversion_id 
---------+-------------+---------------+------------+---------+-------------+-------------------
 3765067 |             | ERA21980b.DAT | f          |       1 |     7494454 |             29342
 3765068 |             | ERA21980b.tab | f          |       2 |     7494455 |             29342

but tied to the datasetversion_id while the native dataset metadata endpoint returns datafile_id.

My script for Thu-Mai returns all files in a dataset in original format but with modified file extensions. The automation of an original-format bundle download per dataset would save her a lot of time.

pdurbin · 2017-04-20T13:02:01Z

@donsizemore it's interesting that the original filename is stored in the database at all. I was under the impression that the original filename is never stored. It should be.

akio-sone · 2017-04-20T18:26:06Z

@pdurbin I suspect that Don misidentified "ERA21980b.DAT" as an original data file such as a stata dta file; it is not a statistical data file. As you said, the original filename is not stored on the DB by design.

landreev · 2017-06-23T19:21:48Z

So, is this still something we want to add to the JSON metadata we output for Datafiles?

The problem makes perfect sense as described by the original requester. But then it sounds like they worked around it by inferring the full original filename from the information already in the JSON. I.e., if filename = "myfile.tab", and the originalFormatLabel = "Stata Binary", you can (unambiguously) assume that the original file in the bundle will have the name "myfile.dta". This is how the filename is generated in the application; once the file is ingested, the stored file name has the ".tab" extension. For the stored original that extension is modified on the fly based on the original type saved in the database.

If we were to add this extra field to the JSON output - "originalFileName" - it would take one extra line in JsonPrinter.java:

.add("originalFileName", FileUtil.replaceExtension(fileName, FileUtil.generateOriginalExtension(df.getOriginalFileFormat())));
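
For API clients that need the same inference on their side, here is a minimal, self-contained sketch of the idea described above. The class name, the method names, and the format-to-extension table are hypothetical stand-ins written for this example; they are not the Dataverse FileUtil implementation, and the real mapping may differ.

```java
import java.util.Map;

// Hypothetical stand-in for the helpers named in the comment above;
// this is NOT the Dataverse FileUtil code.
class OriginalFileNameSketch {

    // Illustrative mapping from original format (MIME type) to extension.
    // The entries are assumptions made for this example.
    private static final Map<String, String> EXTENSION_BY_FORMAT = Map.of(
            "application/x-stata", "dta",
            "application/x-spss-sav", "sav",
            "text/csv", "csv");

    // Returns the extension (without the dot) for a known format, or null.
    static String extensionFor(String originalFileFormat) {
        if (originalFileFormat == null) {
            return null; // Map.of(...) maps reject null keys, so guard first
        }
        return EXTENSION_BY_FORMAT.get(originalFileFormat);
    }

    // Swaps whatever follows the last dot for newExtension;
    // appends the extension if the name contains no dot.
    static String replaceExtension(String fileName, String newExtension) {
        int dot = fileName.lastIndexOf('.');
        String base = (dot == -1) ? fileName : fileName.substring(0, dot);
        return base + "." + newExtension;
    }

    // Infers the original name of an ingested file; returns the name
    // unchanged when the original format is unknown.
    static String inferOriginalFileName(String fileName, String originalFileFormat) {
        String extension = extensionFor(originalFileFormat);
        return (extension == null) ? fileName : replaceExtension(fileName, extension);
    }

    public static void main(String[] args) {
        // prints "myfile.dta"
        System.out.println(inferOriginalFileName("myfile.tab", "application/x-stata"));
    }
}
```

Note that a name derived this way carries whatever extension the table supplies, so the case and spelling the uploader used are not recoverable from it.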

pdurbin · 2017-06-26T00:27:15Z

@landreev cool. Sounds like an easy fix. Thanks.

pdurbin · 2018-07-18T16:00:56Z

A comment was just added at #4044 (comment) about this: "we do not store the extension but recreate it by examining the mime type"

pdurbin · 2019-07-24T07:43:19Z

Related: Use/Maintain Appropriate File Formats for Preservation and Reproducibility #6006

djbrooke · 2020-02-12T15:39:39Z

We need to address this as part of integrations with Code Ocean, Renku, and other tools that are expecting a specific file name (with a SpEcIfIc CaSe). This is a dependency for #6085. Thanks @pdurbin for highlighting in the design meeting.

djbrooke · 2020-02-25T20:13:16Z

Estimating as Medium based on the fact that we need to handle existing files.
We should include this in the UI (download original).

sekmiller · 2020-03-20T19:25:57Z

Upon further discussion it was decided that, for preservation purposes, we should retain the original file name and extension as uploaded, rather than relying on the utility that converts the content type to an extension.

sekmiller · 2020-03-26T19:34:03Z

We are not going to load the new originalFileName field for existing files. We will use the file extension converter described by Leonid above in the JSON printer. That way we will preserve the fact that we did not actually save the original file name on upload.
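
One way to read this decision, sketched below, is as a fallback: use the stored name when one was saved at upload, and otherwise use the name derived from the original format. This is an assumption about the intended behavior, not the merged code, and the class and method names are hypothetical.

```java
// Hypothetical sketch of the fallback; not the code merged in #6774.
class OriginalFileNameFallbackSketch {

    // storedOriginalName: the name saved at upload time; assumed to be null
    //   (or empty) for files ingested before the field existed.
    // derivedOriginalName: the name reconstructed from the current file name
    //   and the original format, as in Leonid's one-liner.
    static String originalFileNameForJson(String storedOriginalName,
                                          String derivedOriginalName) {
        if (storedOriginalName != null && !storedOriginalName.isEmpty()) {
            return storedOriginalName; // keeps the uploader's exact case
        }
        return derivedOriginalName;    // best effort for pre-existing files
    }

    public static void main(String[] args) {
        // New upload: the stored name wins, case preserved. Prints "Survey.DTA".
        System.out.println(originalFileNameForJson("Survey.DTA", "Survey.dta"));
        // Pre-existing file: nothing stored. Prints "Survey.dta".
        System.out.println(originalFileNameForJson(null, "Survey.dta"));
    }
}
```

Under this reading, tools that expect a specific case get the exact uploaded name for new files, while older files still get a usable derived name.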

mercecrosas modified the milestone: In Review Nov 30, 2015

mheppler added the Feature: API label Jan 28, 2016

scolapasta added Status: Triaged and removed Status: Dev labels Jan 28, 2016

scolapasta modified the milestone: Not Assigned to a Release Jan 28, 2016

pdurbin mentioned this issue Jun 23, 2017

Preserving the original file type for ingested tabular files is broken as of 4.6.2 #3952

Closed

pdurbin added Help Wanted: Code Mentor: pdurbin and removed Triaged labels Jun 26, 2017

pdurbin added the User Role: API User Makes use of APIs label Jul 4, 2017

pdurbin mentioned this issue Jul 18, 2018

Add support for directly ingesting tab-delimited files #4044

Closed

amberleahey mentioned this issue Oct 9, 2018

Archivematica Integration #5152

Closed

mheppler added Feature: File Upload & Handling and removed Feature: API Help Wanted: Code Mentor: pdurbin User Role: API User Makes use of APIs labels Feb 12, 2020

djbrooke assigned scolapasta Feb 13, 2020

djbrooke added the Medium label Feb 25, 2020

scolapasta removed their assignment Feb 25, 2020

djbrooke assigned pdurbin Mar 3, 2020

djbrooke unassigned pdurbin Mar 5, 2020

mheppler mentioned this issue Mar 5, 2020

Use/Maintain Appropriate File Formats for Preservation and Reproducibility #6006

Closed

sekmiller self-assigned this Mar 20, 2020

sekmiller added a commit that referenced this issue Mar 25, 2020

#2734 add field to table and write to it on ingest

b30dfaf

sekmiller added a commit that referenced this issue Mar 26, 2020

#2734 add originalFileName to json printer

98d3844

sekmiller added a commit that referenced this issue Mar 26, 2020

#2734 add orig file name to json printer test

07f208b

sekmiller mentioned this issue Mar 26, 2020

2734 preserve orig filename #6774

Merged

sekmiller added a commit that referenced this issue Mar 26, 2020

#2734 remove unused import

f1ed820

sekmiller removed their assignment Mar 26, 2020

sekmiller added a commit that referenced this issue Mar 27, 2020

#2734 streamline original file name replace

715fd00

sekmiller added a commit that referenced this issue Mar 27, 2020

#2734 update uningest with new method

786593e

sekmiller added a commit that referenced this issue Mar 27, 2020

#2734 update stored original file

c468519

sekmiller added a commit that referenced this issue Mar 27, 2020

#2734 return handling of one-off cases

75b21e3

kcondon closed this as completed in #6774 Apr 2, 2020
