Is a method to determine sysmeta formatId, mediaType needed? #77

gothub · 2017-04-11T16:25:55Z

When a script creates a SystemMetadata object (i.e. when a DataObject is created), the
sysmeta formatId must be specified.

Is it advisable to have a method that automatically determines the formatId by file
extension or file contents? This is an old problem, with known issues such as reliability.

Is it advisable to have an automated way to determine formatId, or to rely on the
user determining and specifying this?

mbjones · 2017-04-11T20:39:19Z

@amoeba has a function that tries to guess the formatId. It's nice when it works. It doesn't always work, so we've been discussing whether no default is better than an incorrect guess. Let's discuss further. Maybe Jeanette and Jesse have thoughts on this too.

amoeba · 2017-04-11T20:52:47Z

Yeah, guess_format_id uses a hard-coded map between D1 format IDs and file extensions: https://github.com/NCEAS/arcticdatautils/blob/master/R/util.R#L79. I threw in a custom routine for NetCDF files that uses the metadata to guess the specific NetCDF version but otherwise things are based on file extension alone.

There are limitations and even major issues:

- A file may be missing an extension, so a guessing routine may decide on a less specific format ID than the user intended.
- A file may be using a different file extension than expected (e.g. .txt for CSV/TSV), so a guessing routine may decide on a less specific format ID than the user intended.
- Some file extensions would need special handling routines (e.g. XML, NetCDF), which basically results in an arms race between this guessing routine and the D1 formats list: as we add formats to the CN formats list, this routine needs to be updated.
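
To make the extension-based approach and its first two limitations concrete, here is a minimal sketch in R. It is not the actual guess_format_id code: the name `guess_by_extension`, the `format_map` vector, and its three entries are assumptions chosen for illustration. It returns `NA` when the extension is missing or unknown, which leaves the "default versus no guess" decision to the caller.

```r
# Hypothetical extension-to-formatId map; the entries are illustrative only
# and would have to track the CN formats list in a real routine.
format_map <- c(
  csv = "text/csv",
  txt = "text/plain",
  xml = "text/xml"
)

# Return the mapped formatId, or NA when the extension is missing or unknown,
# so the caller can decide whether to fall back to a default or ask the user.
guess_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (!nzchar(ext) || !(ext %in% names(format_map))) {
    return(NA_character_)
  }
  unname(format_map[ext])
}

guess_by_extension("data/sites.CSV")  # matched through the lower-cased extension
guess_by_extension("data/README")     # no extension, so NA
```

A CSV file saved as `sites.txt` would come back as the less specific `text/plain` here, which is the second limitation in the list.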

From a user perspective, I have been told the guessing is nice but I don't personally feel like it's really necessary. If the format ID isn't guessed, I think giving users a useful mechanism in R to find the available values would be needed. e.g.,

> magicUploadFunction(my_path)
Error: You must specify the format_id argument when using magicUploadFunction. Run `formatsList()` to see a list of possible values.
> formatsList()
format_id                               Name        Type
eml://ecoinformatics.org/eml-2.0.0      EML 2.0.0   METADATA
eml://ecoinformatics.org/eml-2.1.0      EML 2.1.0   METADATA
eml://ecoinformatics.org/eml-2.1.1      EML 2.1.1   METADATA
text/csv                                CSV         DATA
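
The no-guessing alternative in the session above could look roughly like the following sketch. The names `upload_object` and `formats_list` are hypothetical stand-ins, not functions from any existing package, and the formats table is hard-coded here, whereas a real one would presumably be fetched from the CN.

```r
# Hypothetical stand-in for the formats list; a real one would come from the CN.
formats_list <- function() {
  data.frame(
    format_id = c("eml://ecoinformatics.org/eml-2.1.1", "text/csv"),
    name = c("EML 2.1.1", "CSV"),
    type = c("METADATA", "DATA"),
    stringsAsFactors = FALSE
  )
}

# Hypothetical upload wrapper: no guessing, fail early and point at valid values.
upload_object <- function(path, format_id = NULL) {
  if (is.null(format_id)) {
    stop("You must specify the format_id argument. ",
         "Run `formats_list()` to see a list of possible values.",
         call. = FALSE)
  }
  if (!(format_id %in% formats_list()$format_id)) {
    stop("Unknown format_id: ", format_id, call. = FALSE)
  }
  # The actual upload is omitted; return what would be recorded in sysmeta.
  list(path = path, format_id = format_id)
}

obj <- upload_object("data/sites.csv", format_id = "text/csv")
```

Calling `upload_object("data/sites.csv")` without a `format_id` stops with the message that points to `formats_list()`, so the user never ends up with a silently wrong value.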

gothub added the question label Apr 11, 2017

gothub added this to the 1.4.0 milestone Jun 29, 2017

gothub modified the milestones: 1.4.0, 1.5.0 Sep 25, 2020
