Is a method to determine sysmeta formatId, mediaType needed? #77

Open
gothub opened this issue Apr 11, 2017 · 2 comments

gothub commented Apr 11, 2017

When a script creates a SystemMetadata object (i.e. when a DataObject is created), the
sysmeta formatId must be specified.

Is it advisable to have a method that automatically determines the formatId from the file
extension or file contents? This is an old problem, with known issues such as reliability.

Is an automated way to determine the formatId preferable, or should we rely on the
user determining and specifying it?


mbjones commented Apr 11, 2017

@amoeba has a function that tries to guess the formatId. It's nice when it works. It doesn't always work, so we've been discussing whether no default is better than an incorrect guess. Let's discuss further. Maybe Jeanette and Jesse have thoughts on this too.


amoeba commented Apr 11, 2017

Yeah, guess_format_id uses a hard-coded map between D1 format IDs and file extensions: https://github.com/NCEAS/arcticdatautils/blob/master/R/util.R#L79. I threw in a custom routine for NetCDF files that uses the metadata to guess the specific NetCDF version but otherwise things are based on file extension alone.

There are limitations and even major issues:

  • A file may be missing an extension so a guessing routine may decide on a less specific format ID than the user might intend
  • A file may be using a different file extension than expected (e.g. .txt for CSV/TSV) so a guessing routine may decide on a less specific format ID than the user intended
  • Some file extensions would need special handling routines (e.g. XML, NetCDF), which basically results in an arms race between this guessing routine and the D1 formats list: as we add formats to the CN formats list, this routine needs to be updated
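
To make the first two limitations concrete, here is a minimal sketch of extension-only guessing. It is not the actual guess_format_id code from arcticdatautils; the function name guess_by_extension and the two-entry extension_map are hypothetical, and a real map would have to track the CN formats list.

```r
# Hypothetical, abbreviated map from file extension to format ID.
# A real map would need to be kept in sync with the CN formats list.
extension_map <- c(
  csv = "text/csv",
  txt = "text/plain"
)

# Guess a format ID from the file extension alone.
# Returns NA_character_ when the extension is missing or not in the map,
# so the caller can decide whether to fall back or ask the user.
guess_by_extension <- function(path) {
  ext <- tolower(tools::file_ext(path))
  if (!nzchar(ext) || !(ext %in% names(extension_map))) {
    return(NA_character_)
  }
  extension_map[[ext]]
}

guess_by_extension("samples.CSV")   # extension is matched case-insensitively
guess_by_extension("samples")       # no extension, so no guess
guess_by_extension("samples.txt")   # a CSV saved as .txt gets the generic text ID
```

Returning NA rather than a generic default keeps "no guess" distinguishable from "guessed", which leaves the no-default-versus-wrong-guess decision to the caller.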

From a user perspective, I have been told the guessing is nice but I don't personally feel like it's really necessary. If the format ID isn't guessed, I think giving users a useful mechanism in R to find the available values would be needed. e.g.,

> magicUploadFunction(my_path)
Error: You must specify the format_id argument when using magicUploadFunction. Run `formatsList()` to see a list of possible values.
> formatsList()
format_id                               Name        Type
eml://ecoinformatics.org/eml-2.0.0      EML 2.0.0   METADATA
eml://ecoinformatics.org/eml-2.1.0      EML 2.1.0   METADATA
eml://ecoinformatics.org/eml-2.1.1      EML 2.1.1   METADATA
text/csv                                CSV         DATA
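
A rough sketch of how the no-guessing behaviour in that session could be wired up follows. Both function names are the placeholders from the session above, not existing package functions. The hard-coded table stands in for a list that would presumably be fetched from the CN, and the upload step is stubbed out.

```r
# Placeholder formats table using only the rows shown above.
# In practice this would presumably be retrieved from the CN, not hard-coded.
formatsList <- function() {
  data.frame(
    format_id = c("eml://ecoinformatics.org/eml-2.0.0",
                  "eml://ecoinformatics.org/eml-2.1.0",
                  "eml://ecoinformatics.org/eml-2.1.1",
                  "text/csv"),
    Name = c("EML 2.0.0", "EML 2.1.0", "EML 2.1.1", "CSV"),
    Type = c("METADATA", "METADATA", "METADATA", "DATA"),
    stringsAsFactors = FALSE
  )
}

# Placeholder upload function: format_id is required and must be a known value.
magicUploadFunction <- function(path, format_id) {
  if (missing(format_id)) {
    stop("You must specify the format_id argument when using magicUploadFunction. ",
         "Run `formatsList()` to see a list of possible values.", call. = FALSE)
  }
  if (!(format_id %in% formatsList()$format_id)) {
    stop("Unknown format_id '", format_id, "'. ",
         "Run `formatsList()` to see a list of possible values.", call. = FALSE)
  }
  # The upload itself is out of scope here; return what would be sent.
  list(path = path, format_id = format_id)
}
```

Checking the supplied value against the list catches typos early, at the cost of the same maintenance burden: the list has to stay current with the CN.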

@gothub gothub added this to the 1.4.0 milestone Jun 29, 2017
@gothub gothub modified the milestones: 1.4.0, 1.5.0 Sep 25, 2020