When a script creates a SystemMetadata object (i.e. when a DataObject is created), the
sysmeta formatId must be specified.
Is it advisable to have a method that automatically determines the formatId by file
extension or file contents? This is an old problem, with known issues such as reliability.
Is it advisable to have an automated way to determine formatId, or to rely on the
user determining and specifying this?
@amoeba has a function that tries to guess the formatId. It's nice when it works, but it doesn't always work, so we've been discussing whether no default is better than an incorrect guess. Let's discuss further. Maybe Jeanette and Jesse have thoughts on this too.
Yeah, guess_format_id uses a hard-coded map between D1 format IDs and file extensions: https://github.com/NCEAS/arcticdatautils/blob/master/R/util.R#L79. I threw in a custom routine for NetCDF files that uses the metadata to guess the specific NetCDF version but otherwise things are based on file extension alone.
There are limitations and even major issues:
A file may be missing an extension, so a guessing routine may decide on a less specific format ID than the user intended.
A file may be using a different file extension than expected (e.g. .txt for CSV/TSV), so a guessing routine may decide on a less specific format ID than the user intended.
Some file extensions would need special handling routines (e.g. XML, NetCDF), which basically results in an arms race between this guessing routine and the D1 formats list: as we add formats to the CN formats list, this routine needs to be updated.
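To make those limitations concrete, here is a minimal sketch of an extension-only guesser. This is not the actual guess_format_id code: the function name `guess_format_id_sketch`, the four-entry `format_map`, and the choice of fallback are all assumptions made up for illustration, and a real map would have to track the CN formats list.

```r
# Hypothetical, deliberately tiny map from file extension to format ID.
# A real map would be much longer and would need updating as formats are added.
format_map <- c(
  csv = "text/csv",
  txt = "text/plain",
  xml = "text/xml",
  pdf = "application/pdf"
)

# Guess a format ID from the file extension alone (illustrative sketch).
# Returns `fallback` when the extension is missing or not in the map.
guess_format_id_sketch <- function(path, fallback = "application/octet-stream") {
  ext <- tolower(tools::file_ext(path))  # "" when the file has no extension
  if (nzchar(ext) && ext %in% names(format_map)) {
    return(unname(format_map[[ext]]))
  }
  fallback
}

guess_format_id_sketch("data.CSV")   # matched by extension, case-insensitively
guess_format_id_sketch("README")     # no extension, so the generic fallback
guess_format_id_sketch("table.txt")  # a CSV saved as .txt is still guessed as plain text
```

Passing `fallback = NA_character_` would give the "no default" behaviour discussed above, so the caller has to supply a format ID explicitly whenever the guess fails.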
From a user perspective, I have been told the guessing is nice but I don't personally feel like it's really necessary. If the format ID isn't guessed, I think giving users a useful mechanism in R to find the available values would be needed. e.g.,