
Dataset class should support encoding parameter to override global attribute #654

Closed · thehesiod opened this issue Apr 30, 2017 · 20 comments

@thehesiod (Contributor) commented Apr 30, 2017

In fact, the whole idea of having a global encoding property is a bad one, because it means you can't support multiple Datasets with different encodings. The real fix would be to deprecate netCDF4.encoding.

As an example, the MADIS mesonet files are encoded in what appears to be cp1252.
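
A minimal sketch of the problem, assuming the module-level default_encoding global mentioned later in this thread (file names are hypothetical):

```python
import netCDF4

# The decode encoding is a module-wide global (netCDF4.default_encoding),
# so changing it for one file silently affects every other open Dataset.
netCDF4.default_encoding = "cp1252"        # needed for the MADIS file
madis = netCDF4.Dataset("madis_meso.nc")   # hypothetical file name

netCDF4.default_encoding = "utf-8"         # needed for a second file...
other = netCDF4.Dataset("other.nc")
# ...but now any further string reads from `madis` decode as utf-8 too.
```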

@thehesiod (Contributor, Author)

If someone from the team can give feedback on how they'd like this achieved, I can work on it.

thehesiod changed the title from "Dataset class should support encoding parameter to override global param" to "Dataset class should support encoding parameter to override global attribute" on Apr 30, 2017
@jswhit (Collaborator) commented Apr 30, 2017

I suggest adding a set_encoding method to override the global value. There are already too many kwargs in Dataset.__init__. The new method should have a kwarg to optionally override the global value of unicode_error also.
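
A sketch of what that might look like; set_encoding is hypothetical here, and unicode_error mirrors the existing module global:

```python
from netCDF4 import Dataset

ds = Dataset("madis_meso.nc")  # hypothetical file
# Hypothetical method per the suggestion above: override the module-level
# defaults for this Dataset only, leaving other open Datasets untouched.
ds.set_encoding("cp1252", unicode_error="replace")
```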

@thehesiod (Contributor, Author) commented Apr 30, 2017

Not possible, because we need the encoding during __init__, for example when it calls _get_dims. Will look into unicode_error.

@thehesiod (Contributor, Author)

OK, added support for encoding_errors; looking into why the unit tests are failing.

@shoyer (Contributor) commented May 1, 2017

Does netCDF really support arbitrary encodings for strings but not have any way of indicating them in the data model? That seems like a disaster waiting to happen...

@thehesiod (Contributor, Author) commented May 1, 2017

From what I understand, yes! I ran into this with a MADIS mesonet dataset that had the string "Annœullin" in it (it contains the byte \x9c). From what I saw there was nothing specifying the encoding :( After some grepping around, it seemed this was most likely cp1252 from someone generating the files on a Windows box.

Some info: http://www.unidata.ucar.edu/software/netcdf/docs/BestPractices.html#bp_Strings-and-Variables-of-type-char

Basically, netCDF 3.x was not designed with a true "string" type; one was only added in netCDF 4.x. For a "self-describing" format this was a huge oversight.
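
To illustrate the difference with netCDF4-python (file name hypothetical): the classic approach stores fixed-width arrays of bytes with no encoding information, while netCDF-4 has a real variable-length string type.

```python
from netCDF4 import Dataset

nc = Dataset("example.nc", "w")            # NETCDF4 format by default
nc.createDimension("obs", None)
nc.createDimension("nchars", 16)

# netCDF 3.x style: fixed-width array of single bytes, encoding unspecified
char_var = nc.createVariable("station_char", "S1", ("obs", "nchars"))

# netCDF 4.x only: true variable-length string type (NC_STRING)
str_var = nc.createVariable("station_str", str, ("obs",))
str_var[0] = "Annœullin"
nc.close()
```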

@jswhit (Collaborator) commented May 2, 2017

Even with the netcdf-4 NC_STRING type, there is no concept of an encoding in the data model. It's just stored as a string of bytes in the file, and the client has to know how to decode it into a string. I can't find anything in the CF metadata standard that relates to string encoding, so there's no standard way for the client to figure out how to decode it. The python interface always returns a string, and the encoding is currently defined by a global module variable. I gather the purpose of pull request #655 is to at least allow the user to change the encoding on a per-Dataset basis, so multiple Datasets can be accessed at once with different encodings specified.

When it comes to names of variables, dimensions, attributes, groups, and types, netcdf-c always uses UTF-8 encoding.

@jswhit (Collaborator) commented May 2, 2017

Just noticed this at http://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html

Note on char data: Although the characters used in netCDF names must be encoded as UTF-8, character data may use other encodings. The variable attribute “_Encoding” is reserved for this purpose in future implementations

and here http://www.unidata.ucar.edu/software/netcdf/docs/netcdf_utilities_guide.html

The netCDF char type contains uninterpreted characters, one character per byte. Typically these contain 7-bit ASCII characters, but the character encoding is application specific. For this reason, applications writing data using the enhanced data model are encouraged to use the netCDF-4 string data type in preference to the char data type. Applications writing string data using the char data type are encouraged to add the special variable attribute "_Encoding" with a value that the netCDF libraries recognize. Currently those valid values are "UTF-8" or "ASCII", case insensitive.

which suggests that for NC_STRING variables (and attributes?) we should look for an attribute _Encoding.
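
Writing that attribute with the existing API is already possible; a minimal sketch:

```python
from netCDF4 import Dataset

nc = Dataset("example.nc", "w")
nc.createDimension("obs", None)
nc.createDimension("nchars", 16)
char_var = nc.createVariable("station_char", "S1", ("obs", "nchars"))
# The NUG reserves _Encoding for declaring how char data is encoded;
# the values it currently blesses are "UTF-8" and "ASCII".
char_var.setncattr("_Encoding", "UTF-8")
nc.close()
```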

@thehesiod (Contributor, Author) commented May 2, 2017

Hmm, we still need to support files which don't specify _Encoding. But it sounds like honoring _Encoding needs to be added to netcdf4-python? In my particular case there was no _Encoding attribute.

@jswhit (Collaborator) commented May 2, 2017

I doubt that it is really used much. Perhaps we should check for it though. @WardF or @DennisHeimbigner - if you get a chance, could you read through this thread and comment?

@thehesiod (Contributor, Author)

BTW, another thing: my PR applies the encoding in all the places default_encoding was used before. Based on what you found, it sounds like some things, such as attribute names, should always be UTF-8 with some restrictions (which I'm guessing aren't checked for in the Cython code).

@DennisHeimbigner (Collaborator)

You are correct: all netcdf names are assumed to be utf8, except that the character '/' is always disallowed. When names occur inside, say, a CDL file, certain other characters must be back-slash escaped.

The default encoding for strings (as opposed to characters) is utf8, and ncdump, for example, will assume that in the absence of any _Encoding attribute. The _Encoding attribute is currently ignored in the netcdf-c code (reminder to self: add an issue about at least recognizing it).

Character data is the real problem. Technically, it also defaults to utf8, which means the ASCII subset of utf8. In practice, characters can have any 8-bit pattern.

@shoyer (Contributor) commented May 2, 2017

For character data, ASCII with errors='surrogateescape' can be a good option for decoding into Python strings, when there is some possibility that somebody has stuffed arbitrary bytes in there. At least then, you can safely encode back into bytes and decode with the right encoding.
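
A quick demonstration of that round-trip property, using the \x9c byte from earlier in the thread:

```python
raw = b"Ann\x9cullin"                # cp1252 bytes; '\x9c' is 'œ'

# Lossless decode: unmappable bytes become surrogate code points.
s = raw.decode("ascii", errors="surrogateescape")

# Encoding back yields the original bytes exactly...
assert s.encode("ascii", errors="surrogateescape") == raw

# ...so once the true encoding is discovered, decode properly.
print(raw.decode("cp1252"))          # -> Annœullin
```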

@DennisHeimbigner (Collaborator)

One more point: character-typed and string-typed attributes must always be UTF-8, because there is no way to specify _Encoding for an attribute.

@ethanrd (Member) commented May 2, 2017

The _Encoding attribute was under discussion recently on the CF mailing list. @rsignell-usgs and I were just this morning discussing this topic and decided to create an issue on the Unidata/netcdf-c repo to update the NUG wording around the _Encoding attribute.

@thehesiod (Contributor, Author) commented May 2, 2017

Hah, as a side note: I just found that some MADIS mesonet files are NOT in cp1252, as they fail to decode with that encoding, so it seems the files are a mixture of encodings without specifying what those encodings are :(

Update: even worse, one file has garbage data that doesn't seem to be in any reasonable encoding... another idea, then, is adding encoding validation when setting string data.
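
That validation could be a simple encode check at write time; a hypothetical sketch:

```python
def validate_encoding(text: str, encoding: str) -> None:
    """Hypothetical write-time check: reject strings that are not
    representable in the encoding the file claims to use."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        raise ValueError(
            f"string {text!r} cannot be encoded as {encoding}"
        ) from err
```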

@jswhit (Collaborator) commented May 3, 2017

So to summarize...

  1. we should look for an _Encoding attribute for NC_STRING variable data and use it for decoding; otherwise, either use a default value or the value specified by a new set_encoding Dataset method.
  2. for decoding character data into python strings we should use ASCII with errors='surrogateescape' (i.e. in the chartostring utility function). In stringtochar we should encode using ascii.
  3. For names of variables, dimensions, attributes and groups, always use UTF-8.
  4. For attributes (either string or character) Dennis suggests we should always use UTF-8.

Does this sound reasonable?

For (1) do we really need a set_encoding method, or should we just rely on an _Encoding attribute and use UTF-8 if it's not there?

For (4), could we look for a Dataset or Variable _Encoding attribute and assume it applies to NC_STRING and NC_CHAR attributes?
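
For (1), the fallback chain could be as simple as this hypothetical helper:

```python
def get_string_encoding(var, default="utf-8"):
    # Honor the variable's _Encoding attribute when present; otherwise
    # fall back to a default (or a per-Dataset value, if we add one).
    try:
        return var.getncattr("_Encoding")
    except AttributeError:
        return default
```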

@thehesiod (Contributor, Author) commented May 3, 2017

Feedback on the parts affecting me:

  1. Prefer having an encoding __init__ parameter (see the sketch after this list), as the current code relies on it during initialization. It also makes more sense: I don't think you want to be able to change the encoding of a Dataset after it's been opened; logically it should be fixed once the file is open.
  2. Not sure what this means; I know of netCDF "classic" files that aren't UTF-8, and doing this will make consuming those files more difficult (you'd have to parse each string twice)... instead, simply pass an encoding parameter, or perhaps it could be named fallback_encoding (used when _Encoding is not specified).
  3 + 4. After doing this, I think it will greatly simplify where encoding is used.
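
The sketch for point 1, with a hypothetical signature (not what was ultimately merged):

```python
from netCDF4 import Dataset

# encoding is fixed at open time and available to __init__-era code like
# _get_dims; encoding_errors mirrors the unicode_error module global.
ds = Dataset("madis_meso.nc", encoding="cp1252", encoding_errors="replace")
```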

@jswhit (Collaborator) commented May 19, 2017

This issue is addressed by pull request #665, which adds detection of the _Encoding attribute and an encoding kwarg to chartostring.
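
For reference, the chartostring kwarg in use (the array construction here is just for demonstration):

```python
import numpy as np
from netCDF4 import chartostring

# An 'S1' char array holding UTF-8 bytes, as it might come off disk.
chars = np.frombuffer("Annœullin".encode("utf-8"), dtype="S1")

print(chartostring(chars, encoding="utf-8"))   # -> Annœullin
```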

@jswhit (Collaborator) commented May 19, 2017

Pull request #665 merged, closing for now.
