Dataset class should support encoding parameter to override global attribute #654
Comments
if someone from the team can give feedback on how they'd like this achieved I can work on this. |
I suggest adding a |
not possible because we need encoding during init, for example when it calls |
ok added support for |
Does netCDF really support arbitrary encodings for strings but not have any way of indicating them in the data model? That seems like a disaster waiting to happen... |
From what I understand, yes! I ran into this with a MADIS mesonet dataset that contained the string "Annœullin" (the œ is stored as the single byte \x9c). From what I saw, nothing in the file specifies the encoding :( After some grepping around, it seemed most likely to be CP1252, from someone generating the files on a Windows box. Basically, netCDF 3.x was not designed with a true "string" type; one was only added in netCDF 4.x. For a "self-describing" format this was a huge oversight. |
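To make the \x9c observation concrete, here is a minimal sketch using only the Python standard library (no netCDF file involved). The byte string is an assumed stand-in for what such a file might contain, not data copied from MADIS.

```python
# Assumed stand-in for the raw bytes of the station name in the file.
raw = b"Ann\x9cullin"

# CP1252 maps byte 0x9c to U+0153 (the "oe" ligature), giving the expected name.
as_cp1252 = raw.decode("cp1252")

# UTF-8 rejects the same bytes: 0x9c is a continuation byte with no lead byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Latin-1 never fails, but silently yields the C1 control character U+009C,
# so "it decoded without error" is not proof that the encoding guess is right.
as_latin1 = raw.decode("latin-1")
```

The Latin-1 case is the reason a guess has to be checked against the expected text and not just against the absence of an exception.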
Even with the netcdf-4 When it comes to names of variables, dimensions, attributes, groups, and types netcdf-c always uses UTF-8 encoding. |
Just noticed this at http://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html

"Note on char data: Although the characters used in netCDF names must be encoded as UTF-8, character data may use other encodings. The variable attribute “_Encoding” is reserved for this purpose in future implementations"

and here http://www.unidata.ucar.edu/software/netcdf/docs/netcdf_utilities_guide.html

"The netCDF char type contains uninterpreted characters, one character per byte. Typically these contain 7-bit ASCII characters, but the character encoding is application specific. For this reason, applications writing data using the enhanced data model are encouraged to use the netCDF-4 string data type in preference to the char data type. Applications writing string data using the char data type are encouraged to add the special variable attribute "_Encoding" with a value that the netCDF libraries recognize. Currently those valid values are "UTF-8" or "ASCII", case insensitive."

which suggests that for NC_STRING variables (and attributes?) we should look for an attribute |
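One way the lookup suggested by those docs could work is sketched below. It is an illustration under assumptions, not what netCDF4-python does: `resolve_encoding` and `decode_chars` are hypothetical helpers, and a plain dict stands in for a variable's attributes so the snippet runs without a netCDF file.

```python
import codecs


def resolve_encoding(attrs, default="utf-8"):
    """Pick an encoding from a variable's attributes (hypothetical helper).

    attrs is a plain dict standing in for the variable's attributes.
    Falls back to default when "_Encoding" is absent or not a codec
    that Python knows about.
    """
    name = attrs.get("_Encoding")
    if name is None:
        return default
    try:
        # codecs.lookup is case insensitive and returns the canonical name.
        return codecs.lookup(name).name
    except (LookupError, TypeError):
        return default


def decode_chars(raw, attrs, default="utf-8"):
    """Decode raw bytes using the resolved encoding (hypothetical helper)."""
    return raw.decode(resolve_encoding(attrs, default))
```

For files that carry no `_Encoding` attribute, the caller still has to supply the fallback, e.g. `decode_chars(b"Ann\x9cullin", {}, default="cp1252")`.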
hmm, still need to support files which don't specify the |
I doubt that it is really used much. Perhaps we should check for it though. @WardF or @DennisHeimbigner - if you get a chance, could you read through this thread and comment? |
btw another thing is my PR applies the encoding to all places the |
You are correct: all netcdf names are assumed to be utf8, except that the character '/' is always |
For character data, ASCII with |
One more point. Character typed and String typed attributes must always be UTF8 |
The _Encoding attribute was under discussion recently on the CF mailing list. @rsignell-usgs and I were just this morning discussing this topic and decided to create an issue on the Unidata/netcdf-c repo to update the NUG wording around the _Encoding attribute. |
hah, as a side-note, I just found that some MADIS mesonet files are NOT in cp1252, as they fail to decode with that encoding. So it seems the files are a mixture of encodings, with nothing specifying which is which :( Update: even worse, some contain garbage data that doesn't seem to be in any reasonable encoding... Another idea, then, is to add encoding validation when setting string data. |
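A rough sketch of what "encoding validation when setting string data" could look like, in plain Python. `validate_for_encoding` is a hypothetical name and nothing here is part of netCDF4-python; it only shows the round-trip check.

```python
def validate_for_encoding(value, encoding):
    """Return value as bytes, or raise if it does not fit the encoding.

    Hypothetical helper: str input must be encodable, bytes input must
    be decodable, so bad data is rejected at write time and not later
    when someone reads the file back.
    """
    if isinstance(value, bytes):
        value.decode(encoding)  # raises UnicodeDecodeError on bad bytes
        return value
    return value.encode(encoding)  # raises UnicodeEncodeError on bad text


def is_valid_for_encoding(value, encoding):
    """Boolean wrapper around validate_for_encoding (also hypothetical)."""
    try:
        validate_for_encoding(value, encoding)
    except UnicodeError:
        return False
    return True
```

Both `UnicodeDecodeError` and `UnicodeEncodeError` derive from `UnicodeError`, which is why the wrapper catches only that one class.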
So to summarize...
Does this sound reasonable? For (1) do we really need a For (4), could we look for a Dataset or Variable |
Feedback from parts affecting me
This issue is addressed by pull request #665, which adds detection of the |
pull request #665 merged, closing for now. |
In fact the whole idea of having a global encoding property is a bad idea because then you can't support multiple Datasets with different encodings. The real bug should be to deprecate netCDF4.encoding.
as an example, the MADIS meso files are encoded in what appears to be cp1252.
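To illustrate the request, here is a toy sketch of a per-instance encoding that overrides a module-wide default. `EncodedSource` and `DEFAULT_ENCODING` are made-up names for illustration only; this is not the netCDF4 Dataset API or its actual signature.

```python
# Stands in for a module-wide default such as a global encoding attribute.
DEFAULT_ENCODING = "utf-8"


class EncodedSource:
    """Hypothetical stand-in for a dataset-like object."""

    def __init__(self, raw_strings, encoding=None):
        # The encoding is fixed at construction time, since the object
        # may already need to decode text while it initializes.
        self.encoding = encoding or DEFAULT_ENCODING
        self._raw = list(raw_strings)

    def strings(self):
        """Decode every stored byte string with this instance's encoding."""
        return [raw.decode(self.encoding) for raw in self._raw]


# Two sources with different encodings can be open at the same time,
# which a single global setting cannot express.
madis_like = EncodedSource([b"Ann\x9cullin"], encoding="cp1252")
utf8_source = EncodedSource(["Ann\u0153ullin".encode("utf-8")])
```

With only a global setting, one of these two objects would have to decode its text with the wrong codec.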