
Conventions for string and character array encoding #402

rsignell-usgs opened this issue May 2, 2017 · 42 comments
@rsignell-usgs

rsignell-usgs commented May 2, 2017

As discussed here Unidata/netcdf4-python#654 (comment), there is a need for conventions to specify the encoding of strings and character arrays in netcdf.

There is also a need to specify whether char arrays in NetCDF3 contain strings or character arrays.

@BobSimons addressed these issues in an enhancement to the CF conventions that would specify charset for NetCDF3 and _Encoding for NetCDF4. The Unidata gang (@DennisHeimbigner, @WardF, @ethanrd and @cwardgar) agreed with the concept but suggested this be handled in the NUG, and we came up with this slightly different proposal that would still accomplish Bob's goal of making it easy for software to figure out what is stuffed in those char or string arrays!

Proposal:

  • Use _CharType variable attribute with allowed values ['STRING', 'CHAR_ARRAY'] to specify if a char array variable should be interpreted as a string or as an array of individual characters. If _CharType is missing, default is 'STRING'.
  • Use _Encoding variable attribute with allowed values ['ISO-8859-1', 'ISO-8859-15', 'UTF-8'] to specify the encoding. If _Encoding is missing for _CharType='STRING', default is 'UTF-8'. If _Encoding is missing for _CharType='CHAR_ARRAY', default is 'ISO-8859-15'.
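The reader-side logic implied by this proposal can be sketched in a few lines of Python. The function name and the plain `attrs` dict (standing in for a variable's netCDF attributes) are hypothetical; the attribute names and defaults come from the proposal above:

```python
# Sketch of reader-side logic for the proposed convention. `attrs` is a
# hypothetical stand-in for a netCDF variable's attribute dictionary.
def interpret_char_variable(attrs):
    char_type = attrs.get("_CharType", "STRING")  # default per the proposal
    if char_type == "STRING":
        encoding = attrs.get("_Encoding", "UTF-8")
    elif char_type == "CHAR_ARRAY":
        encoding = attrs.get("_Encoding", "ISO-8859-15")
    else:
        raise ValueError("unknown _CharType: %r" % char_type)
    return char_type, encoding

# A char variable with no attributes at all is read as a UTF-8 string:
print(interpret_char_variable({}))  # -> ('STRING', 'UTF-8')
```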
@ethanrd
Member

ethanrd commented May 2, 2017

Should an _Encoding attribute on a 'char' typed variable be restricted to a 7- or 8-bit encoding?

@ethanrd
Member

ethanrd commented May 2, 2017

As @DennisHeimbigner mentions here Unidata/netcdf4-python#654 (comment), this proposal deals only with char or String typed variables, not char or String typed attributes.

@jswhit

jswhit commented May 3, 2017

Why wouldn't _Encoding apply to attributes as well as variable data?

@DennisHeimbigner
Collaborator

DennisHeimbigner commented May 3, 2017 via email

@rsignell-usgs
Author

@DennisHeimbigner, I'm guessing you can answer @ethanrd's question:

Should an _Encoding attribute on a char typed variable be restricted to a 7- or 8-bit encoding?

@DennisHeimbigner
Collaborator

DennisHeimbigner commented May 3, 2017

Should an _Encoding attribute on a char typed variable be restricted to a 7- or 8-bit encoding?

If an _Encoding is specified, then that technically determines 7 vs 8 bit. E.g., ASCII is 7-bit, but ISO 8859-1 is 8-bit. The tricky case is when something like UTF-8 encoding is specified. Technically, the single-byte subset of UTF-8 is 7-bit ASCII. But it is clear that some users treat an array of chars as a string, in which case any legal UTF-8 bit pattern should be legal. IMO, we should always treat char as essentially equivalent to unsigned byte, so that a char can hold any 8-bit pattern and, e.g., _Encoding = "iso-8859-1" is legal and does not lose information.
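A quick Python illustration of this point, using only the standard `bytes` codecs: Latin-1 preserves every 8-bit pattern, while UTF-8 does not accept arbitrary bytes.

```python
# Round-trip check: ISO-8859-1 (Latin-1) maps byte value n to code point n,
# so any 8-bit pattern survives decode/encode unchanged.
data = bytes(range(256))
assert data.decode("iso-8859-1").encode("iso-8859-1") == data

# Not so for UTF-8: 0xC3 alone starts a two-byte sequence and is invalid.
try:
    bytes([0xC3]).decode("utf-8")
    lone_c3_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_c3_is_valid_utf8 = False
print(lone_c3_is_valid_utf8)  # -> False
```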

@ethanrd
Member

ethanrd commented May 3, 2017

I suggested always indicating the encoding/charset with the same attribute (_Encoding), whether string or character, thinking it simplified things. Now that restrictions on allowed encodings have come up, I'm seeing the wisdom of @BobSimons's original proposal to the CF list: one attribute that gives a string encoding for use when interpreting strings, and one that gives the character set for use when interpreting individual characters. (It avoids the need for different restrictions on the value of _Encoding depending on the situation.)

So, as an alternate to the above proposal, I'll restate Bob's proposal here with a change or two given the target is the NUG rather than CF:

  • Use the _CharSet variable attribute to indicate that a char array should be interpreted as individual 8-bit characters. The value of the attribute gives the 8-bit character set to use when interpreting the 8-bit characters. (E.g., 'ISO-8859-15'.)

  • Use the _Encoding variable attribute to indicate which character encoding should be used to interpret a string variable. Used with a char array, the attribute indicates that it should be interpreted as a string (or an array of strings).


Reviewing Bob's original proposal brought up a number of questions on how the netCDF-4 and HDF5 libraries handle string encoding (if they enforce the encoding or not, etc.). I'm still digging and will report back when I get somewhere.

Also, there was some question in the CF discussion on whether an explicit indicator was needed to differentiate between whether a char array should be interpreted as individual 8-bit characters or as a string(s). Since the current proposal is suggesting a change to the NUG, I'm not sure if this question will or should play out the same as in a CF discussion.

@rsignell-usgs
Author

@lesserwhirls, do you have any thoughts here? @ethanrd are you still looking at this, or can we propose the above changes to NUG?

@DennisHeimbigner
Collaborator

I do not understand the need for the _CharSet attribute. The type of the variable (char vs String) and the _Encoding attribute seem to me to encompass _CharSet. That is, _Encoding for a char-typed variable == _CharSet.

@rsignell-usgs
Author

rsignell-usgs commented May 8, 2017

@DennisHeimbigner, the problem is that while netcdf4 has char or string, netcdf3 has only char. So we don't know whether a netcdf3 char variable holds a string or an array of 8-bit characters.

@BobSimons

@DennisHeimbigner, this is an alternative proposal.
In this proposal _CharSet and _Encoding apply to different situations, have different options, and are used differently (mandatory vs optional):

_CharSet would be for char variables when they are to be interpreted as individual chars.
The options are ISO-8859-1 and ISO-8859-15.
_CharSet is mandatory if the chars should be interpreted as individual chars.

_Encoding would be for String variables (e.g., in nc4) and char variables in nc3 which should be interpreted as Strings.
The options are ISO-8859-1, ISO-8859-15, and UTF-8. [Different!]
_Encoding is optional. The default is UTF-8. [Different!]

A further advantage is that only one attribute is needed per variable, not two.

Think of it from a software reader's point of view:

  • Is there a _CharSet attribute? Then these are chars and I now know the charset.
  • Is there an _Encoding attribute? Then these are strings and I now know the encoding.
  • Is there neither? Then these are UTF-8 strings.
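That reader's-eye view can be sketched directly (hypothetical helper; attribute names are from Bob's proposal, and `attrs` is a plain dict standing in for the variable's netCDF attributes):

```python
# Sketch of the three-step reader logic under Bob's alternative proposal.
def classify(attrs):
    if "_CharSet" in attrs:                   # individual chars, charset given
        return ("chars", attrs["_CharSet"])
    if "_Encoding" in attrs:                  # strings, encoding given
        return ("strings", attrs["_Encoding"])
    return ("strings", "UTF-8")               # neither: assume UTF-8 strings

print(classify({"_CharSet": "ISO-8859-1"}))  # -> ('chars', 'ISO-8859-1')
print(classify({}))                          # -> ('strings', 'UTF-8')
```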

@DennisHeimbigner
Collaborator

I think the term "mandatory" is being misused here since a default is defined. But the real issue in Bob's proposal has to do with whether a character-typed variable (or attribute?) is to be treated as if it were a surrogate for the lack of a String type in netcdf-3, and whether we want a special attribute to mark that case. Personally undecided on that issue.

@DennisHeimbigner
Collaborator

One other question. If we had an attribute to indicate that a char array should be treated like a string, do we want to limit the use of that attribute to netcdf-3 only? Since netcdf-4 has a string type, that attribute is technically not needed there.

@ethanrd
Member

ethanrd commented May 8, 2017

@BobSimons Given the backward compatibility issues, I'm not sure the NUG should specify how character arrays are interpreted when the proposed attributes are not used. At least not at the level of a MUST.

@rsignell-usgs
Author

The proposed default behavior is to assume that a netcdf3 char array is a string.
With Bob's proposal, if a _CharSet attribute is found, we know it's not a string.

@BobSimons

With the original proposal, an nc3 file might have:

  someMonths(a=5, b=10)
    _CharType="STRING"
    _Encoding="UTF-8"
  someStatus(c=4, d=2)
    _CharType="CHAR_ARRAY"
    _Encoding="ISO-8859-1"

With the alternative proposal, that nc3 file would have:

  someMonths(a=5, b=10)
    _Encoding="UTF-8"
  someStatus(c=4, d=2)
    _CharSet="ISO-8859-1"

because _Encoding now says two things (this var is a String var and the encoding is ...)
and _CharSet likewise says two things (this var has individual chars and the charset is ...).

@thehesiod
Contributor

@BobSimons I don't think the default for NC3 can be UTF-8 because there are existing NC3 files w/o _Encoding which are not UTF-8. Existing NC3 files w/o _Encoding are ambiguous (broken) for strings due to bad spec.

@rsignell-usgs
Author

I think what @BobSimons means is that the convention will be to assume string and UTF-8 for char arrays without any attributes because, as you say, it's ambiguous, and software will have to do something!

@BobSimons

BobSimons commented May 9, 2017 via email

@dopplershift
Member

If utf-8 is an option, why are we restricting the rest of the list to iso-8859-1/15?

@BobSimons

BobSimons commented May 9, 2017 via email

@dopplershift
Member

Well, the problem with ISO-8859-1 (aka Latin-1) and ISO-8859-15 (aka Latin-9) is that they're distinctly focused on Western European languages. We should at a minimum look at something like KOI8-R and CP1251 to encompass Eastern European/Cyrillic characters. You should also be able to declare ASCII itself to indicate that you only intend to use the lowest 7 bits.

@BobSimons

BobSimons commented May 9, 2017 via email

@jswhit

jswhit commented May 16, 2017

for multidimensional character arrays that are to be interpreted as strings, is there a standard way to interpret the dimensions? Should the last dimension be interpreted as the length of the strings? If so, is there a convention for naming that dimension?

@BobSimons

BobSimons commented May 16, 2017 via email

@rsignell-usgs
Author

@dopplershift , are you satisfied with the explanation @BobSimons provided?
I'd like to push this one to closure and not just leave it hanging...

@jswhit

jswhit commented May 17, 2017

I added automatic detection of the _Encoding attribute in netcdf4-python (Unidata/netcdf4-python#665). For string variables, if _Encoding is set it is used to encode the strings into bytes when writing to the file, and to decode the bytes into strings when reading from the file. If _Encoding is not specified, utf-8 is used (which was the previous behavior). When reading data from character variables _Encoding is used to convert the character array to an array of fixed length strings, assuming the last dimension is the length of the strings. When writing data to character variables, _Encoding is used to encode the string arrays into bytes, creating an array of individual bytes with one more dimension. For character variables, if _Encoding is not set, an array of bytes is returned.
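The char-array conversion described above can be roughly sketched in plain Python. The helper names here are hypothetical, and netcdf4-python itself operates on numpy arrays rather than nested lists, but the shape logic (collapsing or adding the rightmost dimension) is the same idea:

```python
# Rough sketch of the conversion netcdf4-python performs when _Encoding is
# set on a char variable. Helper names are hypothetical illustrations.
def chars_to_strings(char_rows, encoding="utf-8"):
    # Collapse the rightmost dimension: each row of single-byte elements
    # becomes one fixed-length string, decoded and stripped of pad blanks.
    return [b"".join(row).decode(encoding).rstrip() for row in char_rows]

def strings_to_chars(strings, width, encoding="utf-8"):
    # Inverse: encode each string, pad with blanks to the rightmost
    # dimension's length, and split into individual bytes (one extra dim).
    return [[s.encode(encoding).ljust(width)[i:i + 1] for i in range(width)]
            for s in strings]

chars = strings_to_chars(["foo", "barbaz"], width=6)
print(chars_to_strings(chars))  # -> ['foo', 'barbaz']
```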

@BobSimons

BobSimons commented May 17, 2017 via email

@dopplershift
Member

@rsignell-usgs @BobSimons
Well, I was thinking ASCII was a nice option for "I don't care about the 8th bit", but I can see the rationale behind forcing a choice for the 8th bit--I'm just guessing most users are not going to care about, or even understand, anything beyond ASCII and are just going to pick the option that lets them write without errors. Either way is fine, so long as we make our restricted list as inclusive as possible--I was just trying to make us less US/Western Europe-centric.

@DennisHeimbigner
Collaborator

What I wish were the case is this:

  1. The char type is an alias for unsigned byte - this guarantees that all 8 bits of data must always
    be preserved.
  2. The _Encoding is a suggestion about how programs (ncdump, etc.) should interpret
    the characters in the event that they have to print them (or read them from text).
    The same should also hold for strings, in that strings are equivalent to variable-length
    sequences of unsigned bytes.

The reason I wish this were the case is that the _Encoding is AFAIK irrelevant except
when reading or writing text.

@DennisHeimbigner
Collaborator

Also, with respect to using the rightmost dim to encode (fixed-length) strings: this is purely
an external convention and is certainly not part of the netcdf spec. This raises a question:
who actually makes use of this convention? I know only one place: the conversion of DAP2
string-typed vars into netcdf-3 character-typed variables. Is it used anywhere else?

@BobSimons

BobSimons commented May 17, 2017 via email

@DennisHeimbigner
Collaborator

WRT jswhit's proposal above.

  1. As part of the python strings -> netcdf-4 string translation rules, the rule for strings
    seems reasonable.
  2. The character part of that proposal is relevant only to the issue of python strings <->
    netcdf-3 char array translation rules. It is consistent with DAP2 and CF translation rules.
  3. For translating netcdf-4 char arrays to python, #2 also seems appropriate.

I have a question for the python people. Is there any situation in which a python
string would be translated into a netcdf-4 char array? I infer that this case is prohibited
under the jswhit rules for python <-> netcdf-4.

@DennisHeimbigner
Collaborator

At this point, there seems to be agreement about strings: _Encoding specifies
the character set and if missing, utf-8 should be assumed.

So we can focus on the character type as an eight bit value. I am not concerned here
with translation rules (e.g. python strings <-> netcdf character arrays).

  1. _Encoding applies to individual 8-bit characters, but the only legal _Encodings are
    those that are inherently 8-bit or less: ISO 8859 and ASCII being prevalent.
    Converting a vector of such characters to a string (via some rule) should produce a
    legal string of that encoding.
  2. _Encoding applies to individual 8-bit characters and specifies only the expected bit patterns.
    This allows a UTF-8 _Encoding, since the set of legal UTF-8 bit patterns is known.
    Note that this does not mean that converting (via some rule) a vector of chars to a String
    would necessarily produce a legitimate UTF-8 encoded string; the default is
    to allow any 8-bit pattern.

Personally I prefer #2, since it at least allows (again via some reasonable rule) converting
a UTF-8 string to a vector of 8-bit characters.
Choosing #1 would preclude that possibility entirely, and an error would have to be thrown.

@jswhit

jswhit commented May 17, 2017

@BobSimons, regarding your comment that the python implementation deviates from your original proposal...

In the situation where a user tries to write an array of python fixed-length strings to a character variable with _Encoding set, the python interface will convert that array of fixed-length strings to an array of single characters (bytes) with one more dimension (equal to the length of the fixed-length strings, and the rightmost dimension of the character variable), and then write that array of characters to the file.

I thought this was in the spirit of the CF convention - and it is what a user would have to do manually to write the strings to the character variable. One could certainly argue that this is too much 'magic', though.

The same happens in reverse when data is read from a char variable with _Encoding set.

@jswhit

jswhit commented May 17, 2017

@DennisHeimbigner, regarding your question "Is there any situation in which a python
string would be translated into a netcdf-4 char array?"...

The answer is yes, if you are writing a single string into a character array like this

>>> v
<type 'netCDF4._netCDF4.Variable'>
|S1 strings(n1, n2, nchar)
    _Encoding: ascii
unlimited dimensions: n1
current shape = (0, 10, 12)
filling on, default _FillValue of  used
>>> v[0,0,:] = 'foobar'

The string foobar will get converted into an array of 12 characters (with trailing blanks appended) and then written to the file, resulting in:

netcdf tst_stringarr {
dimensions:
	n1 = UNLIMITED ; // (1 currently)
	n2 = 10 ;
	nchar = 12 ;
variables:
	char strings(n1, n2, nchar) ;
		strings:_Encoding = "ascii" ;
data:

 strings =
  "foobar",

@BobSimons

BobSimons commented May 17, 2017 via email

@jswhit

jswhit commented May 17, 2017

For nc3 or nc4 files, if _Encoding is not set, the individual chars will be returned by the python interface without collapsing the rightmost dimension. I presume this is the case for those ARGO files. I thought from your proposal that if _Encoding was set, then the client should interpret the char array as strings. Did I misread that?

@BobSimons

BobSimons commented May 18, 2017 via email

@rsignell-usgs
Author

@BobSimons, would @jswhit's approach with NetCDF-Python work for you in ERDDAP to disambiguate string and char array handling in NetCDF3 and NetCDF4?

Seems like it does, right?

@BobSimons

BobSimons commented Jun 5, 2017 via email

@rsignell-usgs
Author

Okay, I'll discuss with you when you get back from vacation.
