
Conventions for string and character array encoding #402

rsignell-usgs opened this issue May 2, 2017 · 42 comments
@rsignell-usgs

rsignell-usgs commented May 2, 2017

As discussed here Unidata/netcdf4-python#654 (comment), there is a need for conventions to specify the encoding of strings and character arrays in netcdf.

There is also a need to specify whether char arrays in NetCDF3 contain strings or character arrays.

@BobSimons addressed these issues in an enhancement to the CF conventions that would specify charset for NetCDF3 and _Encoding for NetCDF4. The Unidata gang (@DennisHeimbigner, @WardF, @ethanrd and @cwardgar) agreed with the concept but suggested this be handled in the NUG, and we came up with this slightly different proposal that would still accomplish Bob's goal of making it easy for software to figure out what is stuffed in those char or string arrays!

Proposal:

  • Use _CharType variable attribute with allowed values ['STRING', 'CHAR_ARRAY'] to specify if a char array variable should be interpreted as a string or as an array of individual characters. If _CharType is missing, default is 'STRING'.
  • Use _Encoding variable attribute with allowed values ['ISO-8859-1', 'ISO-8859-15', 'UTF-8'] to specify the encoding. If _Encoding is missing for _CharType='STRING', default is 'UTF-8'. If _Encoding is missing for _CharType='CHAR_ARRAY', default is 'ISO-8859-15'.
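The reader-side logic implied by this proposal can be sketched in a few lines of Python. The function name and the plain `attrs` dict (standing in for a variable's netCDF attributes) are hypothetical; the attribute names and defaults come from the proposal above:

```python
# Sketch of reader-side logic for the proposed convention. `attrs` is a
# hypothetical stand-in for a netCDF variable's attribute dictionary.
def interpret_char_variable(attrs):
    char_type = attrs.get("_CharType", "STRING")  # default per the proposal
    if char_type == "STRING":
        encoding = attrs.get("_Encoding", "UTF-8")
    elif char_type == "CHAR_ARRAY":
        encoding = attrs.get("_Encoding", "ISO-8859-15")
    else:
        raise ValueError("unknown _CharType: %r" % char_type)
    return char_type, encoding

# A char variable with no attributes at all is read as a UTF-8 string:
print(interpret_char_variable({}))  # -> ('STRING', 'UTF-8')
```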
@ethanrd
Member

ethanrd commented May 2, 2017

Should an _Encoding attribute on a 'char' typed variable be restricted to a 7- or 8-bit encoding?

@ethanrd
Member

ethanrd commented May 2, 2017

As @DennisHeimbigner mentions here Unidata/netcdf4-python#654 (comment), this proposal deals only with char or String typed variables, not char or String typed attributes.

@jswhit

jswhit commented May 3, 2017

Why wouldn't _Encoding apply to attributes as well as variable data?

@DennisHeimbigner
Collaborator

DennisHeimbigner commented May 3, 2017 via email

@rsignell-usgs
Author

@DennisHeimbigner, I'm guessing you can answer @ethanrd's question:

Should an _Encoding attribute on a char typed variable be restricted to a 7- or 8-bit encoding?

@DennisHeimbigner
Collaborator

DennisHeimbigner commented May 3, 2017

Should an _Encoding attribute on a char typed variable be restricted to a 7- or 8-bit encoding?

If an _Encoding is specified, then that technically determines 7 vs 8 bit. E.g., ASCII is 7-bit, but ISO 8859-1 is 8-bit. The tricky case is when something like UTF-8 encoding is specified. Technically, the single-byte subset of UTF-8 is 7-bit ASCII. But it is clear that some users treat an array of chars as a string, in which case any legal UTF-8 bit pattern should be legal. IMO, we should always treat char as essentially equivalent to unsigned byte, so that a char can hold any 8-bit pattern and, e.g., _Encoding = "iso-8859-1" is legal and does not lose information.
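A quick Python illustration of this point, using only the standard `bytes` codecs: Latin-1 preserves every 8-bit pattern, while UTF-8 does not accept arbitrary bytes.

```python
# Round-trip check: ISO-8859-1 (Latin-1) maps byte value n to code point n,
# so any 8-bit pattern survives decode/encode unchanged.
data = bytes(range(256))
assert data.decode("iso-8859-1").encode("iso-8859-1") == data

# Not so for UTF-8: 0xC3 alone starts a two-byte sequence and is invalid.
try:
    bytes([0xC3]).decode("utf-8")
    lone_c3_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_c3_is_valid_utf8 = False
print(lone_c3_is_valid_utf8)  # -> False
```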

@ethanrd
Member

ethanrd commented May 3, 2017

I suggested always indicating the encoding/charset with the same attribute (_Encoding), whether string or character, thinking it simplified things. Now that restrictions on allowed encodings have come up, I'm seeing the wisdom of @BobSimons's original proposal to the CF list: one attribute that gives a string encoding for use when interpreting strings, and one that gives the character set for use when interpreting individual characters. (It avoids the need for different restrictions on the value of _Encoding depending on the situation.)

So, as an alternate to the above proposal, I'll restate Bob's proposal here with a change or two given the target is the NUG rather than CF:

  • Use the _CharSet variable attribute to indicate that a char array should be interpreted as individual 8-bit characters. The value of the attribute gives the 8-bit character set to use when interpreting the 8-bit characters. (E.g., 'ISO-8859-15'.)

  • Use the _Encoding variable attribute to indicate which character encoding should be used to interpret a string variable. Used with a char array, the attribute indicates that it should be interpreted as a string (or an array of strings).


Reviewing Bob's original proposal brought up a number of questions on how the netCDF-4 and HDF5 libraries handle string encoding (if they enforce the encoding or not, etc.). I'm still digging and will report back when I get somewhere.

Also, there was some question in the CF discussion on whether an explicit indicator was needed to differentiate between whether a char array should be interpreted as individual 8-bit characters or as a string(s). Since the current proposal is suggesting a change to the NUG, I'm not sure if this question will or should play out the same as in a CF discussion.

@rsignell-usgs
Author

@lesserwhirls, do you have any thoughts here? @ethanrd are you still looking at this, or can we propose the above changes to NUG?

@DennisHeimbigner
Collaborator

I do not understand the need for the _CharSet attribute. The type of the variable (char vs String) and the _Encoding attribute seem to me to encompass _CharSet. That is, _Encoding for a char-typed variable == _CharSet.

@rsignell-usgs
Author

rsignell-usgs commented May 8, 2017

@DennisHeimbigner, the problem is that while netcdf4 has char or string, netcdf3 has only char. So we don't know whether a netcdf3 char variable holds a string or an array of 8-bit characters.

@BobSimons

@DennisHeimbigner, this is an alternative proposal.
In this proposal _CharSet and _Encoding apply to different situations, have different options, and are used differently (mandatory vs optional):

_CharSet would be for char variables when they are to be interpreted as individual chars.
The options are ISO-8859-1 and ISO-8859-15.
_CharSet is mandatory if the chars should be interpreted as individual chars.

_Encoding would be for String variables (e.g., in nc4) and char variables in nc3 which should be interpreted as Strings.
The options are ISO-8859-1, ISO-8859-15, and UTF-8. [Different!]
_Encoding is optional. The default is UTF-8. [Different!]

A further advantage is that only one attribute is needed per variable, not two.

Think of it from a software reader's point of view:

  • Is there a _CharSet attribute? Then these are chars and I now know the charset.
  • Is there an _Encoding attribute? Then these are strings and I now know the encoding.
  • Is there neither? Then these are UTF-8 strings.
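That reader's-eye view can be sketched directly (hypothetical helper; attribute names are from Bob's proposal, and `attrs` is a plain dict standing in for the variable's netCDF attributes):

```python
# Sketch of the three-step reader logic under Bob's alternative proposal.
def classify(attrs):
    if "_CharSet" in attrs:                   # individual chars, charset given
        return ("chars", attrs["_CharSet"])
    if "_Encoding" in attrs:                  # strings, encoding given
        return ("strings", attrs["_Encoding"])
    return ("strings", "UTF-8")               # neither: assume UTF-8 strings

print(classify({"_CharSet": "ISO-8859-1"}))  # -> ('chars', 'ISO-8859-1')
print(classify({}))                          # -> ('strings', 'UTF-8')
```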

@DennisHeimbigner
Collaborator

I think the term "mandatory" is being misused here since a default is defined. But the real issue in Bob's proposal has to do with whether a character-typed variable (or attribute?) is to be treated as if it were a surrogate for the lack of a String type in netcdf-3, and whether we want a special attribute to mark that case. Personally undecided on that issue.

@DennisHeimbigner
Collaborator

One other question. If we had an attribute to indicate that a char array should be treated like a string, do we want to limit the use of that attribute to netcdf-3 only? Since netcdf-4 has a string type, that attribute is technically not needed there.

@ethanrd
Member

ethanrd commented May 8, 2017

@BobSimons Given the backward compatibility issues, I'm not sure the NUG should specify how character arrays are interpreted when the proposed attributes are not used. At least not at the level of a MUST.

@rsignell-usgs
Author

The proposed default behavior is to assume that a netcdf3 char array is a string.
With Bob's proposal, if a _CharSet attribute is found, we know it's not a string.

@BobSimons

With the original proposal, an nc3 file might have:

  someMonths(a=5, b=10)
    _CharType="STRING"
    _Encoding="UTF-8"
  someStatus(c=4, d=2)
    _CharType="CHAR_ARRAY"
    _Encoding="ISO-8859-1"

With the alternative proposal, that nc3 file would have:

  someMonths(a=5, b=10)
    _Encoding="UTF-8"
  someStatus(c=4, d=2)
    _CharSet="ISO-8859-1"

because _Encoding now says two things (this var is a String var and the encoding is ...)
and _CharSet likewise says two things (this var has individual chars and the charset is ...).

@thehesiod
Contributor

@BobSimons I don't think the default for NC3 can be UTF-8 because there are existing NC3 files w/o _Encoding which are not UTF-8. Existing NC3 files w/o _Encoding are ambiguous (broken) for strings due to bad spec.

@rsignell-usgs
Author

I think what @BobSimons means is that the convention will be to assume string and UTF-8 for char arrays without any attributes because, as you say, it's ambiguous, and software will have to do something!

@BobSimons

BobSimons commented May 9, 2017 via email

@dopplershift
Member

If utf-8 is an option, why are we restricting the rest of the list to iso-8859-1/15?

@BobSimons

BobSimons commented May 9, 2017 via email

@dopplershift
Member

Well, the problem with ISO-8859-1 (aka Latin-1) and ISO-8859-15 (aka Latin-9) is that they're distinctly focused on Western European languages. We should at a minimum look at something like KOI8-R and CP1251 to encompass Eastern European/Cyrillic characters. You should also be able to declare ASCII itself to indicate that you only intend to use the lowest 7 bits.

@BobSimons

BobSimons commented May 9, 2017 via email

@jswhit

jswhit commented May 16, 2017

for multidimensional character arrays that are to be interpreted as strings, is there a standard way to interpret the dimensions? Should the last dimension be interpreted as the length of the strings? If so, is there a convention for naming that dimension?

@BobSimons

BobSimons commented May 16, 2017 via email

@rsignell-usgs
Author

@dopplershift , are you satisfied with the explanation @BobSimons provided?
I'd like to push this one to closure and not just leave it hanging...

@jswhit

jswhit commented May 17, 2017

I added automatic detection of the _Encoding attribute in netcdf4-python (Unidata/netcdf4-python#665). For string variables, if _Encoding is set it is used to encode the strings into bytes when writing to the file, and to decode the bytes into strings when reading from the file. If _Encoding is not specified, utf-8 is used (which was the previous behavior). When reading data from character variables _Encoding is used to convert the character array to an array of fixed length strings, assuming the last dimension is the length of the strings. When writing data to character variables, _Encoding is used to encode the string arrays into bytes, creating an array of individual bytes with one more dimension. For character variables, if _Encoding is not set, an array of bytes is returned.
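The char-array conversion described above can be roughly sketched in plain Python. The helper names here are hypothetical, and netcdf4-python itself operates on numpy arrays rather than nested lists, but the shape logic (collapsing or adding the rightmost dimension) is the same idea:

```python
# Rough sketch of the conversion netcdf4-python performs when _Encoding is
# set on a char variable. Helper names are hypothetical illustrations.
def chars_to_strings(char_rows, encoding="utf-8"):
    # Collapse the rightmost dimension: each row of single-byte elements
    # becomes one fixed-length string, decoded and stripped of pad blanks.
    return [b"".join(row).decode(encoding).rstrip() for row in char_rows]

def strings_to_chars(strings, width, encoding="utf-8"):
    # Inverse: encode each string, pad with blanks to the rightmost
    # dimension's length, and split into individual bytes (one extra dim).
    return [[s.encode(encoding).ljust(width)[i:i + 1] for i in range(width)]
            for s in strings]

chars = strings_to_chars(["foo", "barbaz"], width=6)
print(chars_to_strings(chars))  # -> ['foo', 'barbaz']
```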

@BobSimons

BobSimons commented May 17, 2017 via email

@dopplershift
Member

@rsignell-usgs @BobSimons
Well, I was thinking ASCII was a nice option for "I don't care about the 8th bit", but I can see the rationale behind forcing a choice for the 8th bit--I'm just guessing most users are not going to care about, or even understand, anything beyond ASCII and are just going to pick the option that lets them write without errors. Either way is fine, so long as we make our restricted list as inclusive as possible--I was just trying to make us less US/Western Europe-centric.

@DennisHeimbigner
Collaborator

What I wish were the case is this:

  1. The char type is an alias for unsigned byte - this guarantees that all 8 bits of data must always
    be preserved.
  2. The _Encoding is a suggestion about how programs (ncdump, etc.) should interpret
    the characters in the event that they have to print them (or read them from text).
    The same should also hold for strings, in that strings are equivalent to variable-length
    sequences of unsigned bytes.

The reason I wish this were the case is that the _Encoding is AFAIK irrelevant except
when reading or writing text.

@DennisHeimbigner
Collaborator

Also, with respect to using the rightmost dim to encode (fixed-length) strings: this is purely
an external convention and is certainly not part of the netcdf spec. This raises a question:
who actually makes use of this convention? I know only one place: the conversion of DAP2
string-typed vars into netcdf-3 character-typed variables. Is it used anywhere else?

@BobSimons

BobSimons commented May 17, 2017 via email

@DennisHeimbigner
Collaborator

WRT jswhit's proposal above.

  1. As part of the python strings -> netcdf-4 string translation rules, the rule for strings
    seems reasonable.
  2. The character part of that proposal is relevant only to the issue of python strings <->
    netcdf-3 char array translation rules. It is consistent with DAP2 and CF translation rules.
  3. For translating netcdf-4 char arrays to python, #2 also seems appropriate.

I have a question for the python people. Is there any situation in which a python
string would be translated into a netcdf-4 char array? I infer that this case is prohibited
under the jswhit rules for python <-> netcdf-4.

@DennisHeimbigner
Collaborator

At this point, there seems to be agreement about strings: _Encoding specifies
the character set and if missing, utf-8 should be assumed.

So we can focus on the character type as an eight bit value. I am not concerned here
with translation rules (e.g. python strings <-> netcdf character arrays).

  1. _Encoding applies to individual 8-bit characters, but the only legal _Encodings are
    those that are inherently 8-bit or less: ISO 8859 and ASCII being prevalent.
    Converting a vector of such characters to a string (via some rule) should produce a
    legal string of that encoding.
  2. _Encoding applies to individual 8-bit characters and specifies only the expected bit patterns.
    This allows a UTF-8 _Encoding, since the set of legal UTF-8 bit patterns is known.
    Note that this does not mean that converting (via some rule) a vector of chars to a String
    would necessarily produce a legitimate UTF-8 encoded string; the default is
    to allow any 8-bit pattern.

Personally I prefer #2, since it at least allows (again via some reasonable rule) converting
a UTF-8 string to a vector of 8-bit characters.
Choosing #1 would preclude that possibility entirely, and an error would have to be thrown.

@jswhit

jswhit commented May 17, 2017

@BobSimons, regarding your comment that the python implementation deviates from your original proposal...

In the situation where a user tries to write an array of python fixed-length strings to a character variable with _Encoding set, the python interface will convert that array of fixed-length strings to an array of single characters (bytes) with one more dimension (equal to the length of the fixed-length strings, and the rightmost dimension of the character variable), and then write that array of characters to the file.

I thought this was in the spirit of the CF convention - and it is what a user would have to do manually to write the strings to the character variable. One could certainly argue that this is too much 'magic', though.

The same happens in reverse when data is read from a char variable with _Encoding set.

@jswhit

jswhit commented May 17, 2017

@DennisHeimbigner, regarding your question "Is there any situation in which a python
string would be translated into a netcdf-4 char array?"...

The answer is yes, if you are writing a single string into a character array like this

>>> v
<type 'netCDF4._netCDF4.Variable'>
|S1 strings(n1, n2, nchar)
    _Encoding: ascii
unlimited dimensions: n1
current shape = (0, 10, 12)
filling on, default _FillValue of  used
>>> v[0,0,:] = 'foobar'

The string foobar will get converted into an array of 12 characters (with trailing blanks appended) and then written to the file, resulting in:

netcdf tst_stringarr {
dimensions:
	n1 = UNLIMITED ; // (1 currently)
	n2 = 10 ;
	nchar = 12 ;
variables:
	char strings(n1, n2, nchar) ;
		strings:_Encoding = "ascii" ;
data:

 strings =
  "foobar",

@BobSimons

BobSimons commented May 17, 2017 via email

@jswhit

jswhit commented May 17, 2017

For nc3 or nc4 files, if _Encoding is not set, the individual chars will be returned by the python interface without collapsing the rightmost dimension. I presume this is the case for those ARGO files. I thought from your proposal that if _Encoding was set, then the client should interpret the char array as strings. Did I misread that?

@BobSimons

BobSimons commented May 18, 2017 via email

@rsignell-usgs
Author

@BobSimons, would @jswhit's approach with NetCDF-Python work for you in ERDDAP to disambiguate string and char array handling in NetCDF3 and NetCDF4?

Seems like it does, right?

@BobSimons

BobSimons commented Jun 5, 2017 via email

@rsignell-usgs
Author

Okay, I'll discuss with you when you get back from vacation.
