Roundtrip unicode strings even when written as character arrays #1648

shoyer · 2017-10-23T07:42:04Z

Unicode strings (str on Python 3) are now round-tripped successfully even when written as character arrays (e.g., as netCDF3 files or when using engine='scipy'). This is controlled by the _Encoding attribute convention, which is also understood directly by the netCDF4-Python interface.
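
As a rough illustration of the idea (not xarray's actual implementation), the sketch below encodes Python strings to UTF-8, stores them as fixed-width rows of single bytes the way an NC_CHAR array would, and relies on an `_Encoding` attribute to decide whether to decode back to `str` on read. The helper names `encode_as_char_array` and `decode_char_array` are hypothetical, and null-byte padding is an assumption made for the example.

```python
def encode_as_char_array(strings, encoding="utf-8"):
    """Encode strings as fixed-width rows of single bytes plus attributes."""
    encoded = [s.encode(encoding) for s in strings]
    # The width is counted in encoded bytes, not in unicode characters.
    width = max((len(b) for b in encoded), default=0)
    # Pad with null bytes, then split each row into single-byte characters.
    rows = [[b.ljust(width, b"\x00")[i:i + 1] for i in range(width)]
            for b in encoded]
    attrs = {"_Encoding": encoding}
    return rows, attrs


def decode_char_array(rows, attrs):
    """Join single-byte characters back into strings (or bytes)."""
    joined = [b"".join(row).rstrip(b"\x00") for row in rows]
    encoding = attrs.get("_Encoding")
    if encoding is None:
        # Without the attribute the text encoding is unknown,
        # so the values stay as bytes.
        return joined
    return [b.decode(encoding) for b in joined]


original = ["abc", "ñandú", "日本語"]
rows, attrs = encode_as_char_array(original)
roundtripped = decode_char_array(rows, attrs)
without_attr = decode_char_array(rows, {})
```

With the attribute present the strings come back as `str`; without it, the same rows come back as byte strings, which is the behavior reported in #1638.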

This PR also resolves some long-standing technical debt in the test suite related to the hacky use of decode_bytes in assert_allclose (recently encountered by @jhamman in #1609). Once we're sure that we don't need it anymore, I'd like to deprecate and eventually remove the decode_bytes option.

Note that there are still a few unresolved issues with regard to serializing missing values in strings, so I've intentionally held off on documenting the handling of _FillValue for now. I'd like to resolve those separately after discussion in #1647, but ideally this could make it in for the v0.10 release.

Closes Unicode strings unexpectedly transformed to byte strings upon open_dataset #1638
Tests added / passed
Passes git diff upstream/master | flake8 --diff
Fully documented, including whats-new.rst for all changes and api.rst for new API

jhamman

@shoyer - looks pretty good. I had a few small comments. I should say, I tend to gloss over pretty quickly when looking at Python's string encoding stuff so hopefully someone else can review some of the nitty-gritty logic/tests.

jhamman · 2017-10-23T14:52:46Z

doc/io.rst

+
+xarray can write unicode strings to netCDF files in two ways:
+
+- As variable lengths strings. This is only supported on netCDF4 (HDF5) files.


To use the nc4/hdf lingo, perhaps "variable length arrays" is clearer. Also, typo in "lengths".

shoyer · 2017-10-24T04:45:41Z

netCDF4-python refers to "VLEN strings" in a section entitled "Variable length (VLEN) types": http://unidata.github.io/netcdf4-python/#section11

and h5py talks about "variable-length UTF-8":
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8

These both sound like variable length strings to me.

(I fixed the typo.)

jhamman · 2017-10-23T14:57:35Z

xarray/backends/h5netcdf_.py

+                '(https://github.com/Unidata/netcdf4-python/issues/730). '
+                "Either remove '_FillValue' from encoding on variable %r "
+                "or set {'dtype': 'S1'} in encoding to use the fixed width "
+                'NC_CHAR type.' % name)


This error is raised in the h5netcdf backend. It may be confusing to get an error that says, "netCDF4 doesn't support what you're trying to do." If you think it is important to call out netCDF4 here, perhaps lump both backends together, i.e. "the h5netcdf/netCDF4 backends do not yet support..."

shoyer · 2017-10-24T06:49:56Z

Good point. Fixed to refer to an h5netcdf issue instead.
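
The situation this error message describes can be sketched as a small validation step. The function name `check_string_fill_value` and its arguments are hypothetical and only illustrate the shape of such a check; this is not the code in the PR, and the first sentence of the message is a placeholder because the quoted hunk does not show it.

```python
def check_string_fill_value(name, is_unicode, encoding):
    """Reject _FillValue on variable-length unicode strings (hypothetical)."""
    uses_fixed_width = encoding.get("dtype") == "S1"
    if is_unicode and "_FillValue" in encoding and not uses_fixed_width:
        raise NotImplementedError(
            # Placeholder opening sentence; the rest follows the quoted hunk.
            "A _FillValue on variable length strings is not supported here. "
            "Either remove '_FillValue' from encoding on variable %r "
            "or set {'dtype': 'S1'} in encoding to use the fixed width "
            "NC_CHAR type." % name)
    return encoding


# Fixed-width character storage is accepted even with a fill value.
ok = check_string_fill_value("x", True, {"dtype": "S1", "_FillValue": ""})

# Variable-length storage with a fill value is rejected, naming the variable.
try:
    check_string_fill_value("x", True, {"_FillValue": ""})
except NotImplementedError as err:
    message = str(err)
```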

jhamman · 2017-10-23T15:41:14Z

xarray/tests/test_backends.py

@@ -1389,6 +1408,10 @@ def roundtrip(self, data, save_kwargs={}, open_kwargs={},
                  allow_cleanup_failure=False):
        yield data.chunk()

+    def test_roundtrip_string_encoded_characters(self):
+        # Override method in DatasetIOTestCases - not applicable to dask
+        pass


Should we use pytest.skip() here?

shoyer · 2017-10-24T16:23:10Z

I'm not sure this is a skipped test, so much as a test that really shouldn't exist at all :). We slightly abuse the notion of a "backend" for dask here.
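
To make the trade-off concrete, here is a minimal sketch using the standard library's `unittest` (with `skipTest` standing in for `pytest.skip()`). The class names are hypothetical. Overriding with `pass` reports the inherited test as passed, while skipping records it as skipped.

```python
import unittest


class DatasetIOCases(unittest.TestCase):
    def test_roundtrip_string_encoded_characters(self):
        self.assertEqual("abc".encode("utf-8").decode("utf-8"), "abc")


class OverrideWithPass(DatasetIOCases):
    def test_roundtrip_string_encoded_characters(self):
        # Not applicable here: the inherited test is silenced
        # and reported as a pass.
        pass


class OverrideWithSkip(DatasetIOCases):
    def test_roundtrip_string_encoded_characters(self):
        # Reported as skipped, with a reason attached.
        self.skipTest("not applicable to this backend")


def run(case):
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result


passed = run(OverrideWithPass)
skipped = run(OverrideWithSkip)
```

Which report is preferable depends on whether the test is considered meaningful for the subclass at all, which is the point made in the reply above.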

jhamman · 2017-10-23T15:41:34Z

xarray/tests/test_backends.py

+            with self.roundtrip(original) as actual:
+                self.assertDatasetIdentical(expected, actual)
+        except NotImplementedError:
+            pass


Should we use pytest.skip() here, and for the pass statement above?

shoyer

I cleaned up my _FillValue tests to explicitly check for errors -- that seems like a better approach for now, rather than marking them as expected failures or skipping them.
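
A minimal sketch of why an explicit check is more informative than swallowing the exception, again using the standard library's `unittest`. The functions `unsupported` and `supported` are hypothetical stand-ins for a backend before and after it gains `_FillValue` support for unicode strings.

```python
import unittest


def unsupported(value):
    raise NotImplementedError("no _FillValue for unicode strings yet")


def supported(value):
    return value


def make_tests(roundtrip):
    class FillValueTests(unittest.TestCase):
        def test_swallowed(self):
            # Passes whether or not the error is raised.
            try:
                roundtrip("")
            except NotImplementedError:
                pass

        def test_explicit(self):
            # Fails once the backend stops raising, which signals that
            # the test should become a real roundtrip check.
            with self.assertRaises(NotImplementedError):
                roundtrip("")

    return FillValueTests


def run(case):
    result = unittest.TestResult()
    unittest.defaultTestLoader.loadTestsFromTestCase(case).run(result)
    return result


today = run(make_tests(unsupported))
later = run(make_tests(supported))
```

Both tests pass against the unsupported backend, but only the explicit one notices when the behavior changes.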

shoyer · 2017-10-24T16:30:14Z

> I should say, I tend to gloss over pretty quickly when looking at Python's string encoding stuff so hopefully someone else can review some of the nitty-gritty logic/tests.

@jhamman This reminds me of myself three years ago when I wrote these original messy tests! The mostly but not entirely compatible semantics of Python 3, NumPy, netCDF and HDF5 make this very hard to get right.

shoyer added 3 commits October 21, 2017 22:22

Decoding support for conventions.CharToStringArray

f11931d

Round unicode strings to disk even with NC_CHAR by using utf-8 encoding

8632cd7

Docs on string encoding

f2a4fa4

shoyer mentioned this pull request Oct 23, 2017

Representing missing values in string arrays on disk #1647

Closed

shoyer requested a review from jhamman October 23, 2017 07:45

jhamman reviewed Oct 23, 2017

Fixup per review (and avoid supporting _FillValue yet for unicode strings)

f8142a3

shoyer force-pushed the ncchar-string-encoding branch from e854bdc to f8142a3 October 24, 2017 16:14

clarify warning

2e8f554

shoyer commented Oct 24, 2017

shoyer added 5 commits October 24, 2017 22:10

Merge branch 'master' into ncchar-string-encoding

cdaace3

Use assert_identical instead of assert_allclose for roundtrip_append tests

f918251

Fix tests; add variable name to more errors

b29aad6

Merge branch 'master' into ncchar-string-encoding

d6694e2

add back in dropped @required_scipy decorator

72ff454

shoyer mentioned this pull request Oct 27, 2017

v0.10 Release #1535

Closed

13 tasks

shoyer merged commit 6390230 into pydata:master Oct 27, 2017

shoyer deleted the ncchar-string-encoding branch October 27, 2017 23:36

olgabot mentioned this pull request Nov 1, 2017

Unicode strings unexpectedly transformed to byte strings upon open_dataset #1638

Closed

delgadom mentioned this pull request Dec 29, 2017

Handle _FillValue in variable-length unicode string variables #1802

Closed

5 tasks

shoyer mentioned this pull request Apr 6, 2018

to_netcdf() to automatically switch to fixed-length strings for compressed variables #2040

Closed

keewis mentioned this pull request Apr 4, 2021

Saving to netCDF with 0D dimension doesn't work #1352

Closed
