-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Roundtrip unicode strings even when written as character arrays #1648
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shoyer - looks pretty good. I had a few small comments. I should say, I tend to gloss over pretty quickly when looking at Python's string encoding stuff so hopefully someone else can review some of the nitty-gritty logic/tests.
doc/io.rst
Outdated
|
||
xarray can write unicode strings to netCDF files in two ways: | ||
|
||
- As variable lengths strings. This is only supported on netCDF4 (HDF5) files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To use the nc4/hdf lingo, perhaps "variable length arrays" is clearer. Also, typo in "lengths".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
netCDF4-python refers to "VLEN strings" in a section entitled "Variable length (VLEN) types": http://unidata.github.io/netcdf4-python/#section11
and h5py talks about "variable-length UTF-8":
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
These both sound like variable length strings to me.
(I fixed the typo.)
xarray/backends/h5netcdf_.py
Outdated
'(https://github.com/Unidata/netcdf4-python/issues/730). ' | ||
"Either remove '_FillValue' from encoding on variable %r " | ||
"or set {'dtype': 'S1'} in encoding to use the fixed width " | ||
'NC_CHAR type.' % name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is raised in the h5netcdf backend. It may be confusing to get an error that says, "netCDF4 doesn't support what you're trying to do." If you think it is important to call out netCDF4 here, perhaps lump both backends together, i.e. "the h5netcdf/netCDF4 backends do not yet support..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed to refer to an h5netcdf issue instead.
@@ -1389,6 +1408,10 @@ def roundtrip(self, data, save_kwargs={}, open_kwargs={}, | |||
allow_cleanup_failure=False): | |||
yield data.chunk() | |||
|
|||
def test_roundtrip_string_encoded_characters(self): | |||
# Override method in DatasetIOTestCases - not applicable to dask | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use pytest.skip()
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a skipped test, so much as a test that really shouldn't exist at all :). We slightly abuse the notion of a "backend" for dask here.
xarray/tests/test_backends.py
Outdated
with self.roundtrip(original) as actual: | ||
self.assertDatasetIdentical(expected, actual) | ||
except NotImplementedError: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we use pytest.skip()
here and the pass statement above.
e854bdc
to
f8142a3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I cleaned up my _FillValue
tests to explicit check for errors -- that seems like a better approach for now, rather than marking them as expected failures or skipping them.
doc/io.rst
Outdated
|
||
xarray can write unicode strings to netCDF files in two ways: | ||
|
||
- As variable lengths strings. This is only supported on netCDF4 (HDF5) files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
netCDF4-python refers to "VLEN strings" in a section entitled "Variable length (VLEN) types": http://unidata.github.io/netcdf4-python/#section11
and h5py talks about "variable-length UTF-8":
http://docs.h5py.org/en/latest/strings.html#variable-length-utf-8
These both sound like variable length strings to me.
(I fixed the typo.)
xarray/backends/h5netcdf_.py
Outdated
'(https://github.com/Unidata/netcdf4-python/issues/730). ' | ||
"Either remove '_FillValue' from encoding on variable %r " | ||
"or set {'dtype': 'S1'} in encoding to use the fixed width " | ||
'NC_CHAR type.' % name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Fixed to refer to an h5netcdf issue instead.
@@ -1389,6 +1408,10 @@ def roundtrip(self, data, save_kwargs={}, open_kwargs={}, | |||
allow_cleanup_failure=False): | |||
yield data.chunk() | |||
|
|||
def test_roundtrip_string_encoded_characters(self): | |||
# Override method in DatasetIOTestCases - not applicable to dask | |||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is a skipped test, so much as a test that really shouldn't exist at all :). We slightly abuse the notion of a "backend" for dask here.
@jhamman This reminds me of myself three years ago when I wrote these original messy tests! The mostly but not entirely compatible semantics of Python 3, NumPy, netCDF and HDF5 makes this very hard to get right. |
Unicode strings (
str
on Python 3) are now round-tripped successfully even when written as character arrays (e.g., as netCDF3 files or when usingengine='scipy'
). This is controlled by the_Encoding
attribute convention, which is also understood directly by the netCDF4-Python interface.This PR also resolves some long-standing technical debt in the test suite related to the hacky use of
decode_bytes
inassert_allclose
(recently encountered by @jhamman in #1609). Once we're sure that we don't need it anymore, I'd like to deprecate and eventually remove thedecode_bytes
option.Note that there are still a few unresolved issues with regards to serializing missing values in strings, so I've intentionally held off on documenting the handling of
_FillValue
for now. I'd like to resolve those separately after discussion in #1647, but ideally this could make it in for the v0.10 release.open_dataset
#1638git diff upstream/master | flake8 --diff
whats-new.rst
for all changes andapi.rst
for new API