diff --git a/doc/io.rst b/doc/io.rst
index 192890e112a..b6652ab5ef1 100644
--- a/doc/io.rst
+++ b/doc/io.rst
@@ -266,6 +266,39 @@ converting ``NaN`` to ``-9999``, we would use
 ``encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}``.
 Compression and decompression with such discretization is extremely fast.
 
+.. _io.string-encoding:
+
+String encoding
+...............
+
+xarray can write unicode strings to netCDF files in two ways:
+
+- As variable-length strings. This is only supported on netCDF4 (HDF5) files.
+- By encoding strings into bytes, and writing encoded bytes as a character
+  array. The default encoding is UTF-8.
+
+By default, we use variable-length strings for compatible files and fall back
+to using encoded character arrays. Character arrays can be selected even for
+netCDF4 files by setting the ``dtype`` field in ``encoding`` to ``S1``
+(corresponding to NumPy's single-character bytes dtype).
+
+If character arrays are used, the string encoding that was used is stored on
+disk in the ``_Encoding`` attribute, which matches an ad-hoc convention
+`adopted by the netCDF4-Python library `_.
+At the time of this writing (October 2017), a standard convention for indicating
+string encoding for character arrays in netCDF files was
+`still under discussion `_.
+Technically, you can use
+`any string encoding recognized by Python `_ if you feel the need to deviate from UTF-8,
+by setting the ``_Encoding`` field in ``encoding``. But
+`we don't recommend it`_.
+
+.. warning::
+
+  By default, missing values in bytes or unicode string arrays (represented by
+  ``NaN`` in xarray) are currently written to disk as empty strings ``''``. Thus
+  missing values will not be restored when data is loaded from disk.
+  This behavior is likely to change in the future (:issue:`1647`).
 
 Chunk based compression
 .......................
@@ -390,7 +423,7 @@ over the network until we look at particular values:
 
 Some servers require authentication before we can access the data. For this
 purpose we can explicitly create a :py:class:`~xarray.backends.PydapDataStore`
-and pass in a `Requests`__ session object. For example for
+and pass in a `Requests`__ session object. For example for
 HTTP Basic authentication::
 
     import xarray as xr
@@ -403,7 +436,7 @@ HTTP Basic authentication::
                                             session=session)
     ds = xr.open_dataset(store)
 
-`Pydap's cas module`__ has functions that generate custom sessions for
+`Pydap's cas module`__ has functions that generate custom sessions for
 servers that use CAS single sign-on. For example, to connect to servers
 that require NASA's URS authentication::
 
diff --git a/doc/whats-new.rst b/doc/whats-new.rst
index e3d63e7f525..03b50b90616 100644
--- a/doc/whats-new.rst
+++ b/doc/whats-new.rst
@@ -73,6 +73,13 @@ Breaking changes
   produce a warning encouraging users to adopt the new syntax.
   By `Daniel Rothenberg `_.
 
+- Unicode strings (``str`` on Python 3) are now round-tripped successfully even
+  when written as character arrays (e.g., as netCDF3 files or when using
+  ``engine='scipy'``) (:issue:`1638`). This is controlled by the ``_Encoding``
+  attribute convention, which is also understood directly by the netCDF4-Python
+  interface. See :ref:`io.string-encoding` for full details.
+  By `Stephan Hoyer `_.
+
 - ``repr`` and the Jupyter Notebook won't automatically compute dask variables.
   Datasets loaded with ``open_dataset`` won't automatically read coords from
   disk when calling ``repr`` (:issue:`1522`).
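
The character-array encoding described in the ``io.rst`` hunk can be illustrated with a small standard-library sketch. This is a conceptual illustration only, not xarray's or netCDF4-Python's implementation: the helper names ``encode_char_array`` and ``decode_char_array`` are hypothetical, and the NUL padding is an assumption chosen to mimic fixed-width byte strings.

```python
# Conceptual sketch only: NOT xarray's implementation.
# encode_char_array / decode_char_array are hypothetical helper names.


def encode_char_array(strings, encoding="utf-8"):
    """Encode strings to bytes and split each one into a row of single bytes."""
    encoded = [s.encode(encoding) for s in strings]
    # All rows share one width: the longest encoded string, in bytes.
    width = max((len(b) for b in encoded), default=0)
    # Assumption: shorter rows are padded with NUL bytes.
    padded = [b.ljust(width, b"\x00") for b in encoded]
    rows = [[row[i:i + 1] for i in range(width)] for row in padded]
    # Record the encoding next to the data, like the ``_Encoding`` attribute.
    attrs = {"_Encoding": encoding}
    return rows, attrs


def decode_char_array(rows, attrs):
    """Join each row of single bytes and decode it with the recorded encoding."""
    encoding = attrs.get("_Encoding", "utf-8")
    return [b"".join(row).rstrip(b"\x00").decode(encoding) for row in rows]


names = ["abc", "日本語", ""]
rows, attrs = encode_char_array(names)
restored = decode_char_array(rows, attrs)
```

Because the encoding name travels with the data, the reader does not have to guess how to turn the bytes back into text. The empty string in ``names`` also shows why the warning in the hunk matters: once a missing value has been written as ``''``, nothing in the character array distinguishes it from a genuinely empty string.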