Docs on string encoding
shoyer committed Oct 23, 2017
1 parent 6db163f commit e854bdc
Showing 2 changed files with 42 additions and 2 deletions.
37 changes: 35 additions & 2 deletions doc/io.rst
@@ -266,6 +266,39 @@ converting ``NaN`` to ``-9999``, we would use
``encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}``.
Compression and decompression with such discretization is extremely fast.
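
The arithmetic behind such an encoding can be sketched in plain Python. This is only an illustration of the scale-and-fill idea using the values from the example above, not xarray's implementation; the helper names ``pack`` and ``unpack`` are hypothetical, and the ``int16`` range is not enforced here.

```python
import math

SCALE_FACTOR = 0.1   # taken from the encoding example above
FILL_VALUE = -9999   # taken from the encoding example above


def pack(value):
    """Map a float to the integer that would be stored (hypothetical helper)."""
    if math.isnan(value):
        return FILL_VALUE
    return round(value / SCALE_FACTOR)


def unpack(stored):
    """Map a stored integer back to a float (hypothetical helper)."""
    if stored == FILL_VALUE:
        return math.nan
    return stored * SCALE_FACTOR


values = [12.34, -5.0, math.nan]
packed = [pack(v) for v in values]
unpacked = [unpack(p) for p in packed]
```

Precision finer than ``scale_factor`` is lost on the way in, which is the price of the smaller on-disk dtype.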

.. _io.string-encoding:

String encoding
...............

xarray can write unicode strings to netCDF files in two ways:

- As variable-length strings. This is only supported on netCDF4 (HDF5) files.
- By encoding strings into bytes, and writing the encoded bytes as a character
  array. The default encoding is UTF-8.
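
To make the second option concrete, here is a minimal pure-Python sketch of what "encode, then store as a character array" means. It is an illustration only: the helpers ``to_char_array`` and ``from_char_array`` are hypothetical, NUL padding is an assumption made for the sketch, and xarray itself works on NumPy arrays rather than nested lists.

```python
def to_char_array(strings, encoding="utf-8"):
    """Encode each string and split it into rows of single bytes."""
    encoded = [s.encode(encoding) for s in strings]
    width = max((len(b) for b in encoded), default=0)
    rows = []
    for b in encoded:
        row = [b[i:i + 1] for i in range(len(b))]
        # Pad with NUL bytes so every row has the same width.
        row += [b"\x00"] * (width - len(b))
        rows.append(row)
    return rows


def from_char_array(rows, encoding="utf-8"):
    """Join each row, drop the padding, and decode back to text."""
    return [b"".join(row).rstrip(b"\x00").decode(encoding) for row in rows]


names = ["abc", "é"]
chars = to_char_array(names)
round_tripped = from_char_array(chars)
```

Note that the row width is measured in encoded bytes, not characters: ``"é"`` occupies two bytes in UTF-8.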

By default, xarray uses variable-length strings for compatible files and falls
back to encoded character arrays otherwise. Character arrays can be selected
even for netCDF4 files by setting the ``dtype`` field in ``encoding`` to ``S1``
(corresponding to NumPy's single-character bytes dtype).
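
As a rough usage sketch, the following writes a small dataset with a character-array encoding and reads it back. The variable name ``name``, the dimension ``x``, and the file name are made up for the example, and it assumes a netCDF backend (such as netCDF4 or scipy) is installed.

```python
import os
import tempfile

import xarray as xr

# Hypothetical dataset with one unicode string variable called "name".
ds = xr.Dataset({"name": ("x", ["alpha", "βeta"])})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "strings.nc")
    # Request a character array instead of variable-length strings.
    ds.to_netcdf(path, encoding={"name": {"dtype": "S1"}})
    # Load the values while the file is still open.
    with xr.open_dataset(path) as reopened:
        restored = reopened["name"].values.tolist()
```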

If character arrays are used, the string encoding that was used is stored on
disk in the ``_Encoding`` attribute, which matches an ad-hoc convention
`adopted by the netCDF4-Python library <https://github.com/Unidata/netcdf4-python/pull/665>`_.
At the time of this writing (October 2017), a standard convention for indicating
string encoding for character arrays in netCDF files was
`still under discussion <https://github.com/Unidata/netcdf-c/issues/402>`_.
If you need to deviate from UTF-8, you can technically use
`any string encoding recognized by Python <https://docs.python.org/3/library/codecs.html#standard-encodings>`_
by setting the ``_Encoding`` field in ``encoding``. But
`we don't recommend it <http://utf8everywhere.org/>`_.
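
The following standard-library sketch shows why the encoding has to be recorded alongside the data, and why straying from UTF-8 is fragile. The choice of ``latin-1`` and the variable name ``name`` are assumptions for illustration; the final dictionary only shows the shape of the setting described above and is not passed to xarray here.

```python
import codecs

text = "café"
stored_codec = "latin-1"  # hypothetical non-UTF-8 choice

# codecs.lookup raises LookupError for names Python does not recognize.
codecs.lookup(stored_codec)

raw = text.encode(stored_codec)

# A reader that assumes UTF-8 cannot decode these bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# A reader that honours the recorded codec recovers the text.
recovered = raw.decode(stored_codec)

# Shape of the corresponding setting for a hypothetical variable "name".
encoding = {"name": {"dtype": "S1", "_Encoding": stored_codec}}
```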

.. warning::

   By default, missing values in bytes or unicode string arrays (represented by
   ``NaN`` in xarray) are currently written to disk as empty strings ``''``. Thus
   missing values will not be restored when data is loaded from disk.
   This behavior is likely to change in the future (:issue:`1647`).
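
The information loss described in the warning can be mimicked in a few lines of plain Python. This is a sketch of the described behavior only, not xarray's code; ``is_missing`` is a hypothetical helper.

```python
import math


def is_missing(value):
    """True for the NaN placeholder used to mark a missing string."""
    return isinstance(value, float) and math.isnan(value)


original = ["north", float("nan"), ""]

# Writing: missing values become empty strings.
on_disk = ["" if is_missing(v) else v for v in original]

# Reading: an empty string that was missing is indistinguishable
# from one that was genuinely empty.
restored = list(on_disk)
```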

Chunk based compression
.......................
@@ -390,7 +423,7 @@ over the network until we look at particular values:

Some servers require authentication before we can access the data. For this
purpose we can explicitly create a :py:class:`~xarray.backends.PydapDataStore`
and pass in a `Requests`__ session object. For example, for
HTTP Basic authentication::

import xarray as xr
@@ -403,7 +436,7 @@ HTTP Basic authentication::
session=session)
ds = xr.open_dataset(store)

`Pydap's cas module`__ has functions that generate custom sessions for
servers that use CAS single sign-on. For example, to connect to servers
that require NASA's URS authentication::

7 changes: 7 additions & 0 deletions doc/whats-new.rst
@@ -73,6 +73,13 @@ Breaking changes
produce a warning encouraging users to adopt the new syntax.
By `Daniel Rothenberg <https://github.com/darothen>`_.

- Unicode strings (``str`` on Python 3) are now round-tripped successfully even
when written as character arrays (e.g., as netCDF3 files or when using
``engine='scipy'``) (:issue:`1638`). This is controlled by the ``_Encoding``
attribute convention, which is also understood directly by the netCDF4-Python
interface. See :ref:`io.string-encoding` for full details.
By `Stephan Hoyer <https://github.com/shoyer>`_.

- ``repr`` and the Jupyter Notebook won't automatically compute dask variables.
Datasets loaded with ``open_dataset`` won't automatically read coords from
disk when calling ``repr`` (:issue:`1522`).