
netCDF4: support byte strings as attribute values #7186

Closed
krihabu opened this issue Oct 19, 2022 · 6 comments · Fixed by #9407

Comments

@krihabu

krihabu commented Oct 19, 2022

What is your issue?

When I have a string attribute containing special characters such as '°' or German umlauts (Ä, Ü, etc.), it gets written to the file as type NC_STRING. String attributes without any special characters are saved as NC_CHAR.

This leads to problems when I subsequently want to open this file with NetCDF-Fortran, because it does not fully support NC_STRING.

So my question is:
Is there a way to force xarray to write the string attribute as NC_CHAR?

Example

import numpy as np
import xarray as xr

data = np.ones([12, 10])
ds = xr.Dataset({"data": (["x", "y"], data)}, coords={"x": np.arange(12), "y": np.arange(10)})
ds["x"].attrs["first_str"] = "foo"
ds["x"].attrs["second_str"] = "bar°"
ds["x"].attrs["third_str"] = "hää"
ds.to_netcdf("testds.nc")

The output of ncdump -h shows the different data types of the second and third attributes (a screenshot of the output was attached to the original issue).
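In the CDL header printed by ncdump, NC_STRING attributes carry a leading string keyword while NC_CHAR attributes have no type prefix, so the attribute lines for the example above would look roughly like this (an illustration only; the surrounding header details depend on the xarray and netCDF versions):

		x:first_str = "foo" ;
		string x:second_str = "bar°" ;
		string x:third_str = "hää" ;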

@krihabu krihabu added the needs triage label Oct 19, 2022
@tovogt
Contributor

tovogt commented Oct 22, 2022

The reason for this behavior is that the netcdf4 Python package automatically determines the attribute type (NC_CHAR or NC_STRING) by attempting a conversion to ASCII: Unidata/netcdf4-python#529. However, if the value is a byte string, no conversion is attempted, so the easiest solution would be to manually encode the value as UTF-8 and pass the byte string to netcdf4. Unfortunately, xarray doesn't support byte strings as attribute values, even though this is a valid data type for the netcdf4 engine:

valid_types = (str, Number, np.ndarray, np.number, list, tuple)

In the long term, I would suggest adding bytes as a supported type to the list above on xarray's side.
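To illustrate the type-selection behavior, here is a minimal sketch that uses the netcdf4 package directly (the file and attribute names are just examples):

import netCDF4

with netCDF4.Dataset("attr_types.nc", "w") as nc:
    nc.setncattr("ascii_str", "foo")  # plain str, ASCII conversion succeeds -> NC_CHAR
    nc.setncattr("non_ascii_str", "bar°")  # plain str, ASCII conversion fails -> NC_STRING
    nc.setncattr("utf8_bytes", "bar°".encode("utf-8"))  # bytes, no conversion attempted -> NC_CHAR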

A quick workaround for you might be to encode the string as UTF-8 and convert it to a numpy array, since xarray accepts numpy arrays as attribute values and netcdf4 automatically extracts the data if the array contains only a single item:

ds["x"].attrs["third_str"] = np.array("hää".encode("utf-8"))

@krihabu
Author

krihabu commented Oct 24, 2022

Thank you @tovogt! The workaround fits my needs for now.

Nevertheless, I would leave this issue open as a note for @tovogt 's suggestion to add bytes as a supported type for attributes.

@dcherian dcherian added the topic-backends label and removed the needs triage label Jan 17, 2023
@dcherian dcherian changed the title from "Force the writing of string attribute as NC_CHAR (because of netcdf-fortran incompatibility with NC_STRING)" to "netCDF4: support byte strings as attribute values" Jan 17, 2023
@hollymandel
Contributor

I plan to implement this in the next couple days.

@dcherian
Contributor

That would be much appreciated. Thanks!

@hollymandel
Contributor

hollymandel commented Aug 27, 2024

A comment about the proposed solution. Allowing bytes as a valid_type for attributes will allow not just encoded strings but arbitrary binary data to be saved as attributes (after conversion to strings). I'm not sure to what extent this is inherently problematic or unsafe, but a concrete issue is that the h5netcdf engine does not accept null characters. (You need to add bytes to valid_types to run the following example.)

import numpy as np
import xarray as xr

all_bytes = bytes(range(256))
good_bytes = "bar°".encode("UTF-8")
bad_byte = b'\x00'
not_bytes = "hää"

data = np.ones([1])
ds = xr.Dataset({"data": (["x"], data)}, coords={"x": np.arange(1)})
# ds["x"].attrs["first_str"] = all_bytes
ds["x"].attrs["second_str"] = bad_byte
ds["x"].attrs["third_str"] = good_bytes
ds["x"].attrs["fourth_str"] = not_bytes

# ds.to_netcdf("testds.nc", engine = "netcdf4")
# ds.to_netcdf("testds.nc", engine = "scipy")
ds.to_netcdf("testds.nc", engine = "h5netcdf")

!ncdump -h testds.nc

I propose adding a check to check_attr that, if an attribute has type bytes, verifies it is a UTF-8 encoded string free of null characters:

    def check_attr(name, value, valid_types):
        ...

        if isinstance(value, bytes):
            # Byte strings must be valid UTF-8, i.e. encoded text rather than arbitrary binary data.
            try:
                value.decode("utf-8")
            except UnicodeDecodeError as e:
                raise ValueError(
                    f"Invalid value provided for attribute {name!r}: {value!r}. "
                    "Only binary data derived from UTF-8 encoded strings is allowed."
                ) from e

            # The h5netcdf engine does not accept attributes containing null characters.
            if b"\x00" in value:
                raise ValueError(
                    f"Invalid value provided for attribute {name!r}: {value!r}. "
                    "Null characters are not permitted."
                )

@dcherian
Contributor

If it's only unsupported by h5netcdf, it would be good to specialize the check to just that engine.
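For example, the null-character check could be gated on the selected engine (a rough sketch only; the engine parameter and how it would be threaded through to check_attr are hypothetical, not the actual xarray internals):

    def check_attr(name, value, valid_types, engine=None):
        ...

        # Only h5netcdf rejects embedded null characters, so restrict the check to that engine.
        if isinstance(value, bytes) and engine == "h5netcdf" and b"\x00" in value:
            raise ValueError(
                f"Invalid value provided for attribute {name!r}: {value!r}. "
                "Null characters are not permitted with engine='h5netcdf'."
            )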
