
netCDF4: support byte strings as attribute values #7186

Closed
krihabu opened this issue Oct 19, 2022 · 6 comments · Fixed by #9407

Comments

@krihabu

krihabu commented Oct 19, 2022

What is your issue?

When I have a string attribute containing special characters such as '°' or German umlauts (Ä, Ü, etc.), it gets written to the file as type NC_STRING. String attributes without any special characters are saved as NC_CHAR.

This leads to problems when I subsequently want to open this file with NetCDF-Fortran, because it does not fully support NC_STRING.

So my question is:
Is there a way to force xarray to write the string attribute as NC_CHAR?

Example

import numpy as np
import xarray as xr

data = np.ones([12, 10])
ds = xr.Dataset({"data": (["x", "y"], data)}, coords={"x": np.arange(12), "y": np.arange(10)})
ds["x"].attrs["first_str"] = "foo"
ds["x"].attrs["second_str"] = "bar°"
ds["x"].attrs["third_str"] = "hää"
ds.to_netcdf("testds.nc")

The output of ncdump -h shows the different data types of the second and third attributes (a screenshot of the output was attached to the original issue).
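In the CDL header printed by ncdump, NC_STRING attributes carry a leading string keyword while NC_CHAR attributes have no type prefix, so the attribute lines for the example above would look roughly like this (an illustration only; the surrounding header details depend on the xarray and netCDF versions):

		x:first_str = "foo" ;
		string x:second_str = "bar°" ;
		string x:third_str = "hää" ;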

@krihabu krihabu added the needs triage label Oct 19, 2022
@tovogt
Contributor

tovogt commented Oct 22, 2022

The reason for this behavior is that the netcdf4 Python package automatically determines the attribute type (NC_CHAR or NC_STRING) by attempting a conversion to ASCII: Unidata/netcdf4-python#529. However, if the value is a byte string, no conversion is attempted, so the easiest solution would be to manually encode the value as UTF-8 and pass the byte string to netcdf4. Unfortunately, xarray doesn't support byte strings as attribute values, even though this is a valid data type for the netcdf4 engine:

valid_types = (str, Number, np.ndarray, np.number, list, tuple)

In the long term, I would suggest adding bytes as a supported type to the list above on xarray's side.
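To illustrate the type-selection behavior, here is a minimal sketch that uses the netcdf4 package directly (the file and attribute names are just examples):

import netCDF4

with netCDF4.Dataset("attr_types.nc", "w") as nc:
    nc.setncattr("ascii_str", "foo")  # plain str, ASCII conversion succeeds -> NC_CHAR
    nc.setncattr("non_ascii_str", "bar°")  # plain str, ASCII conversion fails -> NC_STRING
    nc.setncattr("utf8_bytes", "bar°".encode("utf-8"))  # bytes, no conversion attempted -> NC_CHAR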

A quick workaround for you might be to encode the string as UTF-8 and convert it to a numpy array, since xarray accepts numpy arrays as attribute values and netcdf4 automatically extracts the data if the array contains only a single item:

ds["x"].attrs["third_str"] = np.array("hää".encode("utf-8"))

@krihabu
Author

krihabu commented Oct 24, 2022

Thank you @tovogt! The workaround fits my needs for now.

Nevertheless, I would leave this issue open as a note for @tovogt 's suggestion to add bytes as a supported type for attributes.

@dcherian dcherian added the topic-backends label and removed the needs triage label Jan 17, 2023
@dcherian dcherian changed the title from "Force the writing of string attribute as NC_CHAR (because of netcdf-fortran incompatibility with NC_STRING)" to "netCDF4: support byte strings as attribute values" Jan 17, 2023
@hollymandel
Contributor

I plan to implement this in the next couple days.

@dcherian
Contributor

That would be much appreciated. Thanks!

@hollymandel
Contributor

hollymandel commented Aug 27, 2024

A comment about the proposed solution. Allowing bytes as a valid_type for attributes will allow not just encoded strings but arbitrary binary data to be saved as attributes (after conversion to strings). I'm not sure to what extent this is inherently problematic or unsafe, but a concrete issue is that the h5netcdf engine does not accept null characters. (You need to add bytes to valid_types to run the following example.)

import numpy as np
import xarray as xr

all_bytes = bytes(range(256))
good_bytes = "bar°".encode("UTF-8")
bad_byte = b'\x00'
not_bytes = "hää"

data = np.ones([1])
ds = xr.Dataset({"data": (["x"], data)}, coords={"x": np.arange(1)})
# ds["x"].attrs["first_str"] = all_bytes
ds["x"].attrs["second_str"] = bad_byte
ds["x"].attrs["third_str"] = good_bytes
ds["x"].attrs["fourth_str"] = not_bytes

# ds.to_netcdf("testds.nc", engine = "netcdf4")
# ds.to_netcdf("testds.nc", engine = "scipy")
ds.to_netcdf("testds.nc", engine = "h5netcdf")

!ncdump -h testds.nc

I propose adding a check to check_attr that, if an attribute has type bytes, verifies it is a UTF-8 encoded string free of null characters:

    def check_attr(name, value, valid_types):
        ...

        if isinstance(value, bytes):
            # Byte strings must be valid UTF-8, i.e. encoded text rather than arbitrary binary data.
            try:
                value.decode("utf-8")
            except UnicodeDecodeError as e:
                raise ValueError(
                    f"Invalid value provided for attribute {name!r}: {value!r}. "
                    "Only binary data derived from UTF-8 encoded strings is allowed."
                ) from e

            # The h5netcdf engine does not accept attributes containing null characters.
            if b"\x00" in value:
                raise ValueError(
                    f"Invalid value provided for attribute {name!r}: {value!r}. "
                    "Null characters are not permitted."
                )

@dcherian
Contributor

If it's only unsupported by h5netcdf, it would be good to specialize the check to just that engine.
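For example, the null-character check could be gated on the selected engine (a rough sketch only; the engine parameter and how it would be threaded through to check_attr are hypothetical, not the actual xarray internals):

    def check_attr(name, value, valid_types, engine=None):
        ...

        # Only h5netcdf rejects embedded null characters, so restrict the check to that engine.
        if isinstance(value, bytes) and engine == "h5netcdf" and b"\x00" in value:
            raise ValueError(
                f"Invalid value provided for attribute {name!r}: {value!r}. "
                "Null characters are not permitted with engine='h5netcdf'."
            )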
