Combine `UnsignedIntegerCoder` and `CFMaskCoder` #9266

dcherian · 2024-07-22T17:31:06Z

What is your issue?

Longer discussion here: #9258 (comment) but fundamentally, it's not possible to handle the _Unsigned attribute and masking separately so we'll need to merge these.

@djhoese will you be able to send a PR for this please?

The text was updated successfully, but these errors were encountered:

djhoese · 2024-07-22T17:46:16Z

I will try my best, but no guarantees on timeliness. I'll mostly be using this as yet another source of procrastination on the work I'm supposed to be doing. Anyway...

I think the high-level summary of our discussion on my PR (#9258) and the PR that came before it (#9136) as far as changes needed are:

_FillValue as far as the user is concerned should always be the on-disk type. For the case where _Unsigned: "true" that means it will be signed; even in-memory. This means it will not always match the dtype of the data array it is attached/assigned to.
Absorb the UnsignedIntegerCoder into the CFMaskCoder (or some logical not-too-long rename) so that _FillValue can be decoded to unsigned type if necessary, used for masking, and then discarded in favor of CF compliance and retaining the on-disk type of the _FillValue attribute.
If it makes sense, add a serialization warning when in-memory _FillValues go through the encode process and they do not already match the on-disk dtype. For example, a uint8 DataArray with FillValue of 255, but is going to be saved to a NetCDF file as an int8 variable via _Unsigned: "true". 255 can be converted to a signed int8, but it will result in a warning as -1 should have been used instead.

@dcherian you seemed to know that xarray can't load integer arrays as integers and instead produces floats. Is this by design or something that is too complicated to fix?

keewis · 2024-07-22T18:12:34Z

you seemed to know that xarray can't load integer arrays as integers and instead produces floats. Is this by design or something that is too complicated to fix?

The reason for this is that there is no representation for masked arrays with dtypes other than complex and float (and the new string dtype) that support masked values – and complex and float reuse nan for this, which actually means something else. Obviously that also means that if you don't need to apply _FillValue the conversion of other dtypes to float does not happen.

The logic is here:

xarray/xarray/core/dtypes.py

Lines 49 to 88 in ad35a10

    
           def maybe_promote(dtype: np.dtype) -> tuple[np.dtype, Any]: 
        
               """Simpler equivalent of pandas.core.common._maybe_promote 
        
               Parameters 
        
               ---------- 
        
               dtype : np.dtype 
        
               Returns 
        
               ------- 
        
               dtype : Promoted dtype that can hold missing values. 
        
               fill_value : Valid missing value for the promoted dtype. 
        
               """ 
        
               # N.B. these casting rules should match pandas 
        
               dtype_: np.typing.DTypeLike 
        
               fill_value: Any 
        
               if isdtype(dtype, "real floating"): 
        
                   dtype_ = dtype 
        
                   fill_value = np.nan 
        
               elif isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.timedelta64): 
        
                   # See https://github.com/numpy/numpy/issues/10685 
        
                   # np.timedelta64 is a subclass of np.integer 
        
                   # Check np.timedelta64 before np.integer 
        
                   fill_value = np.timedelta64("NaT") 
        
                   dtype_ = dtype 
        
               elif isdtype(dtype, "integral"): 
        
                   dtype_ = np.float32 if dtype.itemsize <= 2 else np.float64 
        
                   fill_value = np.nan 
        
               elif isdtype(dtype, "complex floating"): 
        
                   dtype_ = dtype 
        
                   fill_value = np.nan + np.nan * 1j 
        
               elif isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.datetime64): 
        
                   dtype_ = dtype 
        
                   fill_value = np.datetime64("NaT") 
        
               else: 
        
                   dtype_ = object 
        
                   fill_value = np.nan 
        
               dtype_out = np.dtype(dtype_) 
        
               fill_value = dtype_out.type(fill_value) 
        
               return dtype_out, fill_value

To fix that we need one of (or both):

a maskedarray implementation that behaves like a normal array (for example marray)
truly nullable dtypes

Both of these are sufficiently hard to get right that nothing other than the pandas extension dtypes exist (that I'm aware of). To give you an idea of how hard this is, there are several deferred NEPs that attempt to resolve this (NEPs 12, 24, 25, 26).

djhoese · 2024-07-22T18:16:54Z

Well it seems obvious when you say it like that 😉

Ok so I was focused on the encoding/decoding of dtypes, but it is more about the masking and operations that need to know the validity of a value. So while I might know that my data quality flag that is uint8 where a value of 1 means "bad"/mask, that doesn't mean that any numpy operation will know that even though I probably wouldn't use the variable in normal arithmetic operations anyway.

djhoese · 2024-07-24T16:17:51Z

FYI I'm almost done with this at least as a first draft. I'll need to clean it up a lot as right now it is mostly a copy/paste job. Right now I'm just trying to get tests to pass and make sense. The hardest part is figuring out when things should be in .encoding and when they should be in .attrs, but I'm almost there.

dcherian · 2024-08-26T13:39:36Z

Closed by #9274

dcherian added the topic-CF conventions label Jul 22, 2024

djhoese mentioned this issue Jul 24, 2024

Combine UnsignedIntegerCoder and CFMaskCoder #9274

Merged

4 tasks

dcherian closed this as completed Aug 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combine `UnsignedIntegerCoder` and `CFMaskCoder` #9266

Combine `UnsignedIntegerCoder` and `CFMaskCoder` #9266

dcherian commented Jul 22, 2024

djhoese commented Jul 22, 2024

keewis commented Jul 22, 2024 •

edited

Loading

djhoese commented Jul 22, 2024

djhoese commented Jul 24, 2024

dcherian commented Aug 26, 2024

Combine UnsignedIntegerCoder and CFMaskCoder #9266

Combine UnsignedIntegerCoder and CFMaskCoder #9266

Comments

dcherian commented Jul 22, 2024

What is your issue?

djhoese commented Jul 22, 2024

keewis commented Jul 22, 2024 • edited Loading

djhoese commented Jul 22, 2024

djhoese commented Jul 24, 2024

dcherian commented Aug 26, 2024

Combine `UnsignedIntegerCoder` and `CFMaskCoder` #9266

Combine `UnsignedIntegerCoder` and `CFMaskCoder` #9266

keewis commented Jul 22, 2024 •

edited

Loading