`iris.cube.Cube.lazy_data` method results in wrong chunk array type for masked arrays #5800

bouweandela · 2024-03-04T16:18:12Z

🐛 Bug Report

How To Reproduce

Steps to reproduce the behaviour:

Run the following code

import numpy as np
import iris.cube

In [1]: iris.cube.Cube(np.ma.array([1, 2], mask=[True, False])).lazy_data()
Out[1]: dask.array<array, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>

Note that chunktype is numpy.ndarray, while the data is actually a masked array. This causes problems when inspecting the Dask array with dask.array.utils.meta_from_array, because it returns the wrong chunk type shown above.
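The mismatch can be reproduced with Dask alone. The sketch below is not the Iris code path; it assumes that forcing `meta=np.ndarray` on a masked input is a fair stand-in for what happens inside the cube, and the variable names are made up for illustration.

```python
import dask.array as da
import numpy as np
from dask.array.utils import meta_from_array

masked = np.ma.array([1, 2], mask=[True, False])

# Assumption: an explicit plain-ndarray meta mimics the behaviour reported above.
lazy_wrong = da.from_array(masked, meta=np.ndarray)

# An empty masked array as meta records the real chunk type instead.
lazy_right = da.from_array(
    masked, meta=np.ma.MaskedArray((), dtype=masked.dtype)
)

# meta_from_array reports whatever meta the Dask array carries,
# not the type of the data that will eventually be computed.
print(type(meta_from_array(lazy_wrong)))
print(type(meta_from_array(lazy_right)))
```

Code that decides between masked and unmasked handling from the meta will take the unmasked branch for `lazy_wrong`, even though the underlying data has a mask.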

Expected behaviour

import numpy as np
import iris.cube

In [1]: iris.cube.Cube(np.ma.array([1, 2], mask=[True, False])).lazy_data()
Out[1]: dask.array<array, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.MaskedArray>

Environment

OS & Version: 23.10
Iris Version: 3.9.0.dev15


bouweandela · 2024-03-05T10:19:41Z

This bug will affect the functions iris.util.broadcast_to_shape and iris.util.rolling_window and it will cause them to strip the mask from lazy arrays.

This issue was introduced in #4135 to speed up loading NetCDF files.

bouweandela · 2024-03-05T14:36:20Z

It looks like arrays loaded from NetCDF files are always masked arrays since netCDF4 1.4 (released almost 6 years ago):
https://github.com/Unidata/netcdf4-python/blob/c7c5f4cc9c00c2d06a196d211436d6a01c53dba6/Changelog#L271-L273
(at least by default, and I don't see any code to modify this default) so for those, it would make sense to just set the array type to numpy.ma.MaskedArray.
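As a rough illustration of that suggestion, the sketch below wraps a deferred, NetCDF-like source with a masked meta. `FakeNetCDFProxy` is a hypothetical stand-in written for this example, not an Iris or netCDF4 class; it only mimics the "always returns masked arrays" behaviour described above.

```python
import dask.array as da
import numpy as np


class FakeNetCDFProxy:
    """Hypothetical stand-in for a deferred NetCDF variable."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self.shape = self._values.shape
        self.dtype = self._values.dtype
        self.ndim = self._values.ndim

    def __getitem__(self, keys):
        # Mimic netCDF4 >= 1.4, which returns masked arrays by default.
        return np.ma.masked_array(self._values[keys])


proxy = FakeNetCDFProxy([1.0, 2.0, 3.0])

# The real type cannot be known without reading the data, so declare
# the meta as masked up front. asarray=False passes chunks through unchanged.
lazy = da.from_array(
    proxy,
    chunks=proxy.shape,
    asarray=False,
    meta=np.ma.MaskedArray((), dtype=proxy.dtype),
)

print(lazy)
print(type(lazy.compute()))
```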

trexfeathers · 2024-03-13T10:34:22Z

We're keen to get this fixed; it will be discussed during refinement of the two remaining Iris releases this year.

pp-mo · 2024-03-28T13:16:45Z

See notes on #5801
It seems like this would need addressing wherever we use da.from_array.
However, this is only called in iris._lazydata.as_lazy_data(). So we can certainly fix it there, since it only calls

One thing that still bothers me, though, is what da.stack does, as used here, which is called from our merge code.

From experiment, Dask gives a stacked "mixture" of unmasked and masked arrays a meta of "masked" type.
However, I find that computing portions of this stack gives an unmasked result array in some cases, rather like netCDF4 variables used to do.
For example:

# A stack formed of normal+masked arrays is masked-type
>>> lazy_nomask = da.from_array([1., 2, 3, 4], meta=np.ndarray)
>>> lazy_masked = da.from_array(
...     np.ma.masked_array([1., 2, 3, 4], [0, 0, 1, 1]),
...     meta=np.ma.MaskedArray((), dtype=float)
... )
>>> combined = da.stack([lazy_nomask, lazy_masked])
>>> print('  ', combined)
   dask.array<stack, shape=(2, 4), dtype=float64, chunksize=(1, 4), chunktype=numpy.MaskedArray>
>>> 

# Sections are of different type depending on the source array they derive from
>>> sec1 = combined[0, 1:3]
>>> sec2 = combined[1, 1:3]
>>> print('  section1 [0, 1:3] =', sec1)
  section1 [0, 1:3] = dask.array<getitem, shape=(2,), dtype=float64, chunksize=(2,), chunktype=numpy.MaskedArray>
>>> print('  section2 [1, 1:3] =', sec2)
  section2 [1, 1:3] = dask.array<getitem, shape=(2,), dtype=float64, chunksize=(2,), chunktype=numpy.MaskedArray>
>>> print('  section1 compute=', repr(sec1.compute()), '  type=', type(sec1.compute()))
  section1 compute= array([2., 3.])   type= <class 'numpy.ndarray'>
>>> print('  section2 compute=', repr(sec2.compute()), '  type=', type(sec2.compute()))
  section2 compute= masked_array(data=[2.0, --],
             mask=[False,  True],
       fill_value=1e+20)   type= <class 'numpy.ma.core.MaskedArray'>

# "Mixed" sections are unified as masked-type
>>> print('[:, 2:3] compute :\n', repr(combined[:, 2:3].compute()))
[:, 2:3] compute :
 masked_array(
  data=[[3.0],
        [--]],
  mask=[[False],
        [ True]],
  fill_value=1e+20)

# An unmasked portion of masked origin is still masked-type
>>> print('[1, :2] compute :\n', repr(combined[1, :2].compute()))
[1, :2] compute :
 masked_array(data=[1.0, 2.0],
             mask=[False, False],
       fill_value=1e+20)

So, this is a potential problem, showing that the 'meta' of a lazy array cannot quite be "trusted" in terms of what it will return.
This is a bit like what happened when we tried to treat a numpy fill-value as metadata: we found that numpy itself was not consistent in preserving it.

@bouweandela what is your view on this? I assume it is still valuable to do our best to preserve a correct meta type?

bouweandela · 2024-04-02T14:14:40Z

> what is your view on this? I assume it is still valuable to do our best to preserve a correct meta type?

Yes, I think so. If the inconsistency is limited to dask.array.stack, we can just ensure that all input arrays are of the same array type wherever we use stack in the code. However, if this also happens with many other functions it may be more challenging.
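One way to make the inputs uniform before stacking is sketched below. `ensure_masked` is a hypothetical helper named for this example, not an existing Iris function; it relies on `dask.array.ma.masked_array` to convert every chunk to a masked array.

```python
import dask.array as da
import numpy as np
from dask.array.utils import meta_from_array


def ensure_masked(lazy):
    """Hypothetical helper: return a lazy array whose chunks are masked arrays."""
    if isinstance(meta_from_array(lazy), np.ma.MaskedArray):
        return lazy
    # Wrap each chunk in a masked array with no masked points.
    return da.ma.masked_array(lazy)


lazy_nomask = da.from_array(np.array([1.0, 2, 3, 4]), meta=np.ndarray)
lazy_masked = da.from_array(
    np.ma.masked_array([1.0, 2, 3, 4], [0, 0, 1, 1]),
    meta=np.ma.MaskedArray((), dtype=float),
)

# Normalise every input before stacking so all sections share one type.
combined = da.stack([ensure_masked(a) for a in (lazy_nomask, lazy_masked)])

print(type(combined[0, 1:3].compute()))
print(type(combined[1, 1:3].compute()))
```

With the inputs normalised this way, sections taken from either source array should compute to masked arrays, so the stacked array's meta matches what it returns.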

HGWright · 2024-04-10T09:46:20Z

@SciTools/peloton We are happy to move forward with this, understanding that if it breaks Iris in other places we can roll it back. This brings us more in line with Dask, which is a good thing.

bouweandela added the Type: Bug label Mar 4, 2024

bouweandela mentioned this issue Mar 4, 2024

Use the correct chunktype for Dask arrays #5801

Merged

scitools-ci bot added this to 🚴 Peloton Mar 13, 2024

trexfeathers added this to the Candidate for next release milestone Mar 13, 2024

bouweandela mentioned this issue Apr 17, 2024

Lazy iris.cube.Cube.rolling_window #5795

Merged

pp-mo closed this as completed in #5801 Apr 26, 2024

github-project-automation bot moved this to Done in 🚴 Peloton Apr 26, 2024

scitools-ci bot removed this from 🚴 Peloton May 25, 2024
