Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

iris.cube.Cube.lazy_data method results in wrong chunk array type for masked arrays #5800

Closed
bouweandela opened this issue Mar 4, 2024 · 6 comments · Fixed by #5801
Closed

Comments

@bouweandela
Copy link
Member

bouweandela commented Mar 4, 2024

🐛 Bug Report

How To Reproduce

Steps to reproduce the behaviour:

Run the following code

import numpy as np
import iris.cube

In [1]: iris.cube.Cube(np.ma.array([1, 2], mask=[True, False])).lazy_data()
Out[1]: dask.array<array, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>

Note that chunktype is numpy.ndarray, while the data type is actually a masked array. This causes problems when inspecting the content of the Dask array with dask.array.utils.meta_from_array, because that will return the wrong chunk type shown above.

Expected behaviour

import numpy as np
import iris.cube

In [1]: iris.cube.Cube(np.ma.array([1, 2], mask=[True, False])).lazy_data()
Out[1]: dask.array<array, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.MaskedArray>

Environment

  • OS & Version: 23.10
  • Iris Version: 3.9.0.dev15
@bouweandela
Copy link
Member Author

This bug will affect the functions iris.util.broadcast_to_shape and iris.util.rolling_window and it will cause them to strip the mask from lazy arrays.

This issue was introduced in #4135 to speed up loading NetCDF files.

@bouweandela
Copy link
Member Author

bouweandela commented Mar 5, 2024

It looks like arrays loaded from NetCDF files are always masked arrays since netCDF4 1.4 (released almost 6 years ago):
https://github.com/Unidata/netcdf4-python/blob/c7c5f4cc9c00c2d06a196d211436d6a01c53dba6/Changelog#L271-L273
(at least by default, and I don't see any code to modify this default) so for those, it would make sense to just set the array type to numpy.ma.MaskedArray.

@trexfeathers
Copy link
Contributor

We're keen to get this fixed, will be discussed during refinement of the two remaining Iris releases this year

@pp-mo
Copy link
Member

pp-mo commented Mar 28, 2024

See notes on #5801
It seems like this would need addressing wherever we use da.from_array.
However, this is only called in iris._lazydata.as_lazy_data(). So I we can certainly can fix it there, since it only calls

One thing that still bothers me, though, is what da.stack does, as used here, which is called from our merge code.

From experiment, Dask gives a stacked "mixture" of unmasked and masked arrays a meta of "masked" type.
However, I find that the result of computing portions of this results in an unmasked result array in some cases, rather like netCDF4 variables used to do.
For example :

# A stack formed of normal+masked arrays is masked-type
>>> lazy_nomask = da.from_array([1., 2, 3, 4], meta=np.ndarray)
>>> lazy_masked = da.from_array(
...     np.ma.masked_array([1., 2, 3, 4], [0, 0, 1, 1]),
...     meta=np.ma.MaskedArray((), dtype=float)
... )
>>> combined = da.stack([lazy_nomask, lazy_masked])
>>> print('  ', combined)
   dask.array<stack, shape=(2, 4), dtype=float64, chunksize=(1, 4), chunktype=numpy.MaskedArray>
>>> 

# Sections are of different type depending on the source array they derive from
>>> sec1 = combined[0, 1:3]
>>> sec2 = combined[1, 1:3]
>>> print('  section1 [0, 1:3] =', sec1)
  section1 [0, 1:3] = dask.array<getitem, shape=(2,), dtype=float64, chunksize=(2,), chunktype=numpy.MaskedArray>
>>> print('  section2 [1, 1:3] =', sec2)
  section2 [1, 1:3] = dask.array<getitem, shape=(2,), dtype=float64, chunksize=(2,), chunktype=numpy.MaskedArray>
>>> print('  section1 compute=', repr(sec1.compute()), '  type=', type(sec1.compute()))
  section1 compute= array([2., 3.])   type= <class 'numpy.ndarray'>
>>> print('  section2 compute=', repr(sec2.compute()), '  type=', type(sec2.compute()))
  section2 compute= masked_array(data=[2.0, --],
             mask=[False,  True],
       fill_value=1e+20)   type= <class 'numpy.ma.core.MaskedArray'>

# "Mixed" sections are unified as masked-type
>>> print('[:, 2:3] compute :\n', repr(combined[:, 2:3].compute()))
[:, 2:3] compute :
 masked_array(
  data=[[3.0],
        [--]],
  mask=[[False],
        [ True]],
  fill_value=1e+20)

# An unmasked portion of masked origin is still masked-type
>>> print('[1, :2] compute :\n', repr(combined[1, :2].compute()))
[1, :2] compute :
 masked_array(data=[1.0, 2.0],
             mask=[False, False],
       fill_value=1e+20)

So, this is a potential problem, showing that the 'meta' of a lazy array cannot quite be "trusted" in terms of what it will return.
This is a bit like what happened when we tried to treat a numpy fill-value as metadata : we found that numpy itself was not consistent in preserving it.

@bouweandela what is your view on this ? : I assume is it still valuable to do our best to preserve a correct meta type ?

@bouweandela
Copy link
Member Author

what is your view on this ? : I assume is it still valuable to do our best to preserve a correct meta type ?

Yes, I think so. If the inconsistency is limited to dask.array.stack, we can just ensure that all input arrays are of the same array type wherever we use stack in the code. However, if this also happens with many other functions it may be more challenging.

@HGWright
Copy link
Contributor

@SciTools/peloton We are happy to move forward with this, understanding that if it breaks Iris in other places we can roll it back. This brings us more inline with dask, which is a good thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants