
Support for Dask Arrays #1479

Open
rpmanser opened this issue Aug 24, 2020 · 27 comments
Labels
Type: Feature New functionality

Comments

@rpmanser
Contributor

rpmanser commented Aug 24, 2020

As pieces for compatibility between Dask, Xarray, and Pint are coming together (e.g., pydata/xarray#4221, pint#1129, pint#1151, and dask/dask#6393, among others that I likely missed), we should begin keeping track of progress on Dask Array support in MetPy. This might help address #996, or the previously mentioned PRs may already allow for it.

I started testing support for Dask Arrays in #1388, but that work should probably be separated from that PR. Most functions in the basic.py module just work with Pint-wrapped Dask Arrays without extra effort, but there are some exceptions: wind_direction, heat_index, apparent_temperature (fixing heat_index should fix this too), and pressure_to_height_std. The first two (and, by extension, apparent_temperature) fail because of ambiguous array sizes due to boolean indexing. For example, in wind_direction:

mask = wdir <= 0
if np.any(mask):
    wdir[mask] += 360. * units.deg

Boolean indexing with mask = wdir <= 0 produces an array of undetermined size, which leads to an axis of undetermined size in the Dask Array, reported as nan, e.g., (50, nan, 100). This is expected Dask behavior AFAIK. That should be pretty easy to fix with an np.where.
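
For illustration, a minimal sketch of the np.where approach on a bare dask array (units elided for brevity; this is not the actual MetPy patch):

import dask.array as da
import numpy as np

wdir = da.from_array(np.array([-45., 90., 350.]), chunks=2)

# np.where keeps the output shape fully determined, so Dask never has to
# build a boolean-indexed axis of unknown length.
wdir = np.where(wdir <= 0, wdir + 360., wdir)
print(wdir.compute())  # [315.  90. 350.]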

I haven't dug into why pressure_to_height_std doesn't work. I'll see if I can do that soon.

Another problem that is likely to come up, though I haven't dug into it yet, involves SciPy functions. I'm not sure how these will interact with Dask Arrays, but there may need to be some special handling for that. Specifically, I'm thinking of any vertical profile calculations (e.g., CAPE)... chunking is going to be very important for those.

Some discussion points:

  • To what extent will MetPy support Dask Arrays? Should it be limited to certain modules? I think metpy.calc is a good place to start.
  • Some refactoring will probably be needed for individual functions (e.g., wind_direction). Changes in performance to accommodate Dask support should be considered for each case.
  • Should we avoid explicit use of Dask functions so that Dask is not a dependency? Will this be possible? (A sketch after this list shows one mechanism that helps.)
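
For what it's worth, NumPy's __array_function__ protocol (NEP 18) already lets numpy-API code dispatch to Dask without importing it; a minimal demonstration (dask is imported here only to build the demo input, not by the library code under test):

import dask.array as da
import numpy as np

x = da.arange(6, chunks=3)

# np.where dispatches through __array_function__ to dask's implementation,
# so library code written purely against the numpy API needs no dask import.
y = np.where(x > 2, x, 0)
print(type(y))      # <class 'dask.array.core.Array'>
print(y.compute())  # [0 0 0 3 4 5]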

cc: @jthielen @tjwixtrom

@rpmanser rpmanser added the Type: Feature New functionality label Aug 24, 2020
@dopplershift
Member

Thanks for starting the issue. A few thoughts I have right now:

To what extent will MetPy support Dask Arrays? Should it be limited to certain modules? I think metpy.calc is a good place to start.

metpy.calc and metpy.interpolation seem like natural candidates. I don't think plots and io need specific support, but I'm open to being convinced otherwise.

Some refactoring will probably be needed for individual functions (e.g., wind_direction). Changes in performance to accommodate Dask support should be considered for each case.

I'm ok with making changes to support it. What we should do then is document in the developer's guide which subset of the numpy and/or xarray API we're allowed to use.

Should we avoid explicit use of Dask functions so that Dask is not a dependency? Will this be possible?

Everything is always up for discussion, but my starting point is that I don't think MetPy should ever have a dask dependence. It should work like we want to get to with CartoPy--if you use it, we'll play along (i.e. do operations that work with dask).

Another problem that is likely to come up, though I haven't dug into it yet, involves SciPy functions.

For this and the previous item, we may be able to hang back ever so slightly and surf the wave of changes coming in various NEPs and other things that are trying to help solve these and other issues (JAX/PyTorch/TensorFlow arrays) elsewhere.

I think this is going to require a very pragmatic and use-case-driven approach. I don't want to put off useful things forever (e.g. dask-powered gridded cape); I also don't want to compromise any of our simplicity (for our users) in the name of supporting dask. I also don't want to invest significant effort in certain areas if waiting will solve them for us.

Just the thoughts that came to mind for now. Don't consider any of it set in stone.

@jthielen
Collaborator

Great summary @rpmanser! I added dask/dask#6393 to your list of upstream PRs, which looks to be merged very soon.

Here are my thoughts on your discussion points at the moment (many of which overlap with @dopplershift's comment that came up right before I posted mine):

I think MetPy should take a progressive approach to Dask Array support, starting with the easy options (e.g., scalar thermo calculations) and gradually moving towards full (or near full) support in the long-term. I would love to eventually be able to do everything in MetPy with xarray > Pint > Dask data structures, but this would have a rather long time horizon and would intersect with issues in the broader community (rechunking; generic non-local calculations like interpolation, smoothing, derivatives, and integration; SciPy support; etc.).
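
For reference, that nesting order (once the upstream pieces above land) can be constructed like this; a sketch, assuming recent pint and xarray versions with duck-array support:

import dask.array as da
import xarray as xr
from metpy.units import units

q = units.Quantity(da.zeros((4, 4), chunks=2), 'm/s')  # pint wrapping dask
data = xr.DataArray(q)                                 # xarray wrapping pint

print(type(data.data))            # pint Quantity
print(type(data.data.magnitude))  # dask Array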

That being said, I don't think Dask should ever become a required dependency of MetPy. I think it should always be possible to avoid unconditional use of Dask functions, albeit at the cost of increased complexity. But, the trade-off (complexity vs. supporting both Dask and non-Dask use) is worth it in my mind.

@jthielen
Collaborator

Also, if I had to guess, your issue with pressure_to_height_std is due to type hierarchy issues. See if it is resolved with Dask master after dask/dask#6393 is merged.

@dopplershift
Member

Just to clarify something I said: Complexity is always an option in terms of MetPy implementation; we just don't want to present any more complexity to the users than is absolutely necessary. This is also why we need a good test suite of the options. 😉

@rpmanser
Contributor Author

Great, this is in line with what I was thinking. Avoiding Dask as a required dependency, starting simple and working towards more complex issues, and not compromising simplicity are generally good ideas. Once these PRs are in I'll start resolving issues in basic.py, then we can go from there.

@jthielen
Collaborator

jthielen commented Sep 2, 2020

With dask/dask#6393 being merged, it looks like just pydata/xarray#4221 is left for upstream PRs here (and that one is approved and awaiting merge). This is getting close!

Update: not even a half-hour later and pydata/xarray#4221 has been merged!

@raybellwaves
Contributor

I'm interested in something similar to the ideas proposed in this thread. I can move this out to a separate issue if desired.

I would like to drop in metpy.calc.wind_speed in place of xarray.ufuncs.sqrt in my workflow. I think this is in line with the ideas discussed here.

import xarray as xr
import xarray.ufuncs as xu
import metpy
import metpy.calc

ds = xr.open_dataset('https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p25deg/Best')
ds = ds.metpy.parse_cf()

u10 = ds['u-component_of_wind_height_above_ground'].sel(height_above_ground4=10, lat=50, lon=50)
v10 = ds['v-component_of_wind_height_above_ground'].sel(height_above_ground4=10, lat=50, lon=50)

# Current approach:
ws10 = xu.sqrt(u10**2 + v10**2)
# Desired drop-in replacement:
ws10 = metpy.calc.wind_speed(u10, v10)

@rpmanser
Contributor Author

@raybellwaves I ran your example and it does not seem that your data are represented as dask arrays. Were you intending for them to be? I'm not as familiar with thredds as I ought to be, but my understanding is that the data it downloads is loaded into memory. If that is the case, then dask arrays may not be helpful, since part of their purpose is to load data into memory only when .compute() is called.

Maybe someone more knowledgeable with thredds can help here...

@dopplershift
Member

Here it's using OPeNDAP, so it should be possible to only download data on demand. However, I'm not really clear on how to engage dask with a single dataset loaded with OPeNDAP. Or maybe it depends on the native chunking in the dataset?

@jthielen
Collaborator

I unfortunately don't have time to help investigate here right now (have you tried specifying chunks in the open_dataset call?), but chunks via OPeNDAP reminded me of this issue, which may be of interest: pydata/xarray#1440.
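
For concreteness, a sketch of that suggestion (untested against this endpoint, and the chunked dimension name is an assumption):

import xarray as xr

# Passing chunks makes xarray back the variables with dask arrays and
# defer the OPeNDAP reads until something forces computation.
ds = xr.open_dataset(
    'https://thredds.ucar.edu/thredds/dodsC/grib/NCEP/GFS/Global_0p25deg/Best',
    chunks={'time': 1},  # assumed dimension name; adjust to the dataset
)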

@raybellwaves
Contributor

but my understanding is that the data it downloads is loaded into memory.

Ah, thanks. My question is probably more suited to xarray-metpy integration rather than dask-metpy integration.

@rpmanser
Contributor Author

Despite discussion of using other libraries to speed up MetPy, I would still like to see dask supported. I think most of the points brought up here haven't changed. Since #1388 is an absolute beast, it may be best to set it aside and create tests only for dask arrays. I'm not sure what that would look like, so I'd like to know if anyone has thoughts on it. A simple approach might be to add a test against dask for each function, module by module, but that would bloat the test suite quite quickly. Is there a better solution?

@dopplershift
Member

@rpmanser Well, while we might not be looking to use Dask internally as a performance solution, we definitely do want to support its use for others who have gone to the effort of setting up their data to do so.

What I wonder is if we could do something like (completely untested code):

import dask.array as da
import numpy as np
import pytest

from metpy.testing import assert_almost_equal
from metpy.units import units

parameterize_array_types = pytest.mark.parametrize(
    'array_type', [np.array, da.asarray, np.ma.array])

@parameterize_array_types
def test_calc(array_type):
    # data, truth_data, and calc stand in for each test's actual fixtures
    test_input = units.Quantity(array_type(data), 'myunits')
    truth = units.Quantity(array_type(truth_data), 'newunits')

    result = calc(test_input)
    assert_almost_equal(result, truth)

I haven't thought through the challenges with that, but I think it would allow us to start testing things like Dask without copy-pasting a whole bunch of tests?

@rpmanser
Contributor Author

@dopplershift that seems pretty straightforward and looks like it will work. I'll give it a try on the basic module in calc soon, then we can go from there.

@lpilz
Contributor

lpilz commented Feb 15, 2022

So I am running into a related issue (or maybe even the same one?). When calculating the wind direction from two 3D dask arrays (U10, V10; shape (169, 185, 200)), I get an IndexError here:

wdir[mask] += units.Quantity(360., 'deg')

Incorrect shape ((169, 185, 200)) of integer indices for dimension with size 169

I am seeing that wdir[mask] looks like dask.array<getitem, shape=(nan,), dtype=float32, chunksize=(nan,), chunktype=numpy.ndarray> degree with shape (nan,) - so maybe this is the problem?

@dopplershift
Member

@lpilz The problem is that dask doesn't support the following in general:

a[a > 2] += 4

I haven't dug in to figure out why it doesn't or what it would take. I'm open to reworking our implementation of wind_direction, but I'm not sure how you efficiently do that operation without boolean masking.

@lpilz
Contributor

lpilz commented Feb 15, 2022

@dopplershift Thanks for explaining, that clarifies things. Could one maybe turn it on its head and do something like this, or is the problem that bool arrays don't work?

a += (a > 2) * 4
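
As a runnable sanity check of that idea (a minimal sketch on a bare dask array, not the MetPy implementation):

import dask.array as da
import numpy as np

a = da.from_array(np.array([1., 2., 3., 4.]), chunks=2)

# The boolean mask participates arithmetically (True casts to 1), so only
# the masked elements shift and the shape stays fully determined.
a = a + (a > 2) * 4
print(a.compute())  # [1. 2. 7. 8.]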

@dopplershift
Member

@lpilz Honestly, I have no idea what the problem is. If a dask user with a vested interest 😉 were to submit a PR with that change and it didn't break any tests (and fixed dask) then I'd be willing to merge it.

@sgdecker
Contributor

sgdecker commented Feb 1, 2023

Not sure if this should go here or in a new issue (or even if this is an issue), but with this code:

import dask.array as da
import metpy.calc as mpcalc
from metpy.units import units

p = da.asarray([1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]) * units('hPa')
T = 20 * units('degC')
Td = 18 * units('degC')
prof = mpcalc.parcel_profile(p, T, Td)

I get ominous warnings:

/home/decker/local/miniconda3/envs/dasktest/lib/python3.11/site-packages/dask/array/core.py:1711: FutureWarning: The `numpy.copyto` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
  warnings.warn(
/home/decker/local/miniconda3/envs/dasktest/lib/python3.11/site-packages/dask/array/core.py:1711: FutureWarning: The `numpy.place` function is not implemented by Dask array. You may want to use the da.map_blocks function or something similar to silence this warning. Your code may stop working in a future release.
  warnings.warn(
(the same numpy.copyto/numpy.place warning pair repeats twice more)

As far as I can tell, MetPy does not directly call the functions in question, but it would certainly be nice if this code did not stop working in a future release.

@dopplershift
Member

@sgdecker I think it's good to capture it here. This particular function is high on our list to make sure it works with Dask for our current funded work on MetPy.

Out of curiosity, what's the scale of the data you're using for your "real" problem? The example above with a single profile isn't really doing much with Dask.

@sgdecker
Contributor

sgdecker commented Feb 1, 2023

@dopplershift Yes, this is scaled down for example purposes. The real data is climate model output, so we are looking at something on the order of 10,000 grid points and thousands of time steps.

@dopplershift
Member

@sgdecker I'll be sure to reach out when we have something, then. That's literally one of the core workflows/use cases that we're trying to support with the funded work we have.

@rbavery

rbavery commented Aug 12, 2024

Hello, curious if there is an update on Dask support?

@dopplershift
Member

Nothing really new to report. I'm curious if you've tried using dask with MetPy and run into any calculations in particular that cause you problems?

@rbavery

rbavery commented Aug 16, 2024

We haven't started anything yet, but we might use MetPy to retrieve NEXRAD data, filter for specific variables, convert from polar to geographic coordinates, calculate variables with MetPy, and write the results to Zarr. It seems like one of the more user-friendly packages for reading Level 2 or Level 3 data.

@dopplershift
Member

You might want to consider xradar for that, at least for Level 2, since it reads NEXRAD Level 2 directly into xarray.
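
A possible starting point (a hedged sketch; the entry point follows xradar's documented Level 2 reader, so verify the name against your installed version):

import xradar as xd

# Open a NEXRAD Level 2 volume into an xarray DataTree, one sweep per group.
# The filename below is a placeholder.
dtree = xd.io.open_nexradlevel2_datatree('KTLX_example_V06')
print(dtree['sweep_0'])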

@rbavery

rbavery commented Aug 16, 2024

Thank you! Will take a look.
