Support for Dask Arrays #1479
Thanks for starting the issue. A few thoughts I have right now:
I'm OK with making changes to support this. What we should do then is document in the developer's guide which subset of the numpy and/or xarray API we're allowed to use.
Everything is always up for discussion, but my starting point is that I don't think MetPy should ever have a dask dependency. It should work like what we're aiming for with CartoPy: if you use it, we'll play along (i.e., do operations that work with dask).
For this and the previous item, we may be able to hang back ever so slightly and surf the wave of changes coming in various NEPs and other efforts that are trying to solve these and related issues (JAX/PyTorch/TensorFlow arrays) elsewhere. I think this is going to require a very pragmatic and use-case-driven approach. I don't want to put off useful things forever (e.g., dask-powered gridded CAPE); I also don't want to compromise any of our simplicity (for our users) in the name of supporting dask. I also don't want to invest significant effort in certain areas if waiting will solve them for us. Just the thoughts that came to mind for now. Don't consider any of it set in stone.
Great summary @rpmanser! I added dask/dask#6393 to your list of upstream PRs, which looks to be merged very soon. Here are my thoughts on your discussion points at the moment (many of which overlap with @dopplershift's comment that came up right before I posted mine): I think MetPy should take a progressive approach to Dask Array support, starting with the easy options (e.g., scalar thermo calculations) and gradually moving towards full (or near full) support in the long-term. I would love to eventually be able to do everything in MetPy with xarray > Pint > Dask data structures, but this would have a rather long time horizon and would intersect with issues in the broader community (rechunking; generic non-local calculations like interpolation, smoothing, derivatives, and integration; SciPy support; etc.). That being said, I don't think Dask should ever become a required dependency of MetPy. I think it should always be possible to avoid unconditional use of Dask functions, albeit at the cost of increased complexity. But, the trade-off (complexity vs. supporting both Dask and non-Dask use) is worth it in my mind.
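For reference, a minimal sketch of what that xarray > Pint > Dask stack looks like, assuming versions of xarray, pint, and dask recent enough to include the compatibility work referenced above (the variable names, units, and sizes are arbitrary, not from this thread):

```python
import dask.array as da
import numpy as np
import xarray as xr
from metpy.units import units

# A DataArray whose .data is a Pint Quantity wrapping a chunked dask array.
quantity = units.Quantity(da.from_array(np.zeros((4, 5)), chunks=(2, 5)), 'kelvin')
temperature = xr.DataArray(quantity, dims=('y', 'x'), name='temperature')

print(type(temperature.data))            # the Pint Quantity layer
print(type(temperature.data.magnitude))  # the dask array underneath
```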
Also, if I had to guess, your issue with
Just to clarify something I said: Complexity is always an option in terms of MetPy implementation; we just don't want to present any more complexity to the users than is absolutely necessary. This is also why we need a good test suite of the options. 😉
Great, this is in line with what I was thinking. Avoiding Dask as a required dependency, starting simple and working towards more complex issues, and not compromising simplicity are generally good ideas. Once these PRs are in I'll start resolving issues in
With dask/dask#6393 being merged, it looks like just pydata/xarray#4221 is left for upstream PRs here (and that one is approved and awaiting merge). This is getting close! Update: not even a half-hour later and pydata/xarray#4221 has been merged!
I'm interested in something similar to the ideas proposed in the thread. I can move this out to a separate issue if desired. I would like to drop in
@raybellwaves I ran your example and it does not seem that your data are represented as dask arrays. Were you intending for them to be? I'm not as familiar with thredds as I ought to be, but my understanding is that the data it downloads is loaded into memory. If that is the case, then dask arrays may not be helpful, since part of their purpose is to load data into memory only when needed. Maybe someone more knowledgeable with thredds can help here...
Here it's using OPeNDAP, so it should be possible to only download data on demand. However, I'm not really clear on how to engage dask with a single dataset loaded with OPeNDAP. Or maybe it depends on the native chunking in the dataset?
I unfortunately don't have time to help investigate here right now (have you tried specifying
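For what it's worth, a minimal sketch of one way to engage dask with an OPeNDAP dataset: passing chunks= to xr.open_dataset makes xarray back the variables with dask arrays rather than loading them eagerly. The URL, variable name, and chunk sizes below are placeholders, not taken from this thread:

```python
import xarray as xr

# Hypothetical OPeNDAP endpoint and chunking; adjust to the real dataset.
url = 'https://thredds.example.edu/thredds/dodsC/some/dataset'
ds = xr.open_dataset(url, chunks={'time': 1})

# Each variable is now backed by a dask array and only downloaded on compute.
print(ds['Temperature_isobaric'].data)
```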
Ah thanks. My question is probably more suited to xarray-metpy integration rather than dask-metpy integration.
Despite discussion of using other libraries to speed up MetPy, I would still like to see dask supported. I think most of the points brought up here haven't changed. Since the idea behind #1388 is an absolute beast, it may be best to set that aside and create tests only for dask arrays. I'm not sure what that would look like, so I'd like to know if anyone has thoughts on it. A simple approach might be to just add a test against dask for each function, module by module, but that will bloat up the test suite quite quickly. Is there a better solution?
@rpmanser Well, while we might not be looking to use Dask internally as a performance solution, we definitely do want to support its use for others who have gone to the effort of setting up their data to do so. What I wonder is if we could do something like (completely untested code):

```python
import dask.array as da
import numpy as np
import pytest
from metpy.testing import assert_almost_equal
from metpy.units import units

# data, truth_data, and calc stand in for a real test's inputs and function.
parameterize_array_types = pytest.mark.parametrize('array_type', [np.array, da.asarray, np.ma.array])

@parameterize_array_types
def test_calc(array_type):
    test_input = units.Quantity(array_type(data), 'myunits')
    truth = units.Quantity(array_type(truth_data), 'newunits')
    result = calc(test_input)
    assert_almost_equal(result, truth)
```

I haven't thought through what the challenges with that would be, but I think it would allow us to start testing things like Dask without copy-pasting a whole bunch of tests?
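One refinement worth noting (my sketch, not something proposed in the thread): since dask is an optional dependency, the dask entry in that parametrization could skip cleanly when dask is not installed.

```python
import numpy as np
import pytest

try:
    import dask.array as da
    dask_case = [pytest.param(da.asarray, id='dask')]
except ImportError:
    # Keep the test ID but mark it skipped when dask is unavailable.
    dask_case = [pytest.param(None, id='dask',
                              marks=pytest.mark.skip(reason='dask not installed'))]

parameterize_array_types = pytest.mark.parametrize(
    'array_type', [np.asarray, np.ma.asarray] + dask_case)
```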
@dopplershift that seems pretty straightforward and looks like it will work. I'll give that a try on the basic module in calc soon, then we can go from there.
So I am running into a related issue (or maybe even the same?). When calculating the wind direction from two 3D dask arrays (line 104 in 026778c), I am seeing that this line fails for dask arrays.
@lpilz The problem is that dask doesn't support the following in general:

```python
a[a > 2] += 4
```

I haven't dug in to figure out why it doesn't or what it would take. I'm open to reworking our implementation of wind_direction.
@dopplershift Thanks for explaining, that clarifies things. Could one maybe turn it on its head and do something like this, or is the problem that bool arrays don't work?

```python
a += (a > 2) * 4
```
@lpilz Honestly, I have no idea what the problem is. If a dask user with a vested interest 😉 were to submit a PR with that change and it didn't break any tests (and fixed dask), then I'd be willing to merge it.
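A quick sketch (not from the MetPy codebase; the values are arbitrary) contrasting the two patterns, which suggests the boolean-multiply form sidesteps item assignment entirely:

```python
import dask.array as da
import numpy as np

a = da.from_array(np.array([1, 3, 5]), chunks=2)

# The pattern dask reportedly struggles with:
#     a[a > 2] += 4
# The proposed rework touches every element but needs no indexing or item
# assignment, so chunk sizes stay known:
a = a + (a > 2) * 4
print(a.compute())  # [1 7 9]
```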
Not sure if this should go here or in a new issue (or even if this is an issue), but with this code:

```python
import dask.array as da
import metpy.calc as mpcalc
from metpy.units import units

p = da.asarray([1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]) * units('hPa')
T = 20 * units('degC')
Td = 18 * units('degC')
prof = mpcalc.parcel_profile(p, T, Td)
```

I get ominous warnings:
As far as I can tell, MetPy does not directly call the functions in question, but it would certainly be nice if this code did not stop working in a future release.
@sgdecker I think it's good to capture it here. This particular function is high on our list to make sure it works with Dask for our current funded work on MetPy. Out of curiosity, what's the scale of the data you're using for your "real" problem? The example above with a single profile isn't really doing much with Dask.
@dopplershift Yes, this is scaled down for example purposes. The real data is climate model output, so we are looking at something on the order of 10,000 gridpoints and thousands of time steps.
@sgdecker I'll be sure to reach out when we have something, then. That's literally one of the core workflows/use cases that we're trying to support with the funded work we have.
Hello, curious if there is an update on Dask support?
Nothing really new to report. I'm curious if you've tried using dask with MetPy and run into any calculations in particular that cause you problems?
We haven't started anything yet, but we might be using MetPy to retrieve NEXRAD data, filter for specific variables, convert from polar to geographic coordinates, calculate variables with MetPy, and write the results to Zarr. It seems like one of the more user-friendly packages for reading Level 2 or Level 3 data.
You might want to consider xradar for that, at least for Level 2, since that reads NEXRAD Level 2 directly into xarray.
Thank you! Will take a look.
As pieces for compatibility between Dask, Xarray, and Pint are coming together (e.g., pydata/xarray#4221, pint#1129, pint#1151, dask/dask#6393, among others that I likely missed), we should begin keeping track of progress for Dask Array support in MetPy. This might help address #996, or the previously mentioned PRs may already allow for it.
I started testing support for Dask Arrays in #1388, but that should probably be separated from that PR. Most functions in the basic.py module just work with Pint-wrapped Dask Arrays without extra effort, but there are some exceptions. These include wind_direction, heat_index, apparent_temperature (working on heat_index should fix this), and pressure_to_height_std. The first two/three are a result of ambiguous array sizes due to boolean indexing. For example, in wind_direction (MetPy/src/metpy/calc/basic.py, lines 101 to 103 in 071c700):

```python
mask = wdir <= 0
```

Indexing with this boolean mask produces an array of undetermined size, which leads to an axis of undetermined size in the Dask Array, which is given a nan value, e.g., (50, nan, 100). This is expected Dask behavior AFAIK. That should be pretty easy to fix with an np.where.

I haven't dug into why pressure_to_height_std doesn't work. I'll see if I can do that soon.

Another problem that is likely to come up, but I haven't dug into yet, involves scipy functions. I'm not sure how these will interact with Dask Arrays, but there may need to be some special handling for that. Specifically, I'm thinking of any vertical profile calculations (e.g., CAPE)... chunking is going to be very important for those.
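For concreteness, a minimal sketch (not the actual MetPy code; values and chunking are arbitrary) of the unknown-size behavior described above and the np.where-style alternative:

```python
import dask.array as da
import numpy as np

wdir = da.from_array(np.array([-10., 0., 45., 270.]), chunks=2)

# Boolean indexing leaves dask with a chunk of unknown size:
print(wdir[wdir <= 0].shape)  # (nan,)

# np.where avoids indexing, so the output shape stays fully determined:
fixed = np.where(wdir <= 0, wdir + 360., wdir)
print(fixed.shape, fixed.compute())  # (4,) [350. 360.  45. 270.]
```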
Some discussion points:
- metpy.calc is a good place to start.
- Some functions will need reworking (e.g., wind_direction); changes in performance to accommodate Dask support should be considered for each case.

cc: @jthielen @tjwixtrom