Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xarray.map #4484

Closed
wants to merge 14 commits into from
Closed

xarray.map #4484

wants to merge 14 commits into from

Conversation

kefirbandi
Copy link
Contributor

@kefirbandi kefirbandi commented Oct 4, 2020

UPDATE:
Please let me know whether this PR can be considered to be merged.
If not I won't bother trying to fix failed tests such as:

would reformat /home/vsts/work/1/s/xarray/core/map.py
would reformat /home/vsts/work/1/s/xarray/core/dataset.py
would reformat /home/vsts/work/1/s/xarray/tests/test_map.py
Oh no! 💥 💔 💥

Thanks


  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

I implemented the top-level xarray.map function.
It is a generalization of the Dataset.map method for those functions which take more than one DataArray as input.
The function will be applied to the intersection of the variables in the datasets.

E.g:

f = lambda a, b: b-a
map([ds1, ds2], f)

More details in the docstring.

(I probably messed up something with git as the commits listed below also include my earlier commits not related to this PR. But the list of files is clean, it only includes what I'd like to merge)

@pep8speaks
Copy link

pep8speaks commented Oct 4, 2020

Hello @kefirbandi! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-04 04:43:38 UTC

@max-sixty
Copy link
Collaborator

Hi @kefirbandi , thanks for the PR!

Could I ask what the common use cases for this would be? If I understand correctly, running map(x, y, lambda x: x + y) is equivalent to x + y.

@shoyer
Copy link
Member

shoyer commented Oct 6, 2020

If we're doing to do this, I would suggest that the right signature is xarray.map(func, *datasets, **optional_kwargs), matching Python's builtin map.

shared_variable_names = set.intersection(*(set(ds.data_vars) for ds in datasets))
for k in shared_variable_names:
data_arrays = [d[k] for d in datasets]
v = maybe_wrap_array(datasets[0][k], func(*(data_arrays + list(args)), **kwargs))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not use maybe_wrap_array in new functions. It exists in map only for legacy reasons, but we'd really like to put that sort of functionality into a separate dedicated function (e.g., see the proposal for apply_raw in #1618).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem for me. I copied the existing behavior without actually really paying attention as to what it is doing an why.

@shoyer
Copy link
Member

shoyer commented Oct 6, 2020

I think this could make sense as a generalization of Dataset.map. I would indeed be interested to hear more about the specific use cases that motivated this, though.

@kefirbandi
Copy link
Contributor Author

Could I ask what the common use cases for this would be? If I understand correctly, running map(x, y, lambda x: x + y) is equivalent to x + y.

The motivating use case was that I wanted to compute the dot-product of two DataSets (=all of their matching variables).
But in general any other function which is not as simple as x + y could be used here.

@max-sixty
Copy link
Collaborator

The motivating use case was that I wanted to compute the dot-product of two DataSets (=all of their matching variables).

Thanks, that's a good case.

A couple of thoughts:

  • Are there many other cases outside of xr.dot which only operate on DataArrays? If not, we could update that function to take a Dataset
  • Along those lines, I wonder whether this becomes an underlying function — functions which currently operate only on DataArrays could have Dataset inputs run through this.
  • NB: In many of the computation functions, the logic runs the other way (we define the DataArray function to be the Dataset operation on a single-variable Dataset); having a more consistent way of doing this could make extending functions between DataArrays & Datasets easier.
  • Maybe jumping ahead — are there functions where the result of func(ds1, ds2) shouldn't be that function mapped over the matching variables?

@dcherian
Copy link
Contributor

dcherian commented Oct 7, 2020

Are there many other cases outside of xr.dot which only operate on DataArrays? If not, we could update that function to take a Dataset

I think we should do this.

@kefirbandi
Copy link
Contributor Author

* Are there many other cases outside of `xr.dot` which only operate on `DataArray`s? If not, we could update that function to take a `Dataset`

I think it would be a good idea to extend dot to Datasets. However a user may wish to map a custom DataArray function to Dataset.

* Maybe jumping ahead — are there functions where the result of `func(ds1, ds2)` shouldn't be that function mapped over the matching variables?

Not sure of the context of this. In the most general case one can certainly implement any function on ds1 and ds2. Or are you referring to the built-ins such as .dot?

@shoyer
Copy link
Member

shoyer commented Oct 7, 2020

I think some version of xarray.map could indeed be pretty useful more generally. In particular, it could implement each of the ways to "align" variables between different datasets, e.g., should we use the intersection of the variables, or the union, inserting dummy variables with fill values? The dataset_join argument to xarray.apply_ufunc controls this behavior, but it would be nice to have a self-contained version of this.

One challenge for using this internally in xarray (vs. implementing things only on Dataset objects) is that the coordinates on the DataArray objects inside a Dataset can be redundant, so mapping over DataArrays may be less efficient, due to duplicate work on coordinates. In contrast, once everything is in Dataset objects, all the variables/coordinates are pre-aligned and de-deduplicated.

@kefirbandi
Copy link
Contributor Author

If we're doing to do this, I would suggest that the right signature is xarray.map(func, *datasets, **optional_kwargs), matching Python's builtin map.

What I'd like to ensure is a clean separation between the arguments of xarray.map and func. Map has three "own" parameters, like func, datasets and keep_attrs. By using the **kwargs approach we are excluding these parameter names from func. Not saying that is likely that anyone would apply a function with such parameter names. But not impossible either.
Also having a real dict for keyword args (and maybe a list for positional arguments of func) is more explicit.

In my implementation the order of parameters is datasets and then func to match that of Dataset.map with the implicit self.
But probably func and then the datasets is more intuitive.

@max-sixty
Copy link
Collaborator

It's not great for us as reviewers that we let this sit idle — sorry @kefirbandi ....

But reviewing it now, I would vote to have func(ds1, ds2) work for all our functions, rather than have a new top-level method xr.map. xr.apply_ufunc will do that for us IIUC — i.e. it can operate as an xr.map internally.

If others agree, we could close this, and potentially open an issue for any methods which don't have that interface (which is maybe only xr.dot?)

@kefirbandi
Copy link
Contributor Author

I'm ok with that.

@max-sixty max-sixty closed this Oct 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants