Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pint support for variables #3706

Merged
merged 23 commits into from
Feb 23, 2020
Merged

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Jan 17, 2020

I realized that most of the operations depend on Variable, which means that this should be the first step to making the other tests pass. This PR copies from #3611, making that PR a work-in-progress again until I remove those parts.

These are the current failures:

  • np.prod: not implemented by pint yet, but should work once it is
  • comparisons: identical fails to detect same value, but in different units (like 1 m and 100 cm) as different. This is hard to implement in a way that does not include isinstance or hasattr checks.
  • rank: not implemented for non-ndarrays, so maybe we should mark this as skip?
  • rolling_window: nputils._rolling_window uses np.lib.stride_tricks.as_strided, which cannot be overridden by pint. We probably have to use something different?
  • shift: this tries to trim, then concatenate a filled array, but I think this should use np.pad after trimming? Are there any disadvantages to that? it uses numpy.pad now which is supported since dask==0.18 or dask==0.19
  • concat: this was a misconception on my part, I didn't realize this was a classmethod. After fixing that this still fails because I assumed Variable.concat used the dimension names to reshape the arrays. It does not, so this fails: I rewrote the test to pass arrays that don't cause the failure
xr.Variable.concat([xr.Variable(("x", "y"), ...), xr.Variable(("y", "z"), ...)], dim="y")

Does anyone have any comments on these before I start fixing them? @dcherian?

@keewis
Copy link
Collaborator Author

keewis commented Jan 19, 2020

I fixed the tests that were easier, but I have no idea about rolling_window / as_strided because I don't understand how it works, and identical would be easy to fix if we special-cased pint, but I'd like to avoid that

Edit: seems there are some issues with np.pad support in dask==1.2: __array_function__ support is probably only in later versions so we'd need to use da.pad instead, but that raises an error, too.

@keewis
Copy link
Collaborator Author

keewis commented Jan 20, 2020

I'm not sure why, but dask==1.2 fails where later versions work:

ValueError: operands could not be broadcast together with shapes (2,0) (2,2)

Edit: Seems we can just bump dask to dask==2.1 or dask==2.2 and not worry about this. I'd wait on #3660 with this.

@keewis keewis requested a review from dcherian January 20, 2020 17:16
@keewis
Copy link
Collaborator Author

keewis commented Jan 23, 2020

the easiest way to fix identical would be to pass a equiv function that forwards to duck_array_ops.array_equiv but also checks the units:

def units_equiv(a, b):
    if hasattr(a, "units") or hasattr(b, "units"):
        units_a = getattr(a, "units", None)
        units_b = getattr(b, "units", None)

	registry = getattr(units_a, "_REGISTRY", None) or getattr(b, "_REGISTRY", None)

	units_a = units_a or registry.dimensionless
	units_b = units_b or registry.dimensionless

	return units_a == units_b
    else:
	# no units at all
	return True

def equiv_with_units(a, b):
    return duck_array_ops.array_equiv(a, b) and units_equiv(a, b)

However, I'm out of ideas on how to automatically use that (I'd like to avoid having to call a.identical(b, equiv=equiv_with_units)).


There is also rolling_window that strips the units due to calling numpy.lib.stride_tricks.as_strided, but I don't know how to fix that unless we rewrite _rolling_window or dispatch differently in rolling_window.

There seems to have been some work to add a function named numpy.rolling_window or numpy.sliding_window that could then be overridden by pint, but I think that effort has stalled?

@dcherian
Copy link
Contributor

I think your units_equiv thing could go in duck_array_ops.lazy_array_equiv which tries to check everything but the actual values themselves.

but I don't know how to fix that unless we rewrite _rolling_window or dispatch differently in rolling_window.

This may be necessary but perhaps not the most important thing right now?

@keewis
Copy link
Collaborator Author

keewis commented Jan 23, 2020

I think your units_equiv thing could go in duck_array_ops.lazy_array_equiv

that could work for Dataset.identical but not Variable.identical, which I believe by default directly calls array_equiv

re rolling_window: should I leave it as xfail for now?

@dcherian
Copy link
Contributor

array_equiv calls lazy_array_equiv before doing much else:

def array_equiv(arr1, arr2):
"""Like np.array_equal, but also allows values to be NaN in both arrays
"""
arr1 = asarray(arr1)
arr2 = asarray(arr2)
lazy_equiv = lazy_array_equiv(arr1, arr2)
if lazy_equiv is None:
with warnings.catch_warnings():
warnings.filterwarnings("ignore", "In the future, 'NAT == x'")
flag_array = (arr1 == arr2) | (isnull(arr1) & isnull(arr2))
return bool(flag_array.all())
else:
return lazy_equiv

@dcherian
Copy link
Contributor

re rolling_window: should I leave it as xfail for now?

Fine by me. :)

@keewis
Copy link
Collaborator Author

keewis commented Jan 23, 2020

hrmm... I had investigated this before and thought I remembered correctly. I can't put it in lazy_array_equiv, though, since array_equiv is also used by equals (which should not check the units, only the dimensionality)

@dcherian
Copy link
Contributor

I guess we need to either pass a kwarg regarding checking units/dimensionality down to lazy_array_equiv or add a units_equiv(only_dimensionality=True/False) check to

xarray/xarray/core/variable.py

Lines 1645 to 1648 in 6d1434e

try:
return self.dims == other.dims and (
self._data is other._data or equiv(self.data, other.data)
)

I think a separate units_equiv function may be cleaner?

Note that we explicitly use lazy_array_equiv in concat so it'd be nice to have something that could be easily used there too:

for k in getattr(datasets[0], subset):
if k not in concat_over:
equals[k] = None
variables = [ds.variables[k] for ds in datasets]
# first check without comparing values i.e. no computes
for var in variables[1:]:
equals[k] = getattr(variables[0], compat)(
var, equiv=lazy_array_equiv
)
if equals[k] is not True:
# exit early if we know these are not equal or that
# equality cannot be determined i.e. one or all of
# the variables wraps a numpy array
break

@keewis
Copy link
Collaborator Author

keewis commented Jan 24, 2020

I think a separate units_equiv function may be cleaner?

I agree. However, I'm reluctant to add that function since that would be the first pint dependent code we have except from the unit tests (but do tell me if that's not an issue).

I'm inclined to provide some kind of hook for wrapped libraries instead (allow them to define a method like metadata_identical or something that does the check for us) so the code in identical would become something like

def metadata_identical(arr1, arr2):
    if hasattr(arr1, "metadata_identical"):
        return arr1.metadata_identical(arr2)
    elif hasattr(arr2, "metadata_identical"):
        return arr2.metadata_identical(arr1)
    else:
        return True

return (
    self.dims == other.dims
    and (self._data is other._data or equiv(self.data, other.data))
    and metadata_identical(self.data, other.data)
)

Note that we explicitly use lazy_array_equiv in concat so it'd be nice to have something that could be easily used there too

I don't think that's a problem since what calls lazy_array_equiv is Variable.identical and if identical works correctly this should, too.

@dcherian
Copy link
Contributor

OK let's see what @shoyer thinks

@keewis
Copy link
Collaborator Author

keewis commented Feb 2, 2020

gentle ping, @shoyer

@keewis
Copy link
Collaborator Author

keewis commented Feb 5, 2020

these issues (rolling_window and identical) seem too big to discuss here, so in order to keep moving forward let's mark these tests as xfail and discuss this in their own issues / PRs.

@dcherian
Copy link
Contributor

dcherian commented Feb 5, 2020

let's mark these tests as xfail and discuss this in their own issues / PRs.

👍 to smaller PRs!

@keewis keewis changed the title WIP: Pint support for variables Pint support for variables Feb 5, 2020
xarray/core/variable.py Outdated Show resolved Hide resolved
xarray/core/duck_array_ops.py Show resolved Hide resolved
if isinstance(mask, bool):
mask = not mask
else:
mask = ~mask
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you flip the argument order rather than adding this? I’m a little puzzles here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the concern here is about consistency when applying ~ to bool objects and boolean dtype arrays, explicitly calling np.logical_not is a good alternative.

But it does feel a little weird to me to see this here. Maybe changing duck_array_ops.notnull would have the same effect?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as the fillna issue above, in order to get the results in the units of data, we need to flip the arguments and for that I need invert the mask (if there is a different way to flip the arguments without inverting, please do tell me).

I tried to use mask = ~mask, but ~ does not work as expected for bool. I'll use np.logical_not instead.

@shoyer
Copy link
Member

shoyer commented Feb 7, 2020

Thanks for pinging me again here (I get a lot of GitHub notifications). identical is an interesting case!

I think the current behavior (1 meter is identical to 100 centimeters) is arguably consistent with how identical currently works, which only check equality between array elements.

Right now, identical considers numbers of different data types equal, e.g., int 1 is identical to float 1.0``. I think units arguably have a similar to role to data types -- hopefully eventually libraries like pint could be implemented via custom NumPy dtypes, rather than needing to reimplement all of NumPy.

Did this come up in the context of some other downstream use-case, or is this just something that occurred to you for the sake of consistency?

@keewis
Copy link
Collaborator Author

keewis commented Feb 7, 2020

when writing the unit tests, we decided the definition of equals to be different from identical (see #3238 (comment) and #3238 (comment)) which would be consistent with the behaviour of identical when a "units" attribute is set.

As far as I know there was no real use case (except from being able to use assert_identical in the unit tests), so we can change that.

cc @jthielen

@jthielen
Copy link
Contributor

jthielen commented Feb 7, 2020

@keewis For what it is worth, as far as identical goes, I think it makes the most sense to treat unit matching like dtype matching as @shoyer mentioned. Although, I had interpreted @max-sixty's comment #3238 (comment) to mean that dtypes are compared, it appears from @shoyer's comment #3706 (comment) that this not the case. If strict unit checking is required, I think that may be better served by an additional assert unit == "meter" type statement.

@keewis
Copy link
Collaborator Author

keewis commented Feb 19, 2020

If strict unit checking is required, I think that may be better served by an additional assert unit == "meter" type statement.

which is what I've been doing with assert_units_equal. I'll change the tests for identical, then.

Also, concerning the commutative operations: should we wait for hgrecco/pint#1019 and remove the flipped parameters or should we merge as is and possibly revert once pint implements a type casting hierarchy?

@shoyer
Copy link
Member

shoyer commented Feb 21, 2020

Also, concerning the commutative operations: should we wait for hgrecco/pint#1019 and remove the flipped parameters or should we merge as is and possibly revert once pint implements a type casting hierarchy?

I don't anticipate any performance cost to this, just a small decrease in readability. So I think this is fine to merge for now with comments in the relevant sections and we can revert it later. My only suggestion is to add a note like TODO: revert after https://github.com/hgrecco/pint/issues/1019 is fixed to each comment.

@max-sixty
Copy link
Collaborator

Although, I had interpreted @max-sixty's comment #3238 (comment) to mean that dtypes are compared, it appears from @shoyer's comment #3706 (comment) that this not the case.

I was wrong; I should have at least realized I didn't know. Apologies if that caused wasted time @jthielen

Separately: should assert_identical assert that the dtypes are the same? I'd have thought there should be some way of testing whether dtypes are consistent with expectations, and I'd have thought assert_identical would be it?

xarray/tests/test_units.py Outdated Show resolved Hide resolved
@keewis
Copy link
Collaborator Author

keewis commented Feb 23, 2020

@shoyer, I added the notes

@max-sixty, @jthielen: the identical tests will be skipped for now. At the moment that does not make any difference since identical is the same as equals with additionally checking the attributes (any changes to it are not limited to pint, so I guess we should open a new issue for discussion).

@dcherian
Copy link
Contributor

Test failures look unrelated. Thanks @keewis

@dcherian dcherian merged commit 47476eb into pydata:master Feb 23, 2020
@keewis keewis deleted the pint-support-variables branch February 24, 2020 00:09
dcherian added a commit to johnomotani/xarray that referenced this pull request Feb 24, 2020
…under

* upstream/master: (71 commits)
  Optimize isel for lazy array equality checking (pydata#3588)
  pin msgpack (pydata#3793)
  concat now handles non-dim coordinates only present in one dataset (pydata#3769)
  Add new h5netcdf backend phony_dims kwarg (pydata#3753)
  always use dask_array_type for isinstance calls (pydata#3787)
  allow formatting the diff of ndarray attributes (pydata#3728)
  Pint support for variables (pydata#3706)
  Format issue template comment as md comment (pydata#3790)
  Avoid running test_open_mfdataset_list_attr without dask (pydata#3780)
  remove seaborn.apionly compatibility (pydata#3749)
  Python 3.8 CI (pydata#3727)
  PKG: Explicitly add setuptools dependency (pydata#3628)
  update whats-new
  Typo in Universal Functions section (pydata#3663)
  Release v0.15.0
  fix setup.cfg
  Documentation fixes (pydata#3732)
  Remove extra && in PR template (pydata#3730)
  Remove garbage text inserted in DASK_LICENSE (pydata#3729)
  Avoid unsafe use of pip (pydata#3726)
  ...
@keewis keewis mentioned this pull request Jul 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New TestVariable.test_pad failure with pint 0.11
5 participants