
support for units #525

Open
p3trus opened this issue Aug 11, 2015 · 64 comments
Labels
topic-arrays (related to flexible array support), topic-CF conventions

Comments

@p3trus

p3trus commented Aug 11, 2015

One thing that bothers me a lot is that pandas lacks good unit support (e.g. quantities, pint, ...).

Is there a chance that xray will support it?

@shoyer
Member

shoyer commented Aug 12, 2015

Eventually, I hope so!

Unfortunately, doing this in a feasible and maintainable way will probably require upstream fixes in NumPy. In particular, better support for duck-array types (numpy/numpy#4164) and/or the ability to write units as custom NumPy dtypes. Both of these are on the NumPy roadmap, though they don't have a timeframe for when that will happen.

@rabernat
Contributor

rabernat commented Aug 17, 2015

Astropy has pretty good units support:
http://astropy.readthedocs.org/en/latest/units/
Would it be possible to copy what they do?

@shoyer
Member

shoyer commented Aug 17, 2015

Unfortunately, the astropy approach uses a numpy.ndarray subclass, which means it's mutually exclusive with dask.array. Otherwise, it does look very nice, though.


@mhvk

mhvk commented Sep 19, 2015

@shoyer - as one who thinks unit support is probably the single best thing astropy has (and is co-maintainer of astropy.units), I thought I'd pipe in: why would it be a problem that astropy's Quantity is an ndarray subclass? I must admit not having used dask arrays, but since they use ndarray internally for the pieces, shouldn't the fact that Quantity has the same interface/methods make it relatively easy to swap ndarray for Quantity internally? I'd be quite happy to help think about this (surely it cannot be as bad as it is for MaskedArray ;-).

Alternatively, maybe it is easier to tag on the outside rather than the inside. This would also not seem to be that hard, given that astropy's Quantity is really just a wrapper around ndarray that carries a Unit instance. I think the parts that truly wrap might be separated from those that override ndarray methods, and would be willing to implement that if there is a good reason (like making dask quantities possible...). It may be that in this case one would not use Quantity proper, but rather just the parts of units where the real magic happens: in the Unit class (which does the unit conversion) and in quantity_helpers.py (which tells what unit conversion is necessary for a given operation/function).
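For concreteness, a minimal sketch of the "tag on the outside" idea (names hypothetical, not astropy API; it assumes a Unit object with astropy-style .to() scale-factor conversion):

class UnitWrapper:
    """Hypothetical sketch: carry a unit next to any array-like."""

    def __init__(self, data, unit):
        self.data = data  # could be an ndarray, a dask array, ...
        self.unit = unit  # e.g. an astropy.units.Unit

    def __add__(self, other):
        # a real implementation would look up the converter in
        # something like quantity_helpers
        if other.unit != self.unit:
            other = other.to(self.unit)
        return UnitWrapper(self.data + other.data, self.unit)

    def to(self, unit):
        # the unit conversion itself is the job of the Unit class,
        # e.g. Unit("cm").to("m") == 0.01
        factor = self.unit.to(unit)
        return UnitWrapper(self.data * factor, unit)

    def __getattr__(self, name):
        # delegate everything else (shape, dtype, ...) to the wrapped array
        return getattr(self.data, name)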

@shoyer
Member

shoyer commented Sep 21, 2015

@mhvk It would certainly be possible to extend dask.array to handle units, in either of the ways you suggest.

Although you could allow Quantity objects inside dask.arrays, I don't like that approach, because static checks like units really should be done only once when arrays are constructed (akin to dtype checks), rather than at evaluation time for every chunk. This suggests that tagging on the outside is the better approach.

So far, so good -- but with the current state of duck array typing in NumPy, it's really hard to be happy with this. Until __numpy_ufunc__ lands, we can't override operations like np.sqrt in a way that is remotely feasible for dask.arrays (we can't afford to load big arrays into memory). Likewise, we need overrides for standard numpy array utility functions like concatenate. But the worst part is that the lack of standard interfaces means that we lose the possibility of composing different array backends with your Quantity type -- it will only be able to wrap dask or numpy arrays, not sparse matrices or bolt arrays or some other type yet to be invented.

Once we have all that duck-array stuff, then yes, you certainly could write a duck-array Quantity type that can wrap generic duck-arrays. But something like Quantity really only needs to override compute operations so that they can propagate dtypes -- there shouldn't be a need to override methods like concatenate. If you had an actual (parametric) dtype for units (e.g., Quantity[float64, 'meters']), then you would get all those dtype-agnostic methods for free, which would make your life as an implementer much easier. Hence why I think custom dtypes would really be the ideal solution.
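To make the parametric-dtype spelling concrete, a purely hypothetical sketch (NumPy has no such dtype; __class_getitem__ here only mimics the Quantity[float64, 'meters'] syntax):

import numpy as np

class Quantity:
    """Hypothetical sketch of a parametric Quantity[dtype, unit] type."""
    _cache = {}

    def __class_getitem__(cls, params):
        dtype, unit = params
        key = (np.dtype(dtype), unit)
        if key not in cls._cache:
            # a real implementation would register a custom NumPy dtype;
            # this only records the parameters to make the spelling work
            cls._cache[key] = type(
                f"Quantity[{key[0]}, {unit!r}]",
                (cls,),
                {"dtype": key[0], "unit": unit},
            )
        return cls._cache[key]

q_m = Quantity[np.float64, "meters"]
print(q_m.unit)  # meters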

@mhvk

mhvk commented Sep 21, 2015

@shoyer - fair enough, and sad we don't have __numpy_ufunc__ yet... I agree that with Quantity inside, one would end up duplicating work for every chunk, which makes it less than ideal even though it would probably be the easier approach to implement.

For the outside method, from the dask perspective, it would indeed be easiest if units were done as a dtype, since then you can punt all the decisions to helper routines. My guess, though, is that it will be a while before numpy will include what is required to tell, e.g., that if I add something in m to something in cm, the second argument has to be multiplied by 0.01. But astropy does provide something just like that: quantity_helpers exposes a dict keyed by operation, which holds functions that return the required converters given the units. E.g., in the above example, internally what happens is

converters, result_unit = UFUNC_HELPERS[np.add](np.add, *units)
result_unit
# Unit("m")
converters[0]
# None
converters[1]
# <function astropy.units.quantity_helper.get_converter.<locals>.<lambda>>
converters[1](1.)
# 0.01

In dask, you could run the converters on your individual chunks, though obviously I don't know how easy it is to add an extra step like this without slowing down other aspects too much.
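A rough sketch of that idea (assuming dask is installed; the 0.01 factor stands in for the converter returned by quantity_helpers above):

import dask.array as da

cm_values = da.ones(6, chunks=3)  # pretend these carry Unit("cm")

def to_m(chunk):
    # converter from quantity_helpers: cm -> m
    return chunk * 0.01

in_m = cm_values.map_blocks(to_m)  # applied lazily, chunk by chunk
print(in_m.compute())              # [0.01 0.01 0.01 0.01 0.01 0.01]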

@mhvk

mhvk commented Sep 21, 2015

p.s. For concatenate, you need unit conversion as well, so sadly Quantity does need to override that too (and currently cannot, which is rather annoying).

@hgrecco

hgrecco commented Feb 8, 2016

I am one of the authors of Pint and I was just pointed here by @arsenovic

Pint does not subclass ndarray; rather, it wraps any numerical type, dispatching to the wrapped value any attribute access it does not understand. By defining __array_prepare__ and __array_wrap__, most numpy functions and array attributes work as expected without monkey patching or having a specialized math module. For example:

>>> import numpy as np
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> [1., 4., 9.] * ureg.meter # a list is interpreted as an ndarray
<Quantity([1. 4. 9.], 'meter')>
>>> np.sqrt(_)
<Quantity([ 1.  2.  3.], 'meter ** 0.5')>
>>> _.sum()
<Quantity(6.0, 'meter ** 0.5')>

I think something similar can be done for xarray.

@mhvk

mhvk commented Feb 9, 2016

@hgrecco - for astropy's Quantity, we currently also rely on __array_prepare__ and __array_wrap__. The main annoyances are (1) one cannot change the input before a numpy ufunc is called, and therefore often has no choice but to let a wrong calculation proceed; (2) proper recognition in non-ufunc functions is sparse (e.g., np.dot, etc.; see http://docs.astropy.org/en/latest/known_issues.html#quantity-issues)
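A minimal reproduction of annoyance (2), as of the numpy/astropy versions around this comment (later releases dispatch np.dot correctly via __array_function__):

import numpy as np
from astropy import units as u

q = np.array([1.0, 2.0]) * u.m
print(q * q)         # ufunc path: units propagate -> [1. 4.] m2
print(np.dot(q, q))  # non-ufunc path: historically gave a result with
                     # the unit silently dropped or wrong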

Aside: at some point I'd hope to get the various implementations of units to talk together: it would be good to have an API that works such that units are inter-operable.

@shoyer
Member

shoyer commented Feb 9, 2016

@hgrecco Are you suggesting that pint could wrap xarray objects, or that xarray could wrap pint? Either is certainly possible, though I'm a bit pessimistic that we can come up with a complete solution without the numpy fixes we've been discussing.

Also, just to note, xarray contains a Dataset type for representing collections of variables that often have different units (e.g., temperature and pressure). That suggests to me that it could make more sense to put pint and/or astropy.Quantity objects inside xarray arrays rather than the other way around.
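A sketch of that "quantities inside xarray" direction (hypothetical at the time of this comment; it only works once xarray passes duck arrays through):

import numpy as np
import pint
import xarray as xr

ureg = pint.UnitRegistry()
ds = xr.Dataset(
    {
        # each variable carries its own unit inside the array
        "temperature": ("x", ureg.Quantity(np.array([280.0, 290.0]), "kelvin")),
        "pressure": ("x", ureg.Quantity(np.array([1000.0, 990.0]), "hPa")),
    }
)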

@arsenovic

I'd just like to chime in and say that this feature would really be sweet. I always find myself doing a lot of work to handle/convert different units.
It seems that adding units to labeled axes does a lot to describe a set of data.

@hgrecco

hgrecco commented Feb 10, 2016

@shoyer When we prototyped Pint we tried putting Quantity objects inside a numpy array. It was working fine, but the performance and memory hit was too large. We were convinced that our current design was right when we wrote the first code using it. The case might be different with xarray. It would be nice to see some code using xarray and units (as if this was an already implemented feature).

@mhvk I do agree with your views. We also mention these limitations in the Pint documentation. Wrapping (instead of subclassing) adds another issue: some NumPy functions do not recognize a Quantity object as an array. Therefore any function that calls numpy.asanyarray will erase the information that this is a quantity (see my issue here: numpy/numpy#4072).

In any case, as was mentioned before in the thread, custom dtypes and duck typing would be great for this.

In spite of these limitations, we chose wrapping because we want to support quantities even if NumPy is not installed. It has worked really well for us, covering most of the common cases even for numpy arrays.

Regarding interoperating: it would be great. It would be even better if we could move to one, blessed, solution under the pydata umbrella (or similar).

@spencerahill
Contributor

Not to be pedantic, but just one more 👍 on ultimately implementing units support within xarray -- that would be huge.

@shoyer
Member

shoyer commented Feb 11, 2016

If anyone is excited about working on the NumPy improvements we need to make this more feasible (namely, custom dtypes and duck typing) at BIDS, you should talk to @njsmith.

@dopplershift
Contributor

I agree that custom dtypes is the right solution (and I'll go dig some more there). In the meantime, I'm not sure why you couldn't wrap an xarray DataArray in one of pint's Quantity instances. With the exception of also wanting units on coordinates, this seems like a straightforward way to get at least some unit functionality.
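Sketched, the suggested direction looks like this (whether arithmetic and .to() survive the round trip depends on the pint and xarray versions):

import numpy as np
import pint
import xarray as xr

ureg = pint.UnitRegistry()
arr = xr.DataArray(np.arange(3.0), dims="x", coords={"x": [10, 20, 30]})
q = ureg.Quantity(arr, "m")  # pint outside, xarray inside
print(q.to("cm"))            # conversion scales the wrapped DataArray;
                             # the 'x' coordinate itself stays unitless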

@shoyer
Member

shoyer commented Aug 27, 2016

#988 describes a possible approach for allowing third-party libraries to add units to xarray. It's not as comprehensive as a custom dtype, but might be enough to be useful.

@burnpanck
Contributor

+1 for units support. I agree, parametrised dtypes would be the preferred solution, but I don't want to wait that long (I would be willing to contribute to that end, but I'm afraid that would exceed my knowledge of numpy).

I have never used dask. I understand that support for dask arrays is a central feature for xarray. However, the way I see it, if one put a (unit-aware) ndarray subclass into an xarray object, then units should work out of the box. As you discussed, this seems not so easy to make work together with dask (particularly in a generic way). However, shouldn't that be an issue that the dask community has to solve anyway (i.e., currently there is no way to use any units package together with dask, right)? In that sense, allowing such arrays inside xarray would force users to choose between dask and units, which is something they have to do anyway. But for a big part of users, that would be a very quick way to get units!

Or am I missing something here? I'll just try to monkeypatch xarray to that end, and see how far I get...

@shoyer
Member

shoyer commented Sep 19, 2016

@burnpanck Take a look at the approach described in #988 and let me know if you think that sounds viable.

NumPy subclasses inside xarray objects would probably mostly work, if we changed some internal uses of np.asarray to np.asanyarray. But it's also a pretty big rabbit hole. I'm still not sure there are any good ways to do operations like concatenate.
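The difference in one line -- np.asanyarray passes ndarray subclasses (like astropy's Quantity) through, while np.asarray strips them:

import numpy as np
from astropy import units as u

q = np.array([1.0, 2.0]) * u.m
print(type(np.asarray(q)))     # <class 'numpy.ndarray'> -- unit stripped
print(type(np.asanyarray(q)))  # <class 'astropy.units.quantity.Quantity'>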

@burnpanck
Contributor

#988 would certainly allow me to implement unit functionality on xarray, probably by leveraging an existing units package.
What I don't like with that approach is that I essentially end up with a separate, distinct implementation of units. I am afraid I will have to re-implement many of the helpers that I wrote to work with physical quantities to be xarray-aware. Furthermore, one important aspect of units packages is that they prevent you from making conversion mistakes. But that only works as long as you don't forget to carry the units with you. Having units just as attributes on xarray objects means losing them is as simple as forgetting to read the attributes when accessing the data.
The units-inside-xarray approach would have the advantage that whenever you end up accessing the data inside xarray, you automatically have the units with you.
From a conceptual point of view, the units are really an integral part of the data, so they should sit right there with the data. Whenever you do something with the data, you have to deal with the units. That is true no matter whether it is implemented as an attribute handler or directly on the data array. My fear is that attributes leave the impression of "optional" metadata which is too easily lost. E.g., xarray doesn't call its ufunc_hook for some operation where it should, and you silently lose units. My hope is that with nested arrays that carry units, you would instead fail verbosely. Of course, np.concatenate is precisely one of those cases where unit packages struggle to get their hook in (and where units on dtypes would help). So they fight the same problem. Nonetheless, these problems are known and solved as well as possible in the units packages, but in xarray one would have to deal with them all over again.

@burnpanck
Contributor

burnpanck commented Sep 20, 2016

Or another way to put it: while typical metadata/attributes are only relevant if you eventually read them (which is where you will notice if they were lost on the way), units are different: they work silently behind the scenes at all times, even if you do not explicitly look for them. You want an addition to fail if units don't match, without having to explicitly test first whether the operands have units. So what should the ufunc_hook do if it finds two Variables that don't seem to carry units -- raise an exception? Most probably not, as that would prevent using xarray without units at the same time. So if the units are lost on the way, you might never notice, but end up with wrong data. To me, that is just not unlikely enough to happen given the damage it can do (e.g. the time it takes to find out what's going on once you realise you are getting wrong data).

@burnpanck
Contributor

So for now, I'm hunting for np.asarray calls.

@mhvk

mhvk commented Sep 20, 2016

@burnpanck - thanks for the very well-posed description of why units are so useful not as some meta-data, but as an integral property. Of course, this is also why making them part of a new dtype is a great idea! But failing that, I'd agree that it has to be part of something like an ndarray subclass; this is indeed what we do in astropy.units.Quantity (and concatenate does not work for us either...).

Now, off-topic but still: what is a little less wonderful is that there seem to be quite a few independent units implementations around (even just in astronomy, there is that of amuse; ours is based on things initially developed by pynbody). It may well be hard to merge them at this stage, but it would be good to think about how we could at least interoperate...

@jthielen
Contributor

Based on the points raised by @crusaderky in hgrecco/pint#878 (comment) about how much special-case handling xarray has for dask arrays, I was thinking recently about what it might take for the xarray > pint > dask.array wrapping discussed here and elsewhere to work as fluidly as xarray > dask.array currently does.

Would it help for this integration to have pint Quantities implement the dask custom collections interface for when they wrap a dask array? I would think that this would allow a pint Quantity to behave in a "dask-array-like" way rather than just an "array-like" way. Then, instead of xarray checking for isinstance(dask_array_type), it could check for "duck dask arrays" (e.g., those with both __array_function__ and __dask_graph__)? There are almost certainly some subtle implementation details that would need to be worked out, but I'm guessing that this could take care of the bulk of the integration.
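The proposed predicate, sketched (the helper name is hypothetical here, though xarray later adopted essentially this check):

def is_duck_dask_array(x):
    # "duck array": participates in NEP-18 (__array_function__) dispatch;
    # "dask collection": exposes a task graph via the collections interface
    return hasattr(x, "__array_function__") and hasattr(x, "__dask_graph__")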

Also, if I'm incorrect with this line of thought, or there is a better way forward for implementing this wrapping pattern, please do let me know!

@shoyer
Member

shoyer commented Sep 15, 2019

Would it help for this integration to have pint Quantities implement the dask custom collections interface for when they wrap a dask array? ... Then, instead of xarray checking for isinstance(dask_array_type), it could check for "duck dask arrays" (e.g., those with both __array_function__ and __dask_graph__)?

Yes, great idea!

@ngoldbaum

Hi all, I was just pointed at this by someone who went to the NumFOCUS summit. I'm the lead developer for unyt, which I see has come up a little bit. If anyone wants to chat with me about adding support for unyt along with Pint I'd be happy to discuss.

@keewis
Collaborator

keewis commented Dec 8, 2019

sorry for the late reply, @ngoldbaum. For xarray to support unyt, unyt would have to implement NEP-18 (i.e. __array_function__), which I think it does not yet? I would expect unyt support to come for free once pint support works, so I would focus on that first (see #3594 for the current progress). Extending the tests to also test unyt would be possible, but I'm thinking it would be better to put that into the accessor package (as discussed above for a possible pint accessor)?

There were a few unresolved design questions about how unit libraries should implement certain numpy functions (e.g. how where should behave when receiving array(nan) and array(0), which xarray uses to implement nanops, or which unit the result of full_like should have): see hgrecco/pint#905. Any supported unit library would have to behave the same way, so I think it would be good to coordinate that. Thoughts?
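The where question in concrete terms (a sketch of the nanops-style call; the open question is what unit the bare array(0.0) should be assumed to have):

import numpy as np

data = np.array([1.0, np.nan, 3.0])  # imagine this array carries "m"
filled = np.where(np.isnan(data), np.array(0.0), data)
# if `data` were a quantity: is array(0.0) 0 m, dimensionless, or an
# error? every supported unit library must answer this the same way.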

@StanczakDominik
Contributor

Hi! I'm just popping in as a very interested user of both xarray and unit packages to ask: since there's been some awesome progress made here and pint-xarray is now enough of A Thing to have documentation, though obviously experimental - how much work would you expect a corresponding package for astropy's Quantities to take, given the current state of things?

Are there any limitations that would prevent that? I saw the discussion above about Quantities being more problematic due to taking the subclass-from-numpy-arrays route, but I'm not sure how much of a roadblock that still is. I would suspect the API could be shared with pint-xarray (which, obviously, is experimental for now).

@keewis
Collaborator

keewis commented Nov 25, 2020

I saw the discussion above about Quantities being more problematic

I would expect astropy quantities to work just fine as long as they are duck arrays. Also, the only limitation I would expect them to have is that they can't "wrap" anything other than numpy arrays. dask arrays are an exception since they can wrap quantities using the _meta attribute (which is something we try to avoid with pint, though).

For reference, the currently remaining issues for all duck arrays (except obviously dask) are documented here

@keewis
Collaborator

keewis commented Jan 19, 2021

I would expect astropy quantities to work just fine as long as they are duck arrays

actually, that turns out to be wrong. Since isinstance(data, np.ndarray) returns True for astropy.units.Quantity, it is cast to ndarray using np.asarray:

if not isinstance(data, np.ndarray):
    if hasattr(data, "__array_function__"):
        if IS_NEP18_ACTIVE:
            return data
        else:
            raise TypeError(
                "Got an NumPy-like array type providing the "
                "__array_function__ protocol but NEP18 is not enabled. "
                "Check that numpy >= v1.16 and that the environment "
                'variable "NUMPY_EXPERIMENTAL_ARRAY_FUNCTION" is set to '
                '"1"'
            )

# validate whether the data is valid data types.
data = np.asarray(data)

Adding or issubclass(type(data), np.ndarray) or type(data) != np.ndarray does allow wrapping an astropy.units quantity in Dataset / DataArray objects, but it breaks a few tests. Also, unless we modify the testsuite in xarray/tests/test_units.py to run with astropy.units instead of pint, I can't really tell which features of xarray strip the units (in addition to the ones documented for pint). For that, we probably need to somehow create a generalization of the tests for duck arrays.
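Applied to the code block above, the described change would look roughly like this (a sketch, not the eventual fix):

# let ndarray *subclasses* (e.g. astropy.units.Quantity) fall through to
# the duck-array branch; only exact np.ndarray instances get cast below
if not isinstance(data, np.ndarray) or type(data) != np.ndarray:
    if hasattr(data, "__array_function__"):
        if IS_NEP18_ACTIVE:
            return data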

@riley-brady

I believe this can be closed via #5734 @andersy005?

@tien-vo
Contributor

tien-vo commented Nov 2, 2024

if isinstance(data, NON_NUMPY_SUPPORTED_ARRAY_TYPES):
    return convert_non_numpy_type(data)
if isinstance(data, tuple):
    data = utils.to_0d_object_array(data)
if isinstance(data, pd.Timestamp):
    # TODO: convert, handle datetime objects, too
    data = np.datetime64(data.value, "ns")
if isinstance(data, timedelta):
    data = np.timedelta64(getattr(data, "value", data), "ns")

# we don't want nested self-described arrays
if isinstance(data, pd.Series | pd.DataFrame):
    pandas_data = data.values
    if isinstance(pandas_data, NON_NUMPY_SUPPORTED_ARRAY_TYPES):
        return convert_non_numpy_type(pandas_data)
    else:
        data = pandas_data

if isinstance(data, np.ma.MaskedArray):
    mask = np.ma.getmaskarray(data)
    if mask.any():
        dtype, fill_value = dtypes.maybe_promote(data.dtype)
        data = duck_array_ops.where_method(data, ~mask, fill_value)
    else:
        data = np.asarray(data)

# immediately return array-like types except `numpy.ndarray` subclasses and `numpy` scalars
if not isinstance(data, np.ndarray | np.generic) and (
    hasattr(data, "__array_function__") or hasattr(data, "__array_namespace__")
):
    return cast("T_DuckArray", data)

# validate whether the data is valid data types. Also, explicitly cast `numpy`
# subclasses and `numpy` scalars to `numpy.ndarray`
data = np.asarray(data)

@keewis it looks like adding an if block before line 323 would allow astropy quantities to be wrapped as you said, i.e.,

if isinstance(data, u.Quantity):
    return cast("T_DuckArray", data)

what else is needed to make this happen? I'd love to make some progress towards xarray wrapping astropy.units. Should #7799 be picked up again since @dstansby can't work on it anymore?

@keewis
Collaborator

keewis commented Nov 2, 2024

We've since made some progress with the automatic generation of tests for arbitrary duck arrays in xarray-array-testing, so it might be better to work on that instead of trying to generalize the massive (and probably stale) testsuite for testing unit support with pint.

That said, that's for testing the support, not the support itself. Implementing that should be fairly easy, we should just make sure to continue banning numpy.matrix and numpy.ma.masked_array (I think those were the ones we explicitly wanted to convert to numpy.ndarray in #2956). Either way, feel free to open a dedicated issue to track astropy.units support.

Edit: actually, numpy.ma.MaskedArray is special-cased immediately above the block for duck arrays, so I believe the only thing we have to make sure is to not pass through numpy.matrix instances.
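A sketch of that remaining guard, placed before the duck-array pass-through in the conversion code quoted earlier:

# np.matrix is an ndarray subclass with surprising 2-D-only semantics,
# so coerce it explicitly instead of passing it through as a duck array
if isinstance(data, np.matrix):
    data = np.asarray(data)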
