
Implement polyfit? #3349

Closed · rabernat opened this issue Sep 27, 2019 · 25 comments · Fixed by #3733

Comments

@rabernat (Contributor)

Fitting a line (or curve) to data along a specified axis is a long-standing need of xarray users. There are many blog posts and SO questions about how to do it.

The main use case in my domain is finding the temporal trend on a 3D variable (e.g. temperature in time, lon, lat).

Yes, you can do it with apply_ufunc, but apply_ufunc is inaccessibly complex for many users. Much of our existing API could be removed and replaced with apply_ufunc calls, but that doesn't mean we should do it.

I am proposing we add a DataArray method called polyfit. It would work like this:

import numpy as np
import xarray as xr

x_ = np.linspace(0, 1, 10)
y_ = np.arange(5)
a_ = np.cos(y_)

x = xr.DataArray(x_, dims=['x'], coords={'x': x_})
a = xr.DataArray(a_, dims=['y'])
f = a*x  # f has dims ('y', 'x'); the slope along 'x' is a_
p = f.polyfit(dim='x', deg=1)

# equivalent numpy code
p_ = np.polyfit(x_, f.values.transpose(), 1)
np.testing.assert_allclose(p_[0], a_)

Numpy's polyfit function is already vectorized in the sense that it accepts 1D x and 2D y, performing the fit independently over each column of y. To extend this to ND, we would just need to reshape the data going in and out of the function. We do this already in other packages. For dask, we could simply require that the dimension over which the fit is calculated be contiguous, and then call map_blocks.
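The reshape strategy described above can be sketched in a few lines. This is a minimal illustration (not xarray's eventual implementation, and `polyfit_nd` is a hypothetical helper name): move the fit dimension to the front, flatten the remaining dimensions into columns, fit once, then restore the shape.

```python
# Sketch of the reshape strategy: ND -> 2D -> np.polyfit -> ND.
import numpy as np

def polyfit_nd(x, y, deg, axis=0):
    """Fit a polynomial against 1D x along `axis` of an ND array y."""
    y = np.moveaxis(y, axis, 0)            # put the fit dimension first
    out_shape = y.shape[1:]                # shape of the non-fit dimensions
    coeffs = np.polyfit(x, y.reshape(y.shape[0], -1), deg)  # (deg+1, ncols)
    return coeffs.reshape((deg + 1,) + out_shape)

# Example: independent linear fits over a (time, lat, lon) cube.
t = np.linspace(0.0, 1.0, 10)
slopes = np.arange(6.0).reshape(2, 3)      # one slope per (lat, lon) point
cube = t[:, None, None] * slopes[None, :, :]
p = polyfit_nd(t, cube, deg=1, axis=0)
np.testing.assert_allclose(p[0], slopes, atol=1e-10)  # leading coeff == slope
```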

Thoughts?

@dcherian (Contributor) commented Sep 27, 2019

dask has lstsq (https://docs.dask.org/en/latest/array-api.html#dask.array.linalg.lstsq). Would that avoid the dimension-must-have-one-chunk issue?
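The least-squares formulation behind this suggestion can be sketched with plain numpy: polyfit is just lstsq against a Vandermonde matrix, and dask.array.linalg.lstsq takes the same (A, b) arguments, so the construction carries over to chunked arrays. A minimal sketch:

```python
# polyfit(x, y, deg) rewritten as lstsq on a Vandermonde matrix; swapping
# np.linalg.lstsq for dask.array.linalg.lstsq gives the chunked version.
import numpy as np

x = np.linspace(0.0, 1.0, 10)
y = np.stack([2.0 * x + 1.0, -3.0 * x + 0.5], axis=1)  # two columns to fit

deg = 1
A = np.vander(x, deg + 1)                  # columns: [x, 1]
coeffs, res, rank, sv = np.linalg.lstsq(A, y, rcond=None)
np.testing.assert_allclose(coeffs[:, 0], [2.0, 1.0])   # slope, intercept
np.testing.assert_allclose(coeffs[:, 1], [-3.0, 0.5])
```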

EDIT: I am in favour of adding this. It's a common use case, like differentiate and integrate.

@TomNicholas (Member)

I am in favour of adding this (and other common functionality), but perhaps we should first settle where extra functionality should live in general (the scipy to xarray's numpy, if you will), if only because otherwise the API could grow to an unwieldy size.

I can't remember where the relevant issue was, but for example this might go under an xarray.utils module?

@darothen

I second @TomNicholas' point... functionality like this would be wonderful to have, but where would be the best place for it to live?

@rabernat (Contributor, Author)

The question of a standalone library has come up many times (see discussion in #1850). Everyone agrees it's a nice idea, but no one seems to have the bandwidth to take on ownership and maintenance of such a project.

Perhaps we need to put this issue on pause and figure out a general strategy here. The current Xarray API is far from a complete feature set, so more development is needed. But we should decide what belongs in xarray and what belongs elsewhere. #1850 is probably the best place to continue that conversation.

@dcherian (Contributor)

The quickest way to close this is probably to extend @fujiisoup's xr-scipy (https://github.com/fujiisoup/xr-scipy) to wrap scipy.linalg.lstsq and dask.array.linalg.lstsq. It is likely that all the necessary helper functions already exist.

@rabernat (Contributor, Author)

Now that xarray itself has interpolate, gradient, and integrate, it seems like the only thing left in xr-scipy is fourier transforms, which is also what we provide in xrft! 😆

@shoyer (Member) commented Sep 30, 2019

From a user perspective, I think people prefer to find stuff in one place.

From a maintainer perspective, as long as it's somewhat domain agnostic (e.g., "physical sciences" rather than "oceanography") and written to a reasonable level of code quality, I think it's fine to toss it into xarray. "Already exists in NumPy/SciPy" is probably a reasonable proxy for the former.

So I say: yes, let's toss in polyfit, along with fast fourier transforms.

If we're concerned about clutter, we can put stuff in a dedicated namespace, e.g., xarray.wrappers.

@rabernat (Contributor, Author) commented Oct 2, 2019

I'm getting déjà vu here... Xscale has a huge and impressive-sounding API, but no code coverage and no commits since January. Like many of these projects, it seems to have bitten off more than its maintainers could chew.

Edit: I'd love for such a package to really achieve community uptake and become sustainable. I just don't quite know the roadmap for getting there.

@shoyer (Member) commented Oct 2, 2019 via email

@dcherian (Contributor) commented Dec 3, 2019

xyzpy has a polyfit too: https://xyzpy.readthedocs.io/en/latest/manipulate.html

@huard (Contributor) commented Dec 13, 2019

Started to work on this and am facing some issues with the x-coordinate when it's a datetime. For standard calendars, I can use pd.to_numeric(da.time), but for non-standard calendars, it's not clear how to proceed. If I use xr.coding.times.encode_cf_datetime(coord), the coefficients we find will only make sense in the polyval function if we use the same time encoding.

@spencerkclark (Member)

If I understand correctly, pd.to_numeric (and its inverse) works because it always uses 1970-01-01T00:00:00 as the reference date. Could we do something similar when working with cftime dates?

Within xarray we typically convert dates to numeric values (e.g. when doing interpolation) using xarray.core.duck_array_ops.datetime_to_numeric, which takes an optional offset argument to control the reference date. Would it work to always make sure to pass 1970-01-01T00:00:00 with the appropriate calendar type as the offset when constructing the ordinal x-coordinate for polyfit/polyval?
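A numpy-only sketch of the fixed-reference-date idea (the offset argument to datetime_to_numeric plays the role of the epoch subtraction here; cftime dates would need a cftime epoch rather than a datetime64 one):

```python
# Subtracting a fixed epoch turns dates into plain floats that polyfit and
# polyval can share; both must use the same epoch for coefficients to make sense.
import numpy as np

times = np.arange("2000-01-01", "2000-01-11", dtype="datetime64[D]")
epoch = np.datetime64("1970-01-01", "D")       # fixed reference date
x = (times - epoch) / np.timedelta64(1, "D")   # float days since epoch

y = 0.1 * x + 3.0                              # synthetic linear trend
coeffs = np.polyfit(x, y, 1)

# polyval only reproduces y if it converts times with the same epoch:
x2 = (times - epoch) / np.timedelta64(1, "D")
np.testing.assert_allclose(np.polyval(coeffs, x2), y)
```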

@huard (Contributor) commented Dec 13, 2019

Thanks, it seems to work!

@spencerkclark (Member)

Excellent, looking forward to seeing it in a PR!

@huard (Contributor) commented Dec 13, 2019

My current implementation is pretty naive. It's just calling numpy.polyfit using dask.array.apply_along_axis. Happy to put that in a PR as a starting point, but there are a couple of questions I had:

  • How to return the full output (residuals, rank, singular_values, rcond)? As a tuple of DataArrays or a Dataset?
  • Do we want to use the dask least-squares functionality to allow for chunking within the x dimension? Then it's not just a simple wrapper around polyfit.
  • Should we use np.polyfit or np.polynomial.polynomial.polyfit?
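The naive approach described above can be sketched with numpy alone; dask.array.apply_along_axis mirrors numpy's interface, so substituting it gives the chunked version (provided the fit axis sits in a single chunk). A minimal illustration:

```python
# np.polyfit applied independently along the last axis of an ND array;
# dask.array.apply_along_axis accepts the same arguments for chunked input.
import numpy as np

def _fit_1d(y1d, x, deg):
    return np.polyfit(x, y1d, deg)

x = np.linspace(0.0, 1.0, 8)
y = x[None, None, :] * np.arange(6.0).reshape(2, 3, 1)  # slopes 0..5
coeffs = np.apply_along_axis(_fit_1d, -1, y, x, 1)      # shape (2, 3, 2)
np.testing.assert_allclose(coeffs[..., 0],
                           np.arange(6.0).reshape(2, 3), atol=1e-10)
```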

@maboualidev commented Dec 14, 2019

geocat.comp.ndpolyfit extends NumPy.polyfit for multi-dimensional arrays and has support for Xarray and Dask. It does exactly what is requested here.

regards,

@andersy005 @clyne @matt-long @khallock

@huard (Contributor) commented Dec 14, 2019

@maboualidev Nice! I see you're storing the residuals in the DataArray attributes. From my perspective, it would be useful to have those directly as DataArrays. Thoughts?

So it looks like there are multiple inspirations to draw from. Here is what I could gather.

  • xscale.signal.fitting.polyfit(obj, deg=1, dim=None, coord=None) supports chunking along the fitting dimension using dask.array.linalg.lstsq. No explicit missing data handling.
  • xyzpy.signal.xr_polyfit(obj, dim, ix=None, deg=0.5, poly='hermite') applies np.polynomial.polynomial.polyfit using xr.apply_ufunc along dim with the help of numba. Also supports other types of polynomial (legendre, chebyshev, ...). Missing values are masked out 1D-wise.
  • geocat.comp.ndpolyfit(x: Iterable, y: Iterable, deg: int, axis: int = 0, **kwargs) -> (xr.DataArray, da.Array) reorders the array to apply np.polyfit along dim, returns the full outputs (residuals, rank, etc) as DataArray attributes. Missing values are masked out in bulk if possible, 1D-wise otherwise.

There does not seem to be matching polyval implementations for any of those nor support for indexing along a time dimension with a non-standard calendar.

@maboualidev commented Dec 14, 2019

Hi @huard, thanks for the reply.

Regarding:

There does not seem to be matching polyval implementations for any of those nor support for indexing along a time dimension with a non-standard calendar.

There is a pull request on GeoCAT-comp for ndpolyval. I think polyval and polyfit go hand in hand: if we have ndpolyfit, there must also be an ndpolyval.

Regarding:

I see you're storing the residuals in the DataArray attributes. From my perspective, it would be useful to have those directly as DataArrays. Thoughts?

I see your point and agree with you. I think it is a good idea to stay as close to numpy.polyfit as possible, even in the style of the output. I will see that this gets changed in GeoCAT.

attn: @clyne and @khallock

@huard (Contributor) commented Dec 16, 2019

@maboualidev Is your objective to integrate the GeoCAT implementation into xarray or to keep it standalone?

On my end, I'll submit a PR to add support for non-standard calendars to xarray.core.missing.get_clean_interp_index, which you'd then be able to use to get x values from coordinates.

@maboualidev

@maboualidev Is your objective to integrate the GeoCAT implementation into xarray or to keep it standalone?

GeoCAT is the Python version of NCL, and we are a team at NCAR working on it. I know that the team decision is to make use of xarray within GeoCAT as much as possible, though.

@clyne commented Dec 16, 2019 via email

@huard (Contributor) commented Dec 16, 2019

@clyne Let me rephrase my question: how do you feel about xarray providing a polyfit/polyval implementation that essentially duplicates GeoCAT's?

@clyne commented Dec 17, 2019

GeoCAT is licensed under Apache 2.0. So if someone wants to incorporate it into Xarray they are welcome to it :-)

@aulemahal (Contributor)

I pushed a new PR trying to implement polyfit in xarray: #3733. It is still a work in progress, but I would like the opinion of those who participated in this thread.

Considering all options discussed in the thread, I chose an implementation that seemed to give the best performance and generality (skipping NaN values), but it duplicates a lot of code from numpy.polyfit.

Main question:

  • Should xarray's implementation really replicate the behaviour of numpy's?

A lot of extra code could be removed if we'd say we only want to compute and return the residuals and the coefficients. All the other variables are a few lines of code away for the user that really wants them, and they don't need the power of xarray and dask anyway.
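For reference, the "full" numpy output under discussion is what np.polyfit returns with full=True: the coefficients plus the underlying lstsq diagnostics, which is the set an xarray implementation would have to map onto Dataset variables (or drop, as suggested above).

```python
# The extra outputs beyond the coefficients that replicating numpy implies.
import numpy as np

x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 1.0
coeffs, residuals, rank, singular_values, rcond = np.polyfit(x, y, 1, full=True)
np.testing.assert_allclose(coeffs, [2.0, 1.0])
assert rank == 2   # the degree-1 Vandermonde matrix has full rank
```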

I'm guessing @huard @dcherian @rabernat and @shoyer might have comments.
