feature: multi-dimensional bins #28

Open
aaronspring opened this issue Mar 11, 2021 · 12 comments

Comments

@aaronspring
Contributor

Currently, xhistogram only allows bins to be one-dimensional.

However, when the bin edges vary in time (seasonality) or space (locations on the globe), xhistogram cannot be used, because there is a hard-coded requirement that each element of bins is 1-D.
One such multi-dim bin application is the ranked probability score (RPS) we use in xskillscore.rps, where we want to know how many forecasts fell into which bins. Bins are often defined as terciles of the forecast distribution, and the bin edges for these terciles (forecast_with_lon_lat_time_dims.quantile(q=[.33,.66],dim='time')) depend on lon and lat.

How we solved this in xskillscore.rps:
The < comparison gives us CDFs, and diff brings them back to histograms. The upper edge may have to be thrown away.

Fc = (forecasts < forecasts_edges).mean('member').diff(bin_dim)
Oc = (observations < observations_edges).astype("int").diff(bin_dim)

https://github.com/xarray-contrib/xskillscore/blob/493f9afd7b5acefb4baa47bec6ad65fca19965bd/xskillscore/core/probabilistic.py#L680

I first implemented rps with xhistogram and then with the snippet above; both yield the same results.
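For readers who want to try the CDF-then-diff idea without xskillscore, here is a minimal NumPy-only sketch. The toy data, the array layout (member, lat, lon), and the -inf/+inf padding of the edges are assumptions made for illustration; this is not the xskillscore implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 50 ensemble members on a 3 x 4 (lat, lon) grid.
forecasts = rng.normal(size=(50, 3, 4))  # dims: (member, lat, lon)

# Tercile edges computed separately at every grid point, so the edges
# themselves are multi-dimensional: (2, lat, lon).
inner = np.quantile(forecasts, [1 / 3, 2 / 3], axis=0)

# Pad with -inf/+inf (an assumption of this sketch) so every value lands in a bin.
pad = np.full((1,) + inner.shape[1:], np.inf)
edges = np.concatenate([-pad, inner, pad], axis=0)  # (edge=4, lat, lon)

# Broadcast (member, 1, lat, lon) against (1, edge, lat, lon).
below = forecasts[:, None] < edges[None]

# Averaging over members gives the empirical CDF at each edge;
# differencing along the edge axis turns the CDF back into per-bin fractions.
cdf = below.mean(axis=0)               # (edge, lat, lon)
bin_fraction = np.diff(cdf, axis=0)    # (bin=3, lat, lon)
counts = np.rint(bin_fraction * forecasts.shape[0]).astype(int)
```

Because the comparison is a strict <, each bin here is closed on the left and open on the right.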

However, I am not sure whether such multi-dimensional bins would be an interesting addition to xhistogram or are out-of-scope.

@dcherian

I think this dask PR is related: dask/dask#7346

@aaronspring
Contributor Author

After quickly skimming over np.histogram2d and this issue, I think that I am asking for a different thing here: I want bins to be allowed to be multi-dimensional, whereas in np.histogram bins is a 1-D array or an int.

@dougiesquire
Contributor

I think this is a really nice contribution that would be appreciated by many users, though I'm not sure whether it's out of scope for xhistogram (thoughts @rabernat?).

One simple option could be to allow bins to be an xarray object, in which case we use something like the approach you give above to compute the histogram?

@TomNicholas
Member

TomNicholas commented May 27, 2021

@aaronspring I'm trying to make sure I understand what you're proposing here - is this an accurate summary of what you're suggesting?:

If I have data which includes a time dimension, and I currently count along other dimension(s), my result will still have a time dimension, but the resultant bin coordinates are currently only allowed to be one-dimensional. You are proposing to be able to pass a multidimensional bins array (for each variable potentially for an N-D histogram) which has bin edges that are a function of time. The result would have the same shape array for the bin_counts, but the bin coordinates on the output would vary along this time dimension (passed straight through from your input). You would have ended up with a histogram whose counts and bin edges changed over time - effectively a set of separate histograms calculated independently for each point in time.

If that's what you're suggesting then it actually sounds fairly doable - it's basically just allowing the bins arguments to be ND and then making sure they broadcast properly. You would also need an input check that your bins arrays don't vary along any of the dimensions you want to count over, because I think that would be nonsensical.
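To make the summary above concrete, here is a small NumPy sketch of "a set of separate histograms calculated independently for each point in time", together with the suggested input check. The helper names `check_bins_dims` and `histogram_per_time`, and the dimension names, are hypothetical and only for illustration; nothing here is xhistogram API.

```python
import numpy as np

def check_bins_dims(bins_dims, count_dims):
    """Reject bin edges that vary along a dimension being counted over."""
    clash = set(bins_dims) & set(count_dims)
    if clash:
        raise ValueError(f"bins must not vary along counted dims: {sorted(clash)}")

def histogram_per_time(data, edges):
    """data: (time, x); edges: (time, n_edges). Count over x for each time step."""
    return np.stack([np.histogram(row, bins=e)[0] for row, e in zip(data, edges)])

data = np.array([[0.1, 0.2, 0.8, 0.9],
                 [1.5, 2.5, 2.6, 3.5]])   # dims: (time, x)
edges = np.array([[0.0, 0.5, 1.0],
                  [1.0, 2.0, 4.0]])       # dims: (time, bin_edge)

# The edges vary along "time" only, and we count over "x", so this passes.
check_bins_dims(bins_dims=("time", "bin_edge"), count_dims=("x",))

counts = histogram_per_time(data, edges)  # dims: (time, bin)
```

The counts keep the time dimension, and the bin edges that go with them differ from one time step to the next.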

@aaronspring
Contributor Author

@TomNicholas I didn't see that linked PR. Thanks for linking it again here.

Simply put: I want to use N-D instead of 1-D arrays as bins in xhistogram.

Here is the kind of API/code example I was looking for: https://gist.github.com/aaronspring/251553f132202cc91aadde03f2a452f9 (don't focus on the results, just the dimensionality)

@dcherian

Here's how to do it with flox: xarray-contrib/flox#203 [well hopefully I got it right ;)]

Rendered version

@TomNicholas
Member

TomNicholas commented Jan 18, 2023 via email

@TomNicholas
Member

What's the plan with flox integration into Xarray? Will it always be optional? Will it become part of main?

Apparently flox will stay optional, so if we moved this functionality into xarray it would rely on an optional import, but that's okay.

@TomNicholas
Member

@dcherian how might this work for N-dimensional histograms? I.e. placing N variables into N sets of bins. That's obviously one of the main features xhistogram provides. I notice your notebook says

The core factorize_ function (which wraps pd.cut) only handles 1D bins, so we use xr.apply_ufunc to vectorize it for us.

Does that mean we would still have to do a reshape of some kind?
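As a rough illustration of the "vectorize a 1-D-bins function" idea quoted above, the sketch below uses np.vectorize with a signature, which is what xr.apply_ufunc(..., vectorize=True) relies on. np.digitize stands in for the pd.cut-based factorize step here; that substitution and the toy arrays are assumptions, not flox internals.

```python
import numpy as np

# np.digitize, like pd.cut, accepts one 1-D set of bin edges per call.
# The signature tells np.vectorize to loop over the leading (time) axis
# and hand matching 1-D slices of data and edges to each call.
digitize_rows = np.vectorize(np.digitize, signature="(n),(m)->(n)")

data = np.array([[0.1, 0.6, 0.9],
                 [1.5, 2.5, 3.5]])    # dims: (time, x)
edges = np.array([[0.0, 0.5, 1.0],
                  [1.0, 2.0, 4.0]])   # dims: (time, bin_edge)

# Integer bin index for every value, using that time step's own edges.
bin_index = digitize_rows(data, edges)
```

In this toy case no manual reshape is needed, because the loop dimension is handled by the vectorization itself; whether the same holds for the N-variable case is the open question in this comment.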

@dcherian

dcherian commented Feb 3, 2023

https://flox.readthedocs.io/en/latest/intro.html#histogramming-binning-by-multiple-variables

Does this make it clear?

@TomNicholas
Member

Ooh!

Checking I've understood this correctly:

xarray_reduce(
    da,                  # weights, here just ones
    "labels1",           # name of 1st variable we want to bin
    "labels2",           # name of 2nd variable we want to bin
    func="count",        # count occurrences falling in bins
    expected_groups=(
        pd.IntervalIndex.from_breaks([-0.5, 4.5, 6.5, 8.9]),  # bins for 1st variable
        pd.IntervalIndex.from_breaks([0.5, 1.5, 1.9]),        # bins for 2nd variable
    ),
)
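For comparison, the same kind of two-variable count can be sketched with plain NumPy using the bin edges from the call above. The label values below are made up for illustration. One caveat: np.histogram2d treats bins as closed on the left, while pd.IntervalIndex.from_breaks defaults to closed on the right, so results can differ for values that sit exactly on an edge; the toy values avoid the edges.

```python
import numpy as np

# Hypothetical values of the two variables being binned (not from the flox docs).
labels1 = np.array([0, 1, 5, 5, 7, 8])
labels2 = np.array([1, 1, 1, 1.7, 1.7, 1])

edges1 = [-0.5, 4.5, 6.5, 8.9]   # bins for 1st variable
edges2 = [0.5, 1.5, 1.9]         # bins for 2nd variable

# counts[i, j] = number of points with labels1 in bin i and labels2 in bin j.
counts, _, _ = np.histogram2d(labels1, labels2, bins=[edges1, edges2])
```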

@dcherian

dcherian commented Feb 3, 2023

Yes. PR to improve that page is very welcome!
