
Return a 3D object alongside 1D object in apply_ufunc #8695

Closed
ahuang11 opened this issue Feb 2, 2024 · 7 comments
Labels: enhancement, plan to close

Comments

ahuang11 (Contributor) commented Feb 2, 2024

Is your feature request related to a problem?

Currently, I have something similar to this, where input_lat is transformed to new_lat (here, +0.25, but in the real use case it's non-deterministic).

Since apply_ufunc doesn't return a dataset with the actual coordinate values, I had to return a second output to retain new_lat and properly update the coordinate values. But this second output is shaped (time, lat, lon), so I have to do ds["lat"] = new_lat.isel(lon=0, time=0).values, which I think is inefficient; I simply need it to be shaped (lat,).

Any ideas on how I can modify this to make it more efficient?

import xarray as xr
import numpy as np

air = xr.tutorial.open_dataset("air_temperature")["air"]
input_lat = np.arange(20, 45)

def interp1d_np(data, base_lat, input_lat):
    new_lat = input_lat + 0.25
    return np.interp(new_lat, base_lat, data), new_lat

ds, new_lat = xr.apply_ufunc(
    interp1d_np,  # first the function
    air,
    air.lat,  # as above
    input_lat,  # as above
    input_core_dims=[["lat"], ["lat"], ["lat"]],  # list with one entry per arg
    output_core_dims=[["lat"], ["lat"]],  # list with one entry per output
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be a set!
    vectorize=True,  # loop over non-core dims
)
new_lat = new_lat.isel(lon=0, time=0).values
ds["lat"] = new_lat

Describe the solution you'd like

Either be able to automatically assign new_lat to the returned xarray object, or allow a 1D output to be returned alongside the 3D one.

Describe alternatives you've considered

No response

Additional context

No response

slevang (Contributor) commented Feb 6, 2024

If new_lat is really 1-dimensional, meaning it doesn't vary by lat/lon point, why can't it be calculated or defined prior to running anything through apply_ufunc, and then passed through as an input to your interp function?
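For the case where new_lat really is known up front, the original example could be restructured along these lines. This is a sketch using synthetic data in place of the air_temperature tutorial dataset; the variable names mirror the original post.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the air_temperature example: dims (time, lat, lon)
air = xr.DataArray(
    np.random.rand(4, 25, 3),
    dims=["time", "lat", "lon"],
    coords={"lat": np.arange(20, 45)},
)
new_lat = np.arange(20, 45) + 0.25  # computed once, outside apply_ufunc


def interp1d_np(data, base_lat, new_lat):
    # only the interpolated data comes back; new_lat is an input, not an output
    return np.interp(new_lat, base_lat, data)


interped = xr.apply_ufunc(
    interp1d_np,
    air,
    air.lat,
    new_lat,
    input_core_dims=[["lat"], ["lat"], ["lat"]],
    output_core_dims=[["lat"]],
    exclude_dims={"lat"},  # "lat" is allowed to change size
    vectorize=True,
)
# new_lat is already 1D, so the coordinate can be assigned directly
interped = interped.assign_coords(lat=new_lat)
```

No isel(lon=0, time=0) is needed because new_lat never picks up the broadcast dimensions in the first place.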

ahuang11 (Contributor, Author) commented Feb 6, 2024

It's non-deterministic; it's decided inside apply_ufunc:

import xarray as xr
from tsdownsample import MinMaxLTTBDownsampler


def _help_downsample(data, time, n_out):
    indices = MinMaxLTTBDownsampler().downsample(time, data, n_out=n_out)
    return data[indices], indices


def apply_downsample(ts_ds, n_out):
    ts_ds_downsampled, indices = xr.apply_ufunc(
        _help_downsample,
        ts_ds["data"],
        ts_ds["time"],
        kwargs=dict(n_out=n_out),
        input_core_dims=[["time"], ["time"]],
        output_core_dims=[["time"], ["indices"]],
        exclude_dims=set(("time",)),
        vectorize=True,
        dask="parallelized",
        dask_gufunc_kwargs=dict(output_sizes={"time": n_out, "indices": n_out}),
    )
    ts_ds_downsampled["time"] = ts_ds["time"].isel(time=indices.values[0])
    return ts_ds_downsampled.rename("data")

slevang (Contributor) commented Feb 6, 2024

If indices is non-deterministic for every time _help_downsample is called for new data, doesn't that mean you want a unique vector of indices at all non-core input dims? You're calling apply_ufunc(..., vectorize=True) so the helper function is being called once per 1D timeseries, and selecting indices[0] would be throwing away unique index vectors in that case.
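A toy illustration of that point: with vectorize=True the helper runs once per 1D slice, so each channel can come back with a different index vector, and indices[0] only describes the first one. The argsort-based picker here is a deliberately data-dependent stand-in for a real downsampler like MinMaxLTTBDownsampler.

```python
import numpy as np
import xarray as xr


def pick_indices(data, n_out):
    # keep the positions of the n_out largest samples -- data-dependent on purpose
    return np.sort(np.argsort(data)[-n_out:])


da = xr.DataArray([[0, 5, 1, 4], [9, 2, 8, 3]], dims=["channel", "time"])

indices = xr.apply_ufunc(
    pick_indices,
    da,
    kwargs=dict(n_out=2),
    input_core_dims=[["time"]],
    output_core_dims=[["indices"]],
    vectorize=True,  # helper is called once per channel
)
# indices is (channel, indices)-shaped and differs per channel,
# so indexing time with indices[0] would discard channel 1's picks
```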

ahuang11 (Contributor, Author) commented Feb 6, 2024

You're right. I suppose I misunderstood; I thought it was stacking all the channels to calculate the indices.

Perhaps what I really want to do is just the following without apply_ufunc

new_da = da.stack({"temp": ["time", "channel"]})
MinMaxLTTBDownsampler().downsample(new_da["temp"], new_da["data"], n_out=n_out)

dcherian (Contributor) commented Feb 6, 2024

I've run into this before. There's no way to limit the broadcasting behaviour, so if there are broadcast dimensions on the input variables, those are assumed to be the broadcast dimensions on all output variables.

I believe the best solution is #1618. In the meantime, if you don't need vectorize, you can specify all input and output dims as core dimensions, and then there are no broadcast dimensions at all.
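A minimal sketch of that workaround, again assuming a synthetic (time, lat, lon) array in place of the tutorial data: with every dimension declared as a core dimension there is no broadcasting, so the two outputs can have completely different shapes (a 3D result plus a genuinely 1D coordinate).

```python
import numpy as np
import xarray as xr

air = xr.DataArray(
    np.random.rand(4, 25, 3),
    dims=["time", "lat", "lon"],
    coords={"lat": np.arange(20, 45)},
)


def interp_full(data, base_lat, input_lat):
    # data arrives here as the full (time, lat, lon) array -- no vectorize loop
    new_lat = input_lat + 0.25
    out = np.apply_along_axis(
        lambda col: np.interp(new_lat, base_lat, col), 1, data
    )
    return out, new_lat  # 3D data plus a 1D coordinate


interped, new_lat = xr.apply_ufunc(
    interp_full,
    air,
    air.lat,
    np.arange(20, 45),
    input_core_dims=[["time", "lat", "lon"], ["lat"], ["lat"]],
    output_core_dims=[["time", "lat", "lon"], ["lat"]],
    exclude_dims={"lat"},
)
interped = interped.assign_coords(lat=new_lat)  # plain 1D assignment
```

The trade-off is that the function now handles the looping (or array-level work) itself instead of relying on vectorize.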

dcherian added the plan to close label on Feb 6, 2024
ahuang11 (Contributor, Author) commented
Is there a way to do this without a loop?

import numpy as np
import xarray as xr
t0 = [0, 1, 2]
t1 = [3, 4, 5]

ts = [t0, t1]
z = np.array([[6, 7, 8], [9, 10, 11]])

ds = xr.DataArray(z, dims=["channel", "time"]).to_dataset("channel")
for i, channel in enumerate(ds.data_vars):
    dim = f"time_{i}"
    ds[channel] = (dim, ds[channel].values)
    ds[channel] = ds[channel].assign_coords({dim: ts[i]})
ds

slevang (Contributor) commented Feb 15, 2024

If the time vectors are all the same length, you can just store as a multi-dimensional coordinate:

ds = xr.DataArray(
    z, 
    dims=["channel", "t_idx"], 
    coords={"time": (("channel", "t_idx"), ts)}
)
