Keep attrs & Add a 'keep_coords' argument to Dataset.apply #688

Closed
max-sixty opened this issue Dec 29, 2015 · 14 comments

Comments

@max-sixty
Collaborator

Generally this isn't a problem, since the coords are carried over by the resulting DataArrays:

In [11]:

ds = xray.Dataset({
        'a':pd.DataFrame(pd.np.random.rand(10,3)),
        'b':pd.Series(pd.np.random.rand(10))
    })
ds.coords['c'] = pd.Series(pd.np.random.rand(10))
ds
Out[11]:
<xray.Dataset>
Dimensions:  (dim_0: 10, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * dim_1    (dim_1) int64 0 1 2
    c        (dim_0) float64 0.9318 0.2899 0.3853 0.6235 0.9436 0.7928 ...
Data variables:
    a        (dim_0, dim_1) float64 0.5707 0.9485 0.3541 0.5987 0.406 0.7992 ...
    b        (dim_0) float64 0.4106 0.2316 0.5804 0.6393 0.5715 0.6463 ...
In [12]:

ds.apply(lambda x: x*2)
Out[12]:
<xray.Dataset>
Dimensions:  (dim_0: 10, dim_1: 3)
Coordinates:
    c        (dim_0) float64 0.9318 0.2899 0.3853 0.6235 0.9436 0.7928 ...
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * dim_1    (dim_1) int64 0 1 2
Data variables:
    a        (dim_0, dim_1) float64 1.141 1.897 0.7081 1.197 0.812 1.598 ...
    b        (dim_0) float64 0.8212 0.4631 1.161 1.279 1.143 1.293 0.3507 ...

But if an operation strips the coords from the DataArrays, they are missing from the result (note that c is absent below).
Should the Dataset retain them? Either always or with a keep_coords argument, similar to keep_attrs.

In [13]:

ds = xray.Dataset({
        'a':pd.DataFrame(pd.np.random.rand(10,3)),
        'b':pd.Series(pd.np.random.rand(10))
    })
ds.coords['c'] = pd.Series(pd.np.random.rand(10))
ds
Out[13]:
<xray.Dataset>
Dimensions:  (dim_0: 10, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * dim_1    (dim_1) int64 0 1 2
    c        (dim_0) float64 0.4121 0.2507 0.6326 0.4031 0.6169 0.441 0.1146 ...
Data variables:
    a        (dim_0, dim_1) float64 0.4813 0.2479 0.5158 0.2787 0.06672 ...
    b        (dim_0) float64 0.2638 0.5788 0.6591 0.7174 0.3645 0.5655 ...
In [14]:

ds.apply(lambda x: x.to_pandas()*2)
Out[14]:
<xray.Dataset>
Dimensions:  (dim_0: 10, dim_1: 3)
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4 5 6 7 8 9
  * dim_1    (dim_1) int64 0 1 2
Data variables:
    a        (dim_0, dim_1) float64 0.9627 0.4957 1.032 0.5574 0.1334 0.8289 ...
    b        (dim_0) float64 0.5275 1.158 1.318 1.435 0.7291 1.131 0.1903 ...
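Until something like keep_coords exists, one possible workaround is to re-attach the dropped coordinates after the call. The following is only a sketch under stated assumptions: `map_keep_coords` and `strip_coords` are hypothetical names, the package is imported as `xarray` rather than `xray`, and `Dataset.map` is used because recent xarray releases use that name for `Dataset.apply`. The helper restores only coordinates whose dimensions still exist in the result, and it does not check that their sizes or labels are still valid.

```python
import numpy as np
import xarray as xr


def map_keep_coords(ds, func):
    """Hypothetical helper: map func over the data variables, then
    re-attach any coordinates of ds that the result no longer has."""
    result = ds.map(func)
    dropped = {
        name: coord.variable
        for name, coord in ds.coords.items()
        if name not in result.coords and set(coord.dims) <= set(result.dims)
    }
    return result.assign_coords(dropped)


def strip_coords(da):
    # Stand-in for a coord-losing operation; unlike `x.to_pandas() * 2`,
    # this drops every coordinate, including the index coordinates.
    return xr.DataArray(da.values * 2, dims=da.dims)


ds = xr.Dataset(
    {
        "a": (("dim_0", "dim_1"), np.random.rand(10, 3)),
        "b": ("dim_0", np.random.rand(10)),
    },
    coords={
        "dim_0": np.arange(10),
        "dim_1": np.arange(3),
        "c": ("dim_0", np.random.rand(10)),
    },
)

plain = ds.map(strip_coords)             # 'c' is gone
kept = map_keep_coords(ds, strip_coords)  # 'c' is restored
```

Restoring coordinates after the fact is only safe when the applied function leaves the shape and labels of each dimension unchanged.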
@shoyer
Member

shoyer commented Dec 29, 2015

I would be fine with a keep_coords argument.

I'm wary of always keeping coordinates, because some applied operations could make existing coordinates invalid. For example, suppose you want to use pandas's faster time-resampling, i.e., ds.apply(lambda x: x.to_pandas().resample('24H')). Any coordinates along the time dimension would no longer be valid. We could automatically align the coordinates, but that starts to get increasingly magical...
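To make the concern concrete, here is a small pandas-only sketch. The names `data` and `aux` are made up for illustration, `aux` plays the role of an extra coordinate along time, and `"1D"` is used in place of `'24H'`. After resampling, the data has 2 time steps while the old coordinate still has 48, so it cannot simply be carried over; someone has to choose how to reduce it.

```python
import numpy as np
import pandas as pd

# 48 hourly time steps (two full days).
times = pd.date_range("2000-01-01", periods=48, freq=pd.Timedelta(hours=1))
data = pd.Series(np.arange(48.0), index=times)
aux = pd.Series(np.arange(48), index=times)  # hypothetical coordinate along time

# Daily means: the time axis shrinks from 48 entries to 2.
daily = data.resample("1D").mean()

# The old coordinate no longer matches the new time axis. Keeping it
# requires an explicit choice of reduction; "first" is arbitrary here.
daily_aux = aux.resample("1D").first()
```

Which reduction is right (first, last, mean, or dropping the coordinate) depends on what the coordinate means, which is why doing it automatically would be guesswork.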

@max-sixty
Collaborator Author

Great @shoyer, agreed

max-sixty changed the title from "Should Dataset.apply retain coords?" to "Add a 'keep_coords' argument to Dataset.apply" on Dec 29, 2015
@max-sixty
Collaborator Author

Also, attrs get cleared. I think they should be retained by default?
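Whatever the default is in a given release, the attributes can be requested explicitly. A minimal sketch, assuming a recent xarray where `Dataset.apply` is spelled `Dataset.map` and accepts `keep_attrs`; the dataset and its `source` attribute are made up for illustration:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", np.arange(3.0))}, attrs={"source": "example"})

# Ask explicitly for the Dataset-level attrs to be carried over.
doubled = ds.map(lambda da: da * 2, keep_attrs=True)
```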

max-sixty changed the title from "Add a 'keep_coords' argument to Dataset.apply" to "Keep attrs & Add a 'keep_coords' argument to Dataset.apply" on Feb 11, 2016
@snowman2
Contributor

Are there plans for a 'keep_coords' for Dataset.resample as well?

@shoyer
Member

shoyer commented Apr 26, 2017

@snowman2 Possibly yes, though we would want to think through the use-cases for this first. Arguably, you should explicitly preserve coordinates in your custom callable instead.
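As a rough sketch of what preserving coordinates inside the callable could look like: the function below does its work through pandas, then puts the result back into a copy of the input with `DataArray.copy(data=...)`, which keeps the coords and attrs of the original. `double_keep_coords` is a hypothetical name, and `Dataset.map` is assumed as the current spelling of `Dataset.apply`. This only works when the output has the same shape as the input.

```python
import numpy as np
import xarray as xr


def double_keep_coords(da):
    # Do the computation in pandas, as in the original example...
    doubled = da.to_pandas() * 2
    # ...then reuse the input's dims, coords and attrs for the result.
    return da.copy(data=doubled.to_numpy())


ds = xr.Dataset(
    {
        "a": (("dim_0", "dim_1"), np.random.rand(10, 3)),
        "b": ("dim_0", np.random.rand(10)),
    },
    coords={"c": ("dim_0", np.random.rand(10))},
)

result = ds.map(double_keep_coords)
```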

@snowman2
Contributor

You could do it in the custom callable, but it requires less expertise and fewer lines of code to add that as an option. The use case I have is land surface model output with x,y coordinates that I would like to preserve.

@shoyer
Member

shoyer commented Apr 30, 2017

@snowman2 Can you give a concrete example of the sort of function you would want to apply?

@snowman2
Contributor

snowman2 commented May 1, 2017

I need input data for a hydrology model at an hourly timestep. So, I use the Dataset.resample method on data from land surface models to achieve that. Then, I use a custom linear interpolation to fill in the NaNs. I then write out the data to a file. It is easier to write the resampled dataset to the file with the necessary information if the x,y coordinates are not removed in the Dataset.resample method.
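For readers unfamiliar with this kind of workflow, the upsample-then-interpolate step can be sketched with plain pandas. This is an illustration only, not the code referred to in this comment: the values are made up, a single Series stands in for one grid cell, and pandas' built-in linear interpolation stands in for the custom one.

```python
import numpy as np
import pandas as pd

# Three made-up samples at a 3-hour spacing.
three_hourly = pd.Series(
    [0.0, 3.0, 6.0],
    index=pd.date_range("2017-05-01", periods=3, freq=pd.Timedelta(hours=3)),
)

# Upsample to hourly: the new time steps are filled with NaN.
hourly = three_hourly.resample(pd.Timedelta(hours=1)).asfreq()

# Fill the NaNs by linear interpolation between the known samples.
filled = hourly.interpolate(method="linear")
```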

@shoyer
Member

shoyer commented May 25, 2017

@snowman2 I tried to reproduce your issue, but I couldn't make resample drop coordinates:

In [21]: ds = xarray.tutorial.load_dataset('rasm')

In [22]: ds.resample('AS', 'time', how=np.sum)
Out[22]:
<xarray.Dataset>
Dimensions:  (time: 4, x: 275, y: 205)
Coordinates:
    yc       (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
    xc       (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
  * time     (time) datetime64[ns] 1980-01-01 1981-01-01 1982-01-01 1983-01-01
Dimensions without coordinates: x, y
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...

@snowman2
Contributor

@shoyer, thanks for looking into it. I am resampling from 3hr data to 1hr data.

resampled_ds = ds.resample('1H', dim='time', keep_attrs=True)

I am using it here:
https://github.com/CI-WATER/gsshapy/blob/f4e5cb13c1d528021e1953859b712553a4162311/gsshapy/grid/grid_to_gssha.py#L789-L844

I ran into the issue there and had to add code to make sure the coordinates were copied.

Thanks!

@shoyer
Member

shoyer commented May 25, 2017

@snowman2 Can you print an example of what self.data looks like? And the desired vs. actual output if you remove the lines that add the coordinates back manually?

@snowman2
Contributor

Strange, but I can't seem to reproduce the issue. Maybe it was on a Windows machine or maybe it is fixed now.

@stale

stale bot commented Apr 25, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@max-sixty
Collaborator Author

Closing as stale

3 participants