Removed `.isel` for `DatasetRolling.construct` consistent rolling behavior #7578

p4perf4ce · 2023-03-02T21:12:39Z

The `Dataset(...).isel(...)` call in the return statement made `DatasetRolling.construct` behave inconsistently with `DataArrayRolling.construct` when `stride` > 1, without providing any benefit.

The bug was reported in #7021.

Closes #7021: Inconsistent behavior between DatasetRolling.construct and DataArrayRolling.construct with stride > 1.
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

`.isel` causes `DatasetRolling.construct` to behave inconsistently with `DataArrayRolling.construct` when `stride` > 1.

…inconsistent-behavior Removed `.isel` for `DatasetRolling.construct` consistent rolling behavior.

headtr1ck · 2023-03-09T20:36:31Z

Good spot. The fix seems good to me.
If you fix the tests and add a whats-new entry, this is good to go.

headtr1ck · 2023-09-15T19:26:15Z

Sorry for the late reply...
This PR is still not providing the correct solution.

For some reason, when the Dataset has a dimension coordinate, the behavior was correct before this PR and is now wrong with it:

E.g. use these two test datasets:

import xarray as xr

ds1 = xr.Dataset({"x": ("t", range(20))})  # no dimension coordinate
ds2 = xr.Dataset({"x": ("t", range(20))}, {"t": range(20)})  # with dimension coordinate "t"

print("Dataset rolling: ds1")
print(ds1.rolling(t=4).construct("w", stride=2).x.shape)  # wrong in main
print("DataArray rolling: ds1")
print(ds1.x.rolling(t=4).construct("w", stride=2).shape)
print("Dataset rolling: ds2")
print(ds2.rolling(t=4).construct("w", stride=2).x.shape)  # wrong in this PR
print("DataArray rolling: ds2")
print(ds2.x.rolling(t=4).construct("w", stride=2).shape)

The unit tests use datasets with a dimension coordinate, so this error was never spotted.
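As a rough sanity check on the shapes printed above: assuming `construct` builds one window per position along `t` and then keeps every `stride`-th window, the expected length is `ceil(n / stride)`. The helper name below is made up for illustration; this is plain arithmetic, not xarray code.

```python
import math


def n_windows(n: int, stride: int) -> int:
    """Length left after keeping every `stride`-th of `n` positions (hypothetical helper)."""
    return math.ceil(n / stride)


n, window, stride = 20, 4, 2

# One pass of striding: the shape DataArrayRolling.construct is expected to give.
expected_shape = (n_windows(n, stride), window)

# Striding an already strided result a second time halves the window count again.
double_strided_shape = (n_windows(n_windows(n, stride), stride), window)

print(expected_shape)        # (10, 4)
print(double_strided_shape)  # (5, 4)
```

If a Dataset result shows the second shape while the DataArray result shows the first, the stride has most likely been applied twice.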

headtr1ck · 2023-09-15T19:40:53Z

After digging a bit more, the problematic line is the return one:
Dataset(dataset, coords=self.obj.coords ...
Here, dataset is a dict of constructed DataArrays (they have the correct shape) that carry the strided "t" index, while self.obj.coords are the original coordinates, which contain the unstrided "t" index.
Upon Dataset creation, these get aligned and NaNs are inserted into the constructed DataArrays.

The isel at the end is supposed to remove the inserted NaNs again.

So I think we have to find an intermediate solution: remove the isel and adapt what we pass to coords.
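The alignment step described above can be reproduced in isolation. This is a minimal sketch, not the code in `rolling.py`: `strided` stands in for one constructed DataArray and `full_t` for the original, unstrided index. Both names are assumptions made for the example.

```python
import numpy as np
import xarray as xr

full_t = np.arange(20)

# Stand-in for a constructed DataArray: it carries the strided "t" index.
strided = xr.DataArray(np.arange(10.0), dims="t", coords={"t": full_t[::2]})

# Passing the unstrided index via `coords` makes the Dataset constructor
# align the data variable to it; positions without data are filled with NaN.
ds = xr.Dataset({"x": strided}, coords={"t": full_t})
print(ds.x.shape)                # length along "t" after alignment
print(int(ds.x.isnull().sum()))  # number of inserted NaNs

# Striding again drops the inserted NaNs, which is what the trailing isel did.
restored = ds.isel(t=slice(None, None, 2))
print(restored.x.values)
```

Without a "t" index on either side there is nothing to align against, so the same trailing `isel` would stride real data instead of removing NaNs.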

headtr1ck · 2023-09-16T19:06:34Z

Never mind. I have added another test with more dimensions and 2D coordinates.
My approach does not work here :/

cannot reindex or align along dimension 't' because of conflicting dimension sizes: {10, 20} (note: an index is found along that dimension with size=10)

Does anyone know how to align properly here? Coords do not have an isel; otherwise one could simply apply the stride there as well.

headtr1ck · 2023-09-17T08:51:02Z

For now the approach is to stride the original dataset and then extract the coords from there. Of course, this also strides the dataset variables, which are then not used, so it is unnecessary computation. However, this approach is already much faster and more memory-efficient than the previous approach.
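A minimal sketch of that idea, using made-up data rather than the actual change in `rolling.py`: stride the original dataset once with `isel`, then reuse its coordinates, which already have the strided length. The `label` coordinate is a hypothetical non-index coordinate added to show that it is strided as well.

```python
import numpy as np
import xarray as xr

stride = 2
ds = xr.Dataset(
    {"x": ("t", np.arange(20.0))},
    coords={"t": np.arange(20), "label": ("t", np.arange(20) * 10)},
)

# The coords have no isel of their own (as noted in the thread), so stride
# the whole dataset, data variables included, and take the coords from it.
strided_coords = ds.isel(t=slice(None, None, stride)).coords

print(strided_coords["t"].values)
print(strided_coords["label"].shape)
```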

p4perf4ce · 2023-09-17T20:52:47Z

My apologies for the very late reply. I had a large backlog until I saw this pop up in my mailbox.

> The isel in the end is supposed to remove the inserted NaNs again.

Thank you for digging this out. I was dumbfounded when looking at this particular line; I hadn't thought about the NaN case back then.

> For now the approach is to stride the original dataset and then extract the coords from there. Of course, this also strides the dataset variables, which are then not used, so it is unnecessary computation. However, this approach is already much faster and more memory-efficient than the previous approach.

I ended up with something similar, but slightly different, in my own internal repository. I found it more efficient and more practical to create a virtual rolling coordinate class and to access the data by asking that virtual coordinate for the coord at window i. That way a float64 dataset with 10+ dimensions (I'm dealing with biosignal datasets) won't blow up my memory, since only the coords grow. It is, however, too radical a change compared with both what I originally proposed and what you have done so far.

But it would turn your question of how to avoid striding over the coords again into operating only on the coords and leaving the rest as is.
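The internal implementation mentioned here is not shown in the thread, so the following is only a guess at the general shape of such a lookup. The class name, its fields and the choice to expose only complete windows (no NaN padding, unlike `construct`) are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VirtualRollingCoord:
    """Hypothetical lazy view of rolling windows over a 1-D coordinate."""

    values: Sequence
    window: int
    stride: int = 1

    def __len__(self) -> int:
        # Only complete windows are exposed in this sketch (no padding).
        span = len(self.values) - self.window
        return 0 if span < 0 else span // self.stride + 1

    def __getitem__(self, i: int) -> Sequence:
        if not 0 <= i < len(self):
            raise IndexError(i)
        # The window is sliced on demand; nothing is materialised up front.
        start = i * self.stride
        return self.values[start:start + self.window]


coord = VirtualRollingCoord(values=range(20), window=4, stride=2)
print(len(coord))                      # 9
print(list(coord[0]), list(coord[8]))  # [0, 1, 2, 3] [16, 17, 18, 19]
```

The trade-off is that windows are computed per access instead of being stored, which keeps memory flat at the cost of changing how callers retrieve a window.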

headtr1ck · 2023-09-18T11:54:27Z

Not sure I understand what you mean.

The current approach only temporarily strides the dataset, including its coords, and then extracts those coords.
Since it no longer aligns the result to the non-strided indexes, it should be much more memory-efficient.

Unfortunately the Coordinates class does not support indexing, so we have to do it at the dataset level. I think it should not add too much overhead because it is an index-based lookup.

The main difference between this approach and what you did is that it supports coordinates that have different dimensions than the data variables (see the new test).

headtr1ck · 2023-09-18T11:56:05Z

Actually, we could add a peakmem asv benchmark for this and see how much more memory-efficient it is.
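For context, asv treats methods whose names start with `peakmem_` as peak-memory benchmarks and calls `setup` before each one. A minimal sketch of such a benchmark follows; the class name, array size, window length and stride values are placeholders for illustration and are not taken from xarray's benchmark suite.

```python
import numpy as np
import xarray as xr


class RollingConstructPeakMem:
    """Hypothetical asv benchmark class; sizes and parameters are placeholders."""

    # asv runs each benchmark once per parameter value.
    params = [1, 5, 50]
    param_names = ["stride"]

    def setup(self, stride):
        nt = 1_000
        rng = np.random.default_rng(0)
        # Dataset with a dimension coordinate "t".
        self.ds = xr.Dataset(
            {"var": ("t", rng.standard_normal(nt))},
            coords={"t": np.arange(nt)},
        )

    def peakmem_rolling_construct(self, stride):
        # asv records the peak memory of the process while this method runs.
        self.ds.rolling(t=100).construct("w", stride=stride)
```

Using a dataset with a dimension coordinate keeps the result comparable before and after the change, since that case returned the right shape in both.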

headtr1ck · 2023-09-18T15:08:19Z

Ok, it went from 141MB to 196MB...
I guess my assumption was wrong.

Does anyone have any idea why?

p4perf4ce · 2023-09-18T16:57:10Z

> The main difference between this approach and what you did is that it supports coordinates that have different dimensions than the data variables (see the new test).

Sorry for the confusion. I meant that I ended up writing something on top of xarray while waiting for the PR review, since I needed a correct rolling-with-stride function, and that it used a different approach from the one in this PR.

After recollecting what I did in March and what you've done to fix my PR, it seems that we both ended up with a similar solution, apart from some minor caveats. I agree that your suggested change is already memory-efficient (and keeps the codebase simple to understand).

> Ok, it went from 141MB to 196MB...
> I guess my assumption was wrong.
> Does anyone have any idea why?

I think this is within expectations? The original behavior caused the result to fall short by a large margin (see my issue at #7021). Now that this PR fixes it, the number of result windows should be larger (and thus the memory footprint too) when benchmarking against the main branch.

Thank you for your helpful feedback!

headtr1ck · 2023-09-18T17:49:18Z

> I think this is within expectations? The original behavior caused the result to fall short by a large margin (see my issue at #7021). Now that this PR fixes it, the number of result windows should be larger (and thus the memory footprint too) when benchmarking against the main branch.

The behavior you reported was for Datasets without a dimension coordinate (a coordinate with the same name as the dimension). The benchmark uses a Dataset with one (otherwise I cannot compare the results correctly).

For such Datasets the behavior was correct before, because the created arrays were first strided, then extended to full length again (filling the missing data with NaNs), and then strided again (this was the isel at the end).

So I expected the extending-to-full part to consume more memory, but according to the benchmark it apparently does not.

Anyway I think this PR is a good addition because it fixes a bug, which is far more important than performance.

> Thank you for your helpful feedback!

You're welcome!

headtr1ck · 2023-09-19T18:15:35Z

I just increased the dimensions as well in CI and now we get:

| Change   | Before [b08a9d86]    | After [7359559b]    |   Ratio | Benchmark (Parameter)                                        |
|----------|----------------------|---------------------|---------|--------------------------------------------------------------|
| -        | 5.57G                | 204M                |    0.04 | rolling.DatasetRollingMemory.peakmem_1drolling_construct(5)  |
| -        | 5.59G                | 204M                |    0.04 | rolling.DatasetRollingMemory.peakmem_1drolling_construct(50) |

So it got much better? This seems strange...
But anyway, I think this is good to go.

p4perf4ce · 2023-09-19T19:25:38Z

> So it got much better? This seems strange...

Just to note: it was still almost impossible to run construct with stride>1 on a large dataset even with the first commit in this PR (which only fixed the behavior I wanted), because it ended up not creating a view but actually allocating memory for the rolling windows (according to memray). I therefore thought this was intended behavior, which led to my reply suggesting the virtual lookup I used elsewhere. My assumption is that something odd is going on in the extending-to-full part, since reduce doesn't suffer from the same issue. I might help dig into this later, but it doesn't seem to be an immediate issue.
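Whether an operation returns a view or allocates can be checked in plain NumPy with `np.shares_memory`. The sketch below only illustrates that distinction on a small array; it does not show what xarray's `construct` does internally.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(20.0)

# A sliding window over an existing array is a read-only view: no new buffer.
windows = sliding_window_view(data, window_shape=4)
print(windows.shape, np.shares_memory(data, windows))

# Striding the windows with basic slicing is still a view.
strided = windows[::2]
print(strided.shape, np.shares_memory(data, strided))

# Padding first (here with NaN) allocates a new array, so windows built
# on the padded array no longer share memory with the original.
padded = np.pad(data, (3, 0), constant_values=np.nan)
padded_windows = sliding_window_view(padded, window_shape=4)
print(padded_windows.shape, np.shares_memory(data, padded_windows))
```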

Anyway, somehow it ends up fixing both the behavior bug and the performance. Thank you, everyone.

headtr1ck · 2023-09-19T19:31:57Z

Thanks for starting this :) Your insight was helpful in figuring out what was going wrong.

Just a tip: do future PRs on a branch in your forked repo and not on the main branch. Since we squash commits, your history will diverge from xarray's main branch and you will have to force push. Doing this in a branch prevents that :)

Let's wait a day or two and then merge :)

mathause

LGTM!

welcome · 2023-09-20T19:31:51Z

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again!

p4perf4ce added 2 commits March 3, 2023 04:03

Removed .isel for consistent rolling behavior.

5f297a7

`.isel` causes `DatasetRolling.construct` to behave inconsistently with `DataArrayRolling.construct` when `stride` > 1.

Merge pull request #1 from p4perf4ce/Fixs-DatasetRolling.construct-…

fe85d80

…inconsistent-behavior Removed `.isel` for `DatasetRolling.construct` consistent rolling behavior.

github-actions bot added the topic-rolling label Mar 2, 2023

p4perf4ce changed the title ~~Removed .isel for DatasetRolling.construct consistent rolling behavior~~ Removed .isel for DatasetRolling.construct consistent rolling behavior Mar 2, 2023

headtr1ck added 3 commits September 15, 2023 22:26

Merge branch 'main' into papermain

e9da25c

new rolling construct strategy for coords

4030eff

add whats-new

be96f60

headtr1ck added the plan to merge Final call for comments label Sep 15, 2023

headtr1ck requested a review from mathause September 15, 2023 20:39

add new tests with different coords

f2d4d82

headtr1ck removed the plan to merge Final call for comments label Sep 16, 2023

next try on aligning strided coords

c2d7ca9

headtr1ck added 2 commits September 18, 2023 14:35

Merge branch 'main' into papermain

2120210

add peakmem test for rolling.construct

44a59ab

github-actions bot added the topic-performance label Sep 18, 2023

headtr1ck added the run-benchmark Run the ASV benchmark workflow label Sep 18, 2023

headtr1ck added needs review plan to merge Final call for comments labels Sep 18, 2023

dcherian reviewed Sep 19, 2023

xarray/core/rolling.py

dcherian removed the needs review label Sep 19, 2023

increase asv benchmark rolling sizes

6744b9c

Merge branch 'main' into main

e0533a6

mathause approved these changes Sep 19, 2023

headtr1ck merged commit 2b784f2 into pydata:main Sep 20, 2023
24 checks passed

keewis mentioned this pull request Sep 25, 2023

Release 2023.09.0 #8229

Merged

5 tasks
