Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removed .isel for DatasetRolling.construct consistent rolling behavior #7578

Merged
merged 11 commits into from
Sep 20, 2023

Conversation

p4perf4ce
Copy link
Contributor

@p4perf4ce p4perf4ce commented Mar 2, 2023

Dataset(...).isel(...) at the return caused DatasetRolling.construct behavior to be inconsistent with DataArrayRolling.construct when stride > 1 without any benefits.

The bug was reported in #7021

`.isel` causes `DatasetRolling.construct` to behavior to be inconsistent with `DataArrayRolling.construct` when `stride` > 1.
…inconsistent-behavior

Removed `.isel` for `DatasetRolling.construct` consistent rolling behavior.
@p4perf4ce p4perf4ce changed the title Removed .isel for DatasetRolling.construct consistent rolling behavior Removed .isel for DatasetRolling.construct consistent rolling behavior Mar 2, 2023
@headtr1ck
Copy link
Collaborator

Good spot. The fix seems good to me.
If you fix the tests and add a whats-new this is good to go

@headtr1ck
Copy link
Collaborator

headtr1ck commented Sep 15, 2023

Sorry for the late reply...
This PR is still not providing the correct solution.

For some reason, when you add a dimension-coordinate to the Dataset it behaves as expected (before) and now wrong (with this PR):

E.g. use these two test datasets:

ds1 = xr.Dataset({"x": ("t", range(20))})
ds2 = xr.Dataset({"x": ("t", range(20))}, {"t": range(20)})

print("Dataset rolling: ds1")
print(ds1.rolling(t=4).construct("w", stride=2).x.shape)  # wrong in main
print("DataArray rolling: ds1")
print(ds1.x.rolling(t=4).construct("w", stride=2).shape)
print("Dataset rolling: ds2")
print(ds2.rolling(t=4).construct("w", stride=2).x.shape)  # wrong in this PR
print("DataArray rolling: ds2")
print(ds2.x.rolling(t=4).construct("w", stride=2).shape)

The unit tests use datasets with an dimension-coordinate, therefore this error was never spotted.

@headtr1ck
Copy link
Collaborator

headtr1ck commented Sep 15, 2023

After digging a bit more, the problematic line is the return one:
Dataset(dataset, coords=self.obj.coords ...
Here, dataset is a dict of constructed DataArrays (they have the correct shape) which have the strided "t" index and self.obj.coords are the original coordinates which contain the unstrided "t" index.
Upon Dataset creation, these are getting aligned and NaNs are inserted into the constructed DataArrays.

The isel in the end is supposed to remove the inserted NaNs again.

So I think we have to find an intermediate solution and remove the isel + adopt what we pass to coords.

@headtr1ck headtr1ck added the plan to merge Final call for comments label Sep 15, 2023
@headtr1ck headtr1ck requested a review from mathause September 15, 2023 20:39
@headtr1ck headtr1ck removed the plan to merge Final call for comments label Sep 16, 2023
@headtr1ck
Copy link
Collaborator

Nvmd. I have added another test with more dimensions and 2D coordinates.
My approach does not work here :/

cannot reindex or align along dimension 't' because of conflicting dimension sizes: {10, 20} (note: an index is found along that dimension with size=10)

Anyone knows how to align here properly? Coords do not have an isel, otherwise one could simply apply the stride as well.

@headtr1ck
Copy link
Collaborator

For now the approach is to stride the original dataset and then extract the coords from there. Ofc, this strides the dataset variables which are then not used, so unnecessary computation. However this approach is already much faster and memory efficient than the previous approach.

@p4perf4ce
Copy link
Contributor Author

My apologies for very late reply. Got tons of backlog until seeing this popped up in my mailbox.

The isel in the end is supposed to remove the inserted NaNs again.

Thank you for digging this out. I was dumbfounded when looking at this particular line, haven't thought about NaNs case back then.

For now the approach is to stride the original dataset and then extract the coords from there. Ofc, this strides the dataset variables which are then not used, so unnecessary computation. However this approach is already much faster and memory efficient than the previous approach.

I've ended up with something similar but a little bit different in my own internal repository. I've found that it's a bit more efficient and more practical to just create a class of Virtual rolling coordinate then accessing the data by asking the virtual coordinate to provide me a coord at window i. So float64 dataset with 10++ dimensions (I'm dealing with biosignal datasets) won't explode my memory (only coords that grew), but it is a little bit too radical of a change for the current approach that both I proposed originally, and you have done so far.

But it would turn your question of how to not stride over coord again to just doing things only on coord and left the rest as is.

@headtr1ck
Copy link
Collaborator

Not sure I understand what you mean.

The current approach only temporarily strides the dataset including it's coords and then extracts those coords.
Since it does not align the result to the non-strided indexes anymore it should be much more memory efficient.

Unfortunately the Coordinates class does not support indexing, so we have to do it at the dataset level. I think it should not add too much overhead because it is index based lookup.

The main difference between this approach and what you did is that it supports coordinates that have different dimensions than the data variables (see the new test).

@headtr1ck
Copy link
Collaborator

Actually we could add a peakmem asv benchmark for this and see how much more.memory efficient it is.

@headtr1ck headtr1ck added the run-benchmark Run the ASV benchmark workflow label Sep 18, 2023
@headtr1ck
Copy link
Collaborator

Ok, it went from 141MB to 196MB...
I guess my assumption was wrong.

Does anyone have any idea why?

@p4perf4ce
Copy link
Contributor Author

p4perf4ce commented Sep 18, 2023

The main difference between this approach and what you did is that it supports coordinates that have different dimensions than the data variables (see the new test).

Sorry for the confusion, I meant that I ended up wrote something on-top of xarray while waiting for PR review since I was in need of a correct rolling with stride function. And used a different approach than I did here in this PR.

After recollecting myself what I did in March and what you've done to fix my PR. It seems that we both end up with a similar solution on this topic, excluding some minor caveats. I agree that your suggested change is already memory efficient (and still simple to understand the codebase).

Ok, it went from 141MB to 196MB...
I guess my assumption was wrong.
Does anyone have any idea why?

I think this is within the expectation? Because original behavior causes the result to fall short by a large margin (see my issue at #7021). Now that this PR fixing it, number of result windows should be larger (thus larger memory footprint) when running benchmark against the mainline branch.

Thank you for your helpful feedback!

@headtr1ck
Copy link
Collaborator

I think this is within the expectation? Because original behavior causes the result to fall short by a large margin (see my issue at #7021). Now that this PR fixing it, number of result windows should be larger (thus larger memory footprint) when running benchmark against the mainline branch.

Your behavior was for Datasets without an dimension coordinate (a coordinate that is called the same as the dimension), the benchmark uses a Dataset with (otherwise I cannot compare the results correctly).

This behavior was correct before, because what happened is that the created arrays were first strided, then extended to full again filling the missing data with NaNs and then strided again (this was the isel in the end).

So I expect the extending to full part to consume more memory, but seeing in the benchmark it does apparently not.

Anyway I think this PR is a good addition because it fixes a bug, which is far more important than performance.

Thank you for your helpful feedback!

You're welcome!

@headtr1ck headtr1ck added needs review plan to merge Final call for comments labels Sep 18, 2023
@headtr1ck
Copy link
Collaborator

I just increased the dimensions as well in CI and now we get:

| Change   | Before [b08a9d86]    | After [7359559b]    |   Ratio | Benchmark (Parameter)                                        |
+ grep 'Traceback \|failed\|PERFORMANCE DECREASED' benchmarks.log
|----------|----------------------|---------------------|---------|--------------------------------------------------------------|
| -        | 5.57G                | 204M                |    0.04 | rolling.DatasetRollingMemory.peakmem_1drolling_construct(5)  |
| -        | 5.59G                | 204M                |    0.04 | rolling.DatasetRollingMemory.peakmem_1drolling_construct(50) |

So it got much better? This seems strange...
But anyway, I think this is good to go.

@p4perf4ce
Copy link
Contributor Author

So it got much better? This seems strange...

Just to note. It still almost impossible to run construct with stride>1 on a large dataset even with the first commit in this PR (only fix my wanted behavior) because it ended up not creating a view but actually allocating memory (according to memray) for rolling windows, so I thought that this was an intended behavior (leading to my reply suggesting some virtual lookup things I used elsewhere). My assumption is something weird is going on when extending to full part since reduce doesn't suffer the same issue. I might help digging down on this later but doesn't seem to be an immediate issue.

Anyway, somehow it ends up fixing both behavior bug and performance. Thank you everyone.

@headtr1ck
Copy link
Collaborator

Thanks for starting this :) Your insight was helpful in figuring out what was going wrong.

Just a tip: Do future PRs on a branch in your forked repo and not the main branch. Since we do squash commits, your history will be divergent to xarrays main branch and you will have to force push. Doing this in a branch prevents this :)

Lets wait a day or two and then merge :)

Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@headtr1ck headtr1ck merged commit 2b784f2 into pydata:main Sep 20, 2023
24 checks passed
@welcome
Copy link

welcome bot commented Sep 20, 2023

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif

@keewis keewis mentioned this pull request Sep 25, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments run-benchmark Run the ASV benchmark workflow topic-performance topic-rolling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistent behavior between DatasetRolling.construct and DataArrayRolling.construct with stride > 1.
4 participants