
Optimize ffill, bfill with dask when limit is specified #9771

Merged

Conversation

josephnowak
Contributor

@josephnowak josephnowak commented Nov 12, 2024

This improvement comes from the discussion on PR #9712

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@josephnowak josephnowak changed the title Feature/reduce the number of tasks on the push method when the limit parameter is used Reduce the number of tasks on the push method when the limit parameter is used Nov 12, 2024
@josephnowak
Contributor Author

The tests that are failing look unrelated to the changes in this PR 🤔

@josephnowak
Contributor Author

josephnowak commented Nov 12, 2024

@dcherian do you think it's a good idea to add a keepdims parameter to the first and last functions implemented in the duck_array_ops.py file? If I add that, I could use them directly in the cumreduction with the Blelloch method; if not, I can just create a dummy wrapper like the one I did for the push method, to receive a type parameter that is not used.
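
Something like this rough sketch is what I have in mind (illustrative only; the function name is made up and the actual first/last signatures in duck_array_ops.py may differ):

    import numpy as np

    def _first_keepdims(values, axis, keepdims=False):
        # Hypothetical wrapper: take the first element along `axis`
        # (NaN handling omitted for brevity), optionally keeping the
        # reduced dimension so the result has the shape that the
        # cumreduction (Blelloch) scan expects.
        result = np.take(values, 0, axis=axis)
        if keepdims:
            result = np.expand_dims(result, axis=axis)
        return result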

@dcherian
Contributor

Yes, that sounds good to me.

@dcherian dcherian changed the title Reduce the number of tasks on the push method when the limit parameter is used Optimize ffill, bfill with dask when limit is specified Nov 12, 2024
@dcherian
Contributor

Great! Can you add a note to whats-new please? This is worth advertising.

@dcherian dcherian added the plan to merge Final call for comments label Nov 12, 2024
@josephnowak
Contributor Author

josephnowak commented Nov 13, 2024

Hi @dcherian

I think we could add this special case to try to improve performance when the limit is equal to or smaller than the chunk size along the axis (only one chunk would overlap). Is there any kind of benchmark I can use to see whether there is any real benefit from using map_overlap over cumsum when n is small?

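    # Proposed special case: when the limit fits within the chunk size
    # along the axis, only one neighboring chunk should need to overlap.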
    if n is not None and n <= array.chunksize[axis]:
        return array.map_overlap(_push, depth={axis: (n, 0)}, n=n, axis=axis)

@dcherian
Contributor

dcherian commented Nov 14, 2024

I don't have one at hand. @phofl might have suggestions.

Note that chunks can be non-uniform, so you'll have to check every element of that array.chunks[axis] tuple.
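
For example, something along these lines (a sketch; the helper name is made up):

    def _limit_fits_in_chunks(array, n, axis):
        # Chunks along an axis can be non-uniform, so compare `n` against
        # every chunk size in the tuple, not just the largest one.
        return n is not None and all(n <= c for c in array.chunks[axis])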

@phofl
Contributor

phofl commented Nov 14, 2024

Yes!

The highlevel_api function in https://github.com/coiled/benchmarks/blob/main/tests/geospatial/workloads/climatology.py

@josephnowak
Contributor Author

Thanks, I will try to run those benchmarks before modifying this PR, but from my understanding it should generate fewer tasks using map_overlap with a small N.

@dcherian
Contributor

Yes, I believe that's why @phofl proposed that optimization. We just need to make sure to trigger it in the right circumstances. Please add a test for non-uniform chunk sizes where map_overlap would give you the wrong answer.

@josephnowak
Contributor Author

josephnowak commented Nov 14, 2024

After understanding more of the code that @phofl wrote, I think the implementation works in the general case; the main issue is that it can load the whole array into memory when N is equal to the size of the array (based on my understanding of how map_overlap works). When I saw the PR I thought it was the same implementation as before, but what he did was define the overlap depth based on the limit, while the previous algorithm (the one before the cumreduction implementation) first applied a map_blocks with the push function and later a map_overlap with a single element of overlap instead of N, which caused problems when one or more chunks were completely empty. In summary, I think the best option is to use his algorithm when N is smaller than the chunk size along the axis, so we can avoid loading more than two chunks at the same time.
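
Roughly, the combined approach could look like this (a sketch; `_push` and `_push_via_scan` are placeholders for the actual helpers):

    def push(array, n, axis):
        # If the limit fits inside every chunk, at most one neighboring
        # chunk overlaps, so map_overlap stays memory efficient.
        if n is not None and all(n <= c for c in array.chunks[axis]):
            return array.map_overlap(_push, depth={axis: (n, 0)}, n=n, axis=axis)
        # Otherwise fall back to the associative scan (cumreduction with
        # the Blelloch method), which avoids loading the whole array at once.
        return _push_via_scan(array, n=n, axis=axis)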

@dcherian
Contributor

Doesn't `(n <= array.chunks[axis]).all()` handle that case? It wouldn't trigger if n is greater than the chunksize, so you'd never load the whole thing into memory at once?

In any case, how do you feel about merging this and optimizing later? I believe this is already a massive improvement.

@josephnowak
Contributor Author

Yes, that condition would prevent loading all the chunks; I just wanted to mention that using that algorithm in the general case would be problematic because it would load all the chunks at once, even though it would give the expected result in the end.

I'm okay with merging it without the special-case optimization and adding it in a follow-up PR, but I'm also okay with adding the special case in this PR. I'll leave the final decision to you; I already have it implemented locally.

@dcherian
Contributor

Ah, then why don't you push what you have?

@phofl
Contributor

phofl commented Nov 14, 2024

> Doesn't `(n <= array.chunks[axis]).all()` handle that case? It wouldn't trigger if n is greater than the chunksize, so you'd never load the whole thing into memory at once?

Yes, this is what I wanted to do as well (I got side-tracked over the last few days, unfortunately). That is memory efficient and helps in some cases.

@josephnowak
Contributor Author

I didn't push it because the PR had the "plan to merge" label, so I didn't want to modify it without your approval.

And sorry, @phofl, for not getting the idea behind your original algorithm the first time I saw your PR; the good part is that the algorithm now behaves better in all scenarios.

I'm not sure why the tests are not running. Do the conflicts need to be resolved first for the tests to run?

@dcherian dcherian merged commit 4c8b03b into pydata:main Nov 14, 2024
28 of 29 checks passed
@dcherian
Contributor

Nice work @josephnowak and @phofl

dcherian added a commit to dcherian/xarray that referenced this pull request Nov 16, 2024
* main:
  Add download stats badges (pydata#9786)
  Fix open_mfdataset for list of fsspec files (pydata#9785)
  add 'User-Agent'-header to pooch.retrieve (pydata#9782)
  Optimize `ffill`, `bfill` with dask when `limit` is specified (pydata#9771)
dcherian added a commit that referenced this pull request Nov 19, 2024
* main: (24 commits)
  Bump minimum versions (#9796)
  Namespace-aware `xarray.ufuncs` (#9776)
  Add prettier and pygrep hooks to pre-commit hooks (#9644)
  `rolling.construct`: Add `sliding_window_kwargs` to pipe arguments down to `sliding_window_view` (#9720)
  Bump codecov/codecov-action from 4.6.0 to 5.0.2 in the actions group (#9793)
  Buffer types (#9787)
  Add download stats badges (#9786)
  Fix open_mfdataset for list of fsspec files (#9785)
  add 'User-Agent'-header to pooch.retrieve (#9782)
  Optimize `ffill`, `bfill` with dask when `limit` is specified (#9771)
  fix cf decoding of grid_mapping (#9765)
  Allow wrapping `np.ndarray` subclasses (#9760)
  Optimize polyfit (#9766)
  Use `map_overlap` for rolling reductions with Dask (#9770)
  fix html repr indexes section (#9768)
  Bump pypa/gh-action-pypi-publish from 1.11.0 to 1.12.2 in the actions group (#9763)
  unpin array-api-strict, as issues are resolved upstream (#9762)
  rewrite the `min_deps_check` script (#9754)
  CI runs ruff instead of pep8speaks (#9759)
  Specify copyright holders in main license file (#9756)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Nov 19, 2024
* main:
  Bump minimum versions (pydata#9796)
  Namespace-aware `xarray.ufuncs` (pydata#9776)
  Add prettier and pygrep hooks to pre-commit hooks (pydata#9644)
  `rolling.construct`: Add `sliding_window_kwargs` to pipe arguments down to `sliding_window_view` (pydata#9720)
  Bump codecov/codecov-action from 4.6.0 to 5.0.2 in the actions group (pydata#9793)
  Buffer types (pydata#9787)
  Add download stats badges (pydata#9786)
  Fix open_mfdataset for list of fsspec files (pydata#9785)
  add 'User-Agent'-header to pooch.retrieve (pydata#9782)
  Optimize `ffill`, `bfill` with dask when `limit` is specified (pydata#9771)
  fix cf decoding of grid_mapping (pydata#9765)
  Allow wrapping `np.ndarray` subclasses (pydata#9760)
  Optimize polyfit (pydata#9766)
  Use `map_overlap` for rolling reductions with Dask (pydata#9770)
Labels: plan to merge (Final call for comments), topic-dask