
Partial rechunks within P2P #8330

Merged · 40 commits merged into dask:main from local-rechunks on Dec 20, 2023

Conversation

@hendrikmakait (Member) commented Nov 7, 2023

Closes #8326
Blocked by #8207

This PR currently splits off entirely independent subsets of the rechunking. A more intricate (yet better) implementation would create independent shuffles even for cases where an input chunk feeds into more than one shuffle.

This PR splits outputs wherever that allows us to cull more inputs, i.e., two outputs along an axis are placed in separate partials if they do not end in the same input chunk.
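A minimal sketch of the intended culling behaviour, assuming a distributed cluster is running and that the dask version in use supports the `method="p2p"` keyword for rechunking (shapes and chunk sizes are made up for illustration):

```python
import dask.array as da

# 250-row input chunks rechunked into 500-row output chunks: each output
# chunk only depends on two adjacent input chunks, so the rechunk can be
# decomposed into independent partial shuffles.
x = da.random.random((1000, 1000), chunks=(250, 1000))
y = x.rechunk((500, 1000), method="p2p")

# Selecting only the first output chunk should now cull the input chunks
# and partial shuffles that feed the remaining output chunks.
subset = y[:500]
```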

  • Tests added / passed
  • Passes pre-commit run --all-files

github-actions bot (Contributor) commented Nov 7, 2023

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0   27 suites ±0   11h 57m 44s ⏱️ +21m 4s
3 950 tests +2   3 838 ✔️ +4   109 💤 −1   3 ❌ −1
49 683 runs +42   47 392 ✔️ +64   2 288 💤 −19   3 ❌ −3

For more details on these failures, see this check.

Results for commit 1f554ef. ± Comparison against base commit 9273186.

This pull request removes 5 and adds 7 tests. Note that renamed tests count towards both.

Removed:
distributed.shuffle.tests.test_rechunk ‑ test_rechunk_with_single_output_chunk_raises
distributed.shuffle.tests.test_rechunk ‑ test_worker_for_homogeneous_distribution[1]
distributed.shuffle.tests.test_rechunk ‑ test_worker_for_homogeneous_distribution[2]
distributed.shuffle.tests.test_rechunk ‑ test_worker_for_homogeneous_distribution[41]
distributed.shuffle.tests.test_rechunk ‑ test_worker_for_homogeneous_distribution[50]

Added:
distributed.shuffle.tests.test_rechunk ‑ test_cull_p2p_rechunk_independent_partitions
distributed.shuffle.tests.test_rechunk ‑ test_cull_p2p_rechunk_overlapping_partitions
distributed.shuffle.tests.test_rechunk ‑ test_partial_rechunk_homogeneous_distribution
distributed.shuffle.tests.test_rechunk ‑ test_pick_worker_homogeneous_distribution[1]
distributed.shuffle.tests.test_rechunk ‑ test_pick_worker_homogeneous_distribution[2]
distributed.shuffle.tests.test_rechunk ‑ test_pick_worker_homogeneous_distribution[41]
distributed.shuffle.tests.test_rechunk ‑ test_pick_worker_homogeneous_distribution[50]

♻️ This comment has been updated with latest results.

@hendrikmakait hendrikmakait marked this pull request as ready for review November 24, 2023 12:21
@hendrikmakait hendrikmakait marked this pull request as draft November 24, 2023 12:50
@hendrikmakait hendrikmakait marked this pull request as ready for review November 28, 2023 09:55
@hendrikmakait hendrikmakait changed the title [WIP] Partial rechunks within P2P Partial rechunks within P2P Nov 28, 2023
@hendrikmakait (Member Author)

There's some work we still compute for the individual n-dimensional partials (scaling as a product across axes) that could instead be computed per axis (scaling as a sum). I'll take another look at some quick refactoring here, but this might very well be premature optimization.

crusaderky added a commit to crusaderky/distributed that referenced this pull request Dec 18, 2023
@hendrikmakait (Member Author)

@crusaderky: Looking at my old benchmark results, you are in fact reproducing them. However, test_tiles_to_rows has significantly improved in performance since then:

[Screenshots: benchmark comparison, 2023-12-18]

@hendrikmakait (Member Author)

Searching through the commit history, this is likely because #8207 has been merged, which this branch also included at the time.

@crusaderky (Collaborator) commented Dec 18, 2023

The side-by-side dashboard makes it pretty obvious why there's such a dramatic slowdown.

test_adjacent_groups[0.3-128 MiB-p2p-disk], 180 input/output chunks.
This PR picks only even-numbered workers for the output chunks! (worker 0 is always at the bottom)

[Video: peek.mp4 — side-by-side dashboard recording]

@hendrikmakait (Member Author)

Quick summary of an offline discussion I had with @crusaderky: In the example above, each partial rechunk has five output tasks (180 inputs / 36 barriers). As a result, the deterministic worker assignment logic will always pick the same five workers to store those outputs, which underutilizes the cluster. We should be able to assess the impact of better worker assignment by randomizing it.
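A hypothetical illustration of the effect described above (this is not the actual assignment code in distributed; the worker count and helper are made up): with static range partitioning of five outputs over, say, ten workers, every partial lands on the same five workers.

```python
def pick_worker(output_index: int, n_outputs: int, workers: list[str]) -> str:
    # Static range partitioning: map output i onto the sorted worker list.
    return workers[len(workers) * output_index // n_outputs]

workers = [f"worker-{i}" for i in range(10)]
# Every partial rechunk with 5 outputs picks the same 5 (every other) workers,
# leaving half of the cluster idle when storing outputs.
print([pick_worker(i, 5, workers) for i in range(5)])
# ['worker-0', 'worker-2', 'worker-4', 'worker-6', 'worker-8']
```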

@crusaderky (Collaborator)

Adding a naive randomization of the workers (d9116fb) fixes most problems:

[Screenshots: benchmark results after worker randomization]

@hendrikmakait (Member Author)

Regarding the regression of test_swap_axes: I've seen that happen before when running benchmarks for this PR. In the past, it disappeared when I ran that test in isolation. I'm not sure what's going on exactly.

@crusaderky (Collaborator)

Regarding the regression of test_swap_axes: I've seen that happen before when running benchmarks for this PR. In the past, it disappeared when I ran that test in isolation. I'm not sure what's going on exactly.

Reran it in isolation. You're right; these results don't make sense to me.

[Screenshot: test_swap_axes benchmark rerun in isolation]

@crusaderky (Collaborator)

Waiting for an alternative suggestion to the worker randomization from @fjetter, as discussed offline.
This PR is otherwise green to go for me.

@hendrikmakait hendrikmakait marked this pull request as ready for review December 19, 2023 16:05
@hendrikmakait (Member Author)

I think what @fjetter and I both have in mind would look like this: hendrikmakait#14

FWIW, I wouldn't block on that; a follow-up with another benchmark run is trivial.

@crusaderky (Collaborator)

Running a final A/B test. Results out tomorrow morning.

@crusaderky (Collaborator)

Looks good. Comment is the same as in the previous posts.

[Screenshots: final A/B test results]

@fjetter (Member) commented Dec 20, 2023

I think what @fjetter and I both have in mind would look like this: hendrikmakait#14

Essentially, yes. I wouldn't have shifted by one but by len(workers), but the gist is the same. I think shifting by len(workers) generates more homogeneous distributions for small numbers of shuffles.

@hendrikmakait (Member Author) commented Dec 20, 2023

Essentially, yes. I wouldn't have shifted by one but by len(workers), but the gist is the same. I think shifting by len(workers) generates more homogeneous distributions for small numbers of shuffles.

I'm confused; wouldn't shifting by len(workers) have no effect whatsoever in a cluster without fluctuating workers?

The problem the shift solves is that static range partitioning skips some workers if len(partitions) < len(workers) (e.g., it only picks every other worker if len(partitions) / len(workers) = 0.5). By shifting by one, the assignment should be as homogeneous as possible for any range-partitioned pattern as long as len(partitions) remains constant. For non-constant len(partitions), I can't come up with anything better unless we start counting assignments per worker.
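A hedged sketch of the shift-by-one idea, reusing the hypothetical range-partitioning helper from above; the explicit partial_index argument is an assumption for illustration, not how the actual code identifies partials:

```python
def pick_worker_shifted(
    output_index: int, n_outputs: int, partial_index: int, workers: list[str]
) -> str:
    # Same static range partitioning as before, but rotate the worker list by
    # one position per partial rechunk so that successive small partials spread
    # over the whole cluster instead of reusing the same workers.
    i = len(workers) * output_index // n_outputs
    return workers[(i + partial_index) % len(workers)]

workers = [f"worker-{i}" for i in range(10)]
for partial in range(3):
    print(partial, [pick_worker_shifted(o, 5, partial, workers) for o in range(5)])
# partial 0 -> workers 0, 2, 4, 6, 8
# partial 1 -> workers 1, 3, 5, 7, 9
# partial 2 -> workers 2, 4, 6, 8, 0
```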

@hendrikmakait (Member Author) commented Dec 20, 2023

Note that we could potentially investigate switching the assignment to round-robin and then shifting by len(partitions) % len(workers). IIRC, the lack of buffering in our disk buffers currently prohibits that, though.
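For comparison, a hypothetical sketch of the round-robin variant described here (again illustrative only, with made-up names and worker counts; as noted above, the current disk-buffering behaviour would prohibit it):

```python
def pick_worker_round_robin(output_index: int, shift: int, workers: list[str]) -> str:
    # Assign outputs to consecutive workers, starting at a per-partial shift.
    return workers[(output_index + shift) % len(workers)]

workers = [f"worker-{i}" for i in range(10)]
n_partitions, shift = 5, 0
for partial in range(3):
    print(partial, [pick_worker_round_robin(o, shift, workers) for o in range(n_partitions)])
    # Advance the shift by len(partitions) % len(workers) for the next partial.
    shift = (shift + n_partitions) % len(workers)
# partial 0 -> workers 0..4, partial 1 -> workers 5..9, partial 2 -> 0..4 again
```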

@hendrikmakait hendrikmakait merged commit e6c7d66 into dask:main Dec 20, 2023
28 of 34 checks passed
@hendrikmakait hendrikmakait deleted the local-rechunks branch December 20, 2023 14:38
@crusaderky (Collaborator)

🥳

@hendrikmakait hendrikmakait mentioned this pull request Jan 12, 2024
Closes issue: P2P Rechunking graph prohibits culling and partial release