[DNM] P2P shuffle skeleton - scheduler plugin #5524
Conversation
Force-pushed from 1df7a61 to b56d15d.
Beyond whether or not we want to use the scheduler plugin, I like the code in this more than #5520. Overall the style is a little simpler, which is a nicer starting point to build other things off of. #5520 feels slightly over-engineered in comparison, with the `ShuffleExtension` containing `Shuffle`s containing `ShuffleMetadata`s.
The problem with sibling shuffles noted at the end of `test_graph.py` also will probably require more complex task-traversing logic on the scheduler to pick the right worker per shuffle. That makes me think we'll have a scheduler plugin regardless, whether it offers an RPC or sets the restrictions itself.
), f"Removed {id}, which still has data for output partitions {list(data)}" | ||
|
||
async def get_shuffle(self, id: ShuffleId) -> ShuffleState: | ||
"Get the `ShuffleState`, blocking until it's been received from the scheduler." |
@fjetter mentioned that this whole method and `waiting_for_metadata` may be unnecessary. Since messages from scheduler to workers remain ordered in `BatchedSend` (and TCP preserves ordering), we can probably count on the `shuffle_init` hitting the worker before the `add_partition` does, so long as we trust the transition logic of our plugin.
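A minimal sketch of what that simplification could look like, assuming a hypothetical `self.shuffles` dict on the worker extension holding the received `ShuffleState`s (the attribute name is made up):

```python
def get_shuffle(self, id: ShuffleId) -> ShuffleState:
    # If scheduler->worker messages are ordered, `shuffle_init` must have
    # arrived before any `add_partition` for the same shuffle, so a plain
    # lookup suffices; a missing entry would indicate a plugin bug.
    try:
        return self.shuffles[id]  # hypothetical attribute name
    except KeyError:
        raise RuntimeError(f"shuffle_init for {id!r} never arrived") from None
```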
        del self.output_keys[k]

def transition(self, key: str, start: str, finish: str, *args, **kwargs):
    "Watch transitions for keys we care about"
Having this run for every key is not ideal. I've tried to make it return as fast as possible for irrelevant keys.
    # FIXME this feels very hacky/brittle.
    # For example, after `df.set_index(...).to_delayed()`, you could create
    # keys that don't have indices in them, and get fused (because they should!).
    m = re.match(r"\(.+, (\d+)\)$", key)
This is my least favorite part of the implementation. Keys are supposed to be opaque to the scheduler (as far as I understand); we're inferring a lot of meaning from them.
Parsing the IDs out for `transfer`/`barrier` is okay, because we control those key names when we generate the graph. The downstream tasks could theoretically be named anything though.
- perf nitpick:

      pattern = re.compile("<pattern>")  # global var

      def foo(...):
          pattern.match(key)

- I might want to see a unit test for this. I generally don't trust regular expressions, even though this one looks straightforward... (see the sketch below)
- I don't feel comfortable with using such logic at all. I don't think we should parse keys to infer logic. We do control them, but this is not a well-defined API and it is very hard to control whether or not keys are mutated at some point in time. The fusing you mentioned is the best example. Any kind of optimization or reordering would have the potential to break this contract easily, and I think such systems should be disentangled.
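A minimal pytest sketch of the kind of unit test asked for above, assuming the regex stays exactly as written in the diff (the key strings are made up for illustration):

```python
import re

KEY_PATTERN = re.compile(r"\(.+, (\d+)\)$")  # same pattern as in the diff above


def test_matches_stringified_tuple_keys():
    m = KEY_PATTERN.match("('unpack-abc123', 3)")
    assert m is not None
    assert m.group(1) == "3"


def test_rejects_keys_without_an_output_index():
    # e.g. keys produced after `.to_delayed()` + fusion, as the FIXME warns
    assert KEY_PATTERN.match("finalize-abc123") is None
    assert KEY_PATTERN.match("('unpack-abc123', last)") is None
```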
- I figured `re` caching would handle it, but sure
- Agreed, a test would be good
- I don't like this logic at all. As noted though, I can't think of any other options right now to have shuffling still work in the face of output task fusion, since HLG annotations are somewhat broken.
It's also important to understand that this key-parsing isn't here because this PR uses a scheduler plugin and #5520 doesn't.
This parsing is here because this PR supports Blockwise fusion of output tasks, and the other doesn't. Whether or not we used a scheduler plugin, the only way I can come up with right now to support fusion of output tasks is to parse key names.
Annotations are the ideal, proper way to deal with this. But if you use annotations, then fusion won't happen, so it defeats the purpose. Therefore, if there's no way to attach additional information to the graph (and there's no way to embed information in the tasks themselves and have it transmitted at runtime to the scheduler, because by the time the task is running, it's already too late), the only information you have is the key names.
    prefix, group, id = parts

    if prefix == TASK_PREFIX:
        if start == "waiting" and finish in ("processing", "memory"):
Suggested change:

-        if start == "waiting" and finish in ("processing", "memory"):
+        if start == "waiting" and finish == "processing":
It should be impossible for the `transfer` task to go to `memory` before we've taken some action, since that task will block on us (the plugin) broadcasting shuffle info to all workers. And if the barrier task is going to memory, that's bad news, because the dependents are about to run and we need to set restrictions on them.
I was just concerned that `transition_waiting_memory` exists. I'm not sure what triggers that case.
If in doubt, I prefer raising in these situations. If it is a valid case, we should know about it. Either way, an explicit exception protects us from corrupted state
All pass with dask/dask#8392. Rather crude; needs unit testing.
Surprisingly, blockwise decides to merge the two output layers. This really throws things off. The test passes right now by disabling an aggressive assertion, but we need more robust validation here.
Whenever I forget to switch to dask#5520, the errors are confusing.
See dask#5524 (comment). Since messages from scheduler to workers remain ordered in `BatchedSend` (and TCP preserves ordering), we should be able to count on the `shuffle_init` always hitting the worker before the `add_partition` does, so long as we trust the transition logic of our plugin.
Force-pushed from b56d15d to 04833a3.
I think hooking into the transition engine is fine. I'm not overly concerned about this but I'm not convinced about the other solutions/problems introduced here. Most notably how the worker restrictions are computed.
One different way coming to mind: instead of worker restrictions, we could use resource restrictions. We could define every output partition as a unique resource. The shuffle init would then assign an output partition resource to every worker. The scheduler heuristics would then take care of the rest.
class ShuffleScheduler:
    def transfer(self, id, key):
        # Shuffle init part
        shuffle_ix = 0
        while shuffle_ix < npartition_out:
            for w in self.scheduler.workers:
                scheduler.add_resources(
                    worker=w,
                    resources={f"shuffle-{shuffle_ix}": 1},
                )
                shuffle_ix += 1
                if shuffle_ix == npartition_out:
                    break
This would require us to encode this resource during graph construction, but the total number of output partitions is known during graph construction, so I don't see a conceptual problem, maybe a technical one, though. I'm not sure if we can assign resources in the blockwise layer. If not, adding this feature to blockwise or dropping blockwise might be an option (is this actually a blockwise operation or are we just abusing this for some optimization hack??).
Either way, arguing about resources and arguing about out-of-band output partitions feels semantically well aligned. Even when talking about task fusion, I would argue that whatever semantics hold for resources should hold for the unpack tasks as well.
thoughts?
    # TODO if these checks fail, we need to error the task!
    # Otherwise it'll still run, and maybe even succeed, but just produce wrong data?

    dts._worker_restrictions = restrictions
There is a scheduler API for this, isn't there?
There is, but it's a little bit more overhead when we already have the TaskState here (distributed/scheduler.py, lines 7168 to 7174 in d0b40d3):
def set_restrictions(self, comm=None, worker=None):
    ts: TaskState
    for key, restrictions in worker.items():
        ts = self.tasks[key]
        if isinstance(restrictions, str):
            restrictions = {restrictions}
        ts._worker_restrictions = set(restrictions)
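For comparison, a hypothetical helper showing the two options side by side, using the names from this PR (`dts` is the dependent `TaskState`, `restrictions` the set of worker addresses):

```python
from distributed.scheduler import Scheduler, TaskState


def restrict_output_task(scheduler: Scheduler, dts: TaskState, restrictions: set) -> None:
    # Option A (what the diff above does): mutate the TaskState we already hold.
    dts._worker_restrictions = restrictions
    # Option B: call the handler quoted above, which re-resolves the key
    # through `scheduler.tasks` and normalizes a bare str into a set, which is
    # a bit of extra overhead when we already have the TaskState in hand.
    # scheduler.set_restrictions(worker={dts.key: restrictions})
```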
class ShuffleSchedulerPlugin(SchedulerPlugin):
    name: ClassVar[str] = "ShuffleSchedulerPlugin"
    output_keys: dict[str, ShuffleId]
Shouldn't this rather be something like `dict[ShuffleId, Set[str]]`?
See distributed/shuffle/shuffle_scheduler.py, lines 175 to 184 in 04833a3:
# Task completed
if start in ("waiting", "processing") and finish in (
    "memory",
    "released",
    "erred",
):
    try:
        id = self.output_keys[key]
    except KeyError:
        return
This lets us check which shuffle (if any) a given key is a part of.
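For illustration, the shape that lookup assumes is a plain reverse mapping from each output key to its shuffle (the key strings and ID below are made up):

```python
output_keys = {
    "('unpack-abc123', 0)": "abc123",  # output key -> ShuffleId it belongs to
    "('unpack-abc123', 1)": "abc123",
}
# In `transition`, one dict lookup answers "is this key part of a shuffle,
# and if so, which one?" without re-parsing the key.
shuffle_id = output_keys.get("('unpack-abc123', 1)")
```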
Also note that this whole thing isn't actually necessary for the proper operation of a shuffle:
distributed/shuffle/shuffle_scheduler.py, lines 114 to 120 in 04833a3:
# Check if all output keys are done
# NOTE: we don't actually need this `unpack` step or tracking output keys;
# we could just delete the state in `barrier`.
# But we do it so we can detect duplicate shuffles, where a `transfer` task
# tries to reuse a shuffle ID that we're unpacking.
# (It does also allow us to clean up worker restrictions on error)
I added it just for sanity checking and validation right now.
    return addr


def parse_key(key: str) -> list[str] | None:
Maybe premature optimization, but I guess this should have at least an LRU cache.
I like that idea, but this is going to run on every single transition, so I think that cache would get blown out. We could implement an internal LRU cache just for positives I guess.
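A rough sketch of what a positives-only cache could look like, wrapping the `parse_key` helper from this PR (the class name and size bound are made up):

```python
from collections import OrderedDict


class PositiveParseCache:
    """LRU cache that only stores successful parses, so the flood of
    unrelated keys on a busy scheduler can't evict the shuffle keys."""

    def __init__(self, maxsize: int = 4096) -> None:
        self.maxsize = maxsize
        self._cache: "OrderedDict[str, list[str]]" = OrderedDict()

    def __call__(self, key: str) -> "list[str] | None":
        try:
            parts = self._cache[key]
        except KeyError:
            parts = parse_key(key)  # the un-cached helper defined in this PR
            if parts is None:
                return None  # don't cache misses
            self._cache[key] = parts
            if len(self._cache) > self.maxsize:
                self._cache.popitem(last=False)
        else:
            self._cache.move_to_end(key)
        return parts
```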
def transition(self, key: str, start: str, finish: str, *args, **kwargs):
    "Watch transitions for keys we care about"
    parts = parse_key(key)
I don't really like us parsing the key to do this logic. I would likely prefer us using a task attribute or similar
I would rather even consider us adapting how tasks are submitted to the scheduler. I'm not too familiar with HLG, but shouldn't it be somehow "easy" to tell the scheduler what keys should be considered for this? Isn't there some way during unpacking to let the scheduler know that the keys are "special", such that we just keep a set of "shuffle keys" on board instead of parsing them?
Annotations seem like the mechanism for this. Unfortunately there are two blockers to using them:
- Graph optimization loses annotations (dask#7036)
- Blockwise fusion does not fuse across layers with different annotations. We need blockwise fusion to happen on the output tasks to prevent #5223 (workers run twice as many root tasks as they should, causing memory pressure). Maybe we'd need some meta-annotation about whether an annotation can be fused?
So yes, it should be easy, but it will require a bit of fixing and changing blockwise for it to actually be easy.
I'm also thinking about getting rid of this transition-watching logic entirely, and having `transfer` tasks call an RPC on the scheduler to register themselves and get the list of peer workers (which would then also cause the scheduler to broadcast a message to all workers telling them the shuffle has started). This would eliminate the need for `parse_key` entirely. The only keys we'd have to parse are the output keys (for reasons mentioned in https://github.com/dask/distributed/pull/5524/files#r765203083). That would be more of a hybrid approach with #5520.
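A rough sketch of that hybrid, with made-up handler and message names (`shuffle_register`, `shuffle_init`), just to show where the pieces would live:

```python
from distributed import get_worker
from distributed.diagnostics.plugin import SchedulerPlugin


class ShuffleRPCPlugin(SchedulerPlugin):
    # Hypothetical plugin: no transition-watching, just an RPC endpoint.
    def __init__(self, scheduler):
        self.scheduler = scheduler
        scheduler.handlers["shuffle_register"] = self.shuffle_register

    async def shuffle_register(self, comm=None, id=None):
        workers = list(self.scheduler.workers)
        # Tell every worker the shuffle has started (the worker extension
        # would handle the "shuffle_init" message)...
        await self.scheduler.broadcast(
            msg={"op": "shuffle_init", "id": id, "workers": workers}
        )
        # ...and hand the peer list back to the calling `transfer` task.
        return workers


async def register_from_transfer_task(id):
    # Inside a `transfer` task: `worker.scheduler` is an rpc to the scheduler,
    # so attribute access dispatches to the handler installed above.
    # (In practice this coroutine would need to run on the worker's event loop.)
    worker = get_worker()
    return await worker.scheduler.shuffle_register(id=id)
```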
        return self.erred(ShuffleId(id), key)

# Task completed
if start in ("waiting", "processing") and finish in (
This condition feels brittle. What motivates the selective start states?
Checking the start state is probably unnecessary. All we really care about is that the key is in a terminal state.
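In code, the simplification amounts to something like this (terminal states copied from the condition quoted above):

```python
TERMINAL_STATES = ("memory", "released", "erred")


def reached_terminal_state(finish: str) -> bool:
    # Only the finish state matters; where the task transitioned from doesn't.
    return finish in TERMINAL_STATES
```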
I like the spirit of this, because you're embedding some information at graph-generation time that lets you identify the partition number of an output task through something else than parsing its key. However, as noted in https://github.com/dask/distributed/pull/5524/files#r765203083 and #5524 (comment) this isn't possible right now:
Whether we use resource restriction annotations, or some custom shuffle annotation that then gets translated into worker restrictions, it's basically the same thing. (The scheduling mechanism for worker restrictions is slightly simpler than for resource restrictions, so I'd prefer sticking with that path.) It's all just a way of passing along auxiliary data so we have more information to go off of than just the key names.
This is definitely a proper blockwise optimization. And we really, really want subsequent blockwise optimizations to fuse onto it, otherwise we get root-task overproduction.
I don't think so. Consider:

    df = dd.read_parquet(...)
    df_pre = df.map_partitions(preprocess)
    with dask.annotate(resources={"GPU": 1}):
        inferred = df_pre.map_partitions(run_ml_model)
    df_post = inferred.map_partitions(post_process)

Without the resource annotations, this whole thing would Blockwise-fuse into one layer. But our case would use resources in basically the same way:

    transfer = df.map_partitions(transfer)
    barrier = delayed(barrier)(transfer)
    with dask.annotate(resources={f"shuffle-{id}-{i}": 1 for i in range(transfer.npartitions)}):
        unpack = dd.map_partitions(unpack, BlockwiseRange(transfer.npartitions), barrier)
    downstream = unpack.map_partitions(user_code)

However, we do want `downstream` to fuse onto `unpack`. I think we'll end up needing something like the meta-annotation idea mentioned earlier.
Thanks for your replies. I don't think what we would like to do in this PR is actually very special. I could see similar applications down the road that require us to slightly change scheduling logic for specific keys. Therefore I think this PR is a nice example and it should be considered as a requirement for future iterations of the HLE/HLG/annotation engine. I'll have another peek at the other PR shortly, but right now it feels like we should postpone the scheduler plugin until we have a more robust annotation / HLG / HLE engine.
I think what you're really talking about is postponing handling of Blockwise fusion of the output tasks, since handling fused output tasks is what requires us to parse keys right now, and parsing keys is what makes it more reasonable to use a scheduler plugin, instead of RPCs to the scheduler from within tasks like the other PR (because the scheduler would have to send a bunch of key names to the task, which would then parse them and send them back to the scheduler—might as well just do that all on the scheduler and save the communication). I get this from a simplicity perspective, but I disagree for a few reasons:
(Also, the logic in the other PR is even more egregious than parsing keys, because it just generates the key names it expects based on the hardcoded name of the unpack function. This is highly brittle.)
All this said, I could still see the rationale in merging the other PR first, then making this a new PR onto that (though the diff would be very hard to read) in order to have a place for more discussion about why we're moving logic to the scheduler. Or even for doing so incrementally. It would add a bit of extra work, but maybe that's worth it for the clearer process.
We've decided to merge #5520 as the initial skeleton instead of this one. |
Alternative to #5520: a peer-to-peer shuffle skeleton, based on a scheduler plugin to handle much of the synchronization.
There are some hacky things, but generally I think this shows more promise than #5520. Things this can do that the other PR cannot:
This also doesn't match the design in #5435 (in reality, a bit of it needed to change to work with the scheduler driving things), but overall, I like this design more than #5520. It also feels a little easier to build off of.
This needs unit tests, especially for the more exciting logic around concurrent shuffles, waiting for the scheduler, etc. But in the basic tests, it seems to shuffle correctly, including sequential and concurrent shuffles, and also passes `test_shuffle.py`.
Since this design differs a little from #5435, here's a rough diagram of the flow:
cc @fjetter
- Passes `pre-commit run --all-files`