
[Docs] Clarify failure modes of multiple workers and clusters #319

Open
tarqd opened this issue Feb 10, 2020 · 2 comments

Comments


tarqd commented Feb 10, 2020

After reading through the docs, I'm a little unclear on what happens should a worker fail (process/thread dies, network partition, etc.).

Main concerns:

  • Does the library panic?
  • If not, how can I detect failure?
  • Can I still receive partial data from live workers if I partition my data appropriately?

If this is documented and I've simply missed it, I apologize in advance :)

@frankmcsherry (Member)

The intended behavior is that everything panics. This has not always been the case, but there is code in place to intentionally propagate panics across worker and communication threads, and across the network. If one worker crashing leaves the system in a "hung" state, that's a bug.
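(As an illustration of the underlying std primitive — this is a minimal sketch, not timely's actual propagation machinery: in plain Rust, a panic in a worker thread surfaces as an `Err` from `JoinHandle::join`, which a coordinating thread can detect and re-raise to take the whole process down.)

```rust
use std::thread;

// Sketch only (not timely's actual mechanism): a panic in a spawned
// worker thread is reported as an Err from JoinHandle::join, so a
// coordinator that joins its workers can observe the failure.
fn detect_worker_panic() -> bool {
    let handle = thread::spawn(|| {
        panic!("worker crashed"); // simulated worker failure
    });
    // join() returns Err(payload) if the spawned thread panicked
    handle.join().is_err()
}
```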

There is no story for recovering partial data from live workers. It's a legit thing to think about.

The general "philosophy" has been that fault-tolerance and failure recovery is for a higher-level framework to support (one that has more opinions on what constitutes "correct behavior"). The analogy here is that your OS is some degree less "fault tolerant" than the DB that lives on top of it, because the former is less clear on what it should do when something goes wrong. Same thing here (differential dataflow would be the more sane place to talk about building in fault tolerance, and recovery of state).

I'll leave the issue open because I'm guessing the docs don't reflect these expectations.


tarqd commented Feb 10, 2020

Thank you for clarifying! I completely agree that fault tolerance is outside the scope of this library; arguably it'd be pretty hard to put even into differential-dataflow (sane behavior is very dependent on your dataflows, I'd imagine).

Can you override the "panic!" behavior from a higher-level library? Combined with a custom exchange operator, it seems like you could implement anything you want.
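For what it's worth, at the process level std Rust does let you intercept an unwinding panic with `std::panic::catch_unwind`. A minimal sketch (the `run_with_recovery` wrapper is hypothetical, not part of timely's API):

```rust
use std::panic;

// Hypothetical wrapper (not timely API): convert an unwinding panic into
// a Result so a supervising layer can decide how to react instead of
// letting the process die. Caveats: this only catches unwinding panics in
// the local process; it does not help with `panic = "abort"` builds or
// with failures on remote workers.
fn run_with_recovery<F, T>(f: F) -> Result<T, ()>
where
    F: FnOnce() -> T + panic::UnwindSafe,
{
    panic::catch_unwind(f).map_err(|_| ())
}
```

So `run_with_recovery(|| 42)` yields `Ok(42)`, while a closure that panics yields `Err(())` instead of crashing the caller.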

I can imagine, for certain use cases, a world where workers could implement a full or incremental state transfer similar to what Galera does when a node rejoins the cluster.

For example, imagine dataflows implementing a distributed policy/decision engine. Workers could advertise having the ability to evaluate a subset of policies.

With some (read: a lot of) effort, you could create a system that lets you dynamically add workers to the cluster and replicate the state required to evaluate policies based on load, perhaps dynamically scheduling/replicating policy data to nodes as you go.
