
[Docs] Clarify failure modes of multiple workers and clusters #319

Open
tarqd opened this issue Feb 10, 2020 · 2 comments

Comments


tarqd commented Feb 10, 2020

After reading through the docs, I'm a little unclear on what happens should a worker fail (process/thread dies, network partition, etc.).

Main concerns:

  • Does the library panic?
  • If not, how can I detect failure?
  • Can I still receive partial data from live workers if I partition my data appropriately?

If this is documented and I've simply missed it, I apologize in advance :)

@frankmcsherry (Member)

The intended behavior is that everything panics. This has not always been the case, but there is code in place to intentionally propagate panics across worker and communication threads, and across the network. If one worker crashing leaves the system in a "hung" state, that's a bug.
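(As an illustration of the underlying std primitive — this is a minimal sketch, not timely's actual propagation machinery: in plain Rust, a panic in a worker thread surfaces as an `Err` from `JoinHandle::join`, which a coordinating thread can detect and re-raise to take the whole process down.)

```rust
use std::thread;

// Sketch only (not timely's actual mechanism): a panic in a spawned
// worker thread is reported as an Err from JoinHandle::join, so a
// coordinator that joins its workers can observe the failure.
fn detect_worker_panic() -> bool {
    let handle = thread::spawn(|| {
        panic!("worker crashed"); // simulated worker failure
    });
    // join() returns Err(payload) if the spawned thread panicked
    handle.join().is_err()
}
```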

There is no story for recovering partial data from live workers. It's a legit thing to think about.

The general "philosophy" has been that fault-tolerance and failure recovery is for a higher-level framework to support (one that has more opinions on what constitutes "correct behavior"). The analogy here is that your OS is some degree less "fault tolerant" than the DB that lives on top of it, because the former is less clear on what it should do when something goes wrong. Same thing here (differential dataflow would be the more sane place to talk about building in fault tolerance, and recovery of state).

I'll leave the issue open because I'm guessing the docs don't reflect these expectations.


tarqd commented Feb 10, 2020

Thank you for clarifying! I completely agree that fault tolerance is outside the scope of this library; arguably it'd be pretty hard to put even into differential-dataflow (sane behavior is very dependent on your dataflows, I'd imagine).

Can you override the "panic!" behavior from a higher-level library? Combined with a custom exchange operator, it seems like you could implement anything you want.
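For what it's worth, at the process level std Rust does let you intercept an unwinding panic with `std::panic::catch_unwind`. A minimal sketch (the `run_with_recovery` wrapper is hypothetical, not part of timely's API):

```rust
use std::panic;

// Hypothetical wrapper (not timely API): convert an unwinding panic into
// a Result so a supervising layer can decide how to react instead of
// letting the process die. Caveats: this only catches unwinding panics in
// the local process; it does not help with `panic = "abort"` builds or
// with failures on remote workers.
fn run_with_recovery<F, T>(f: F) -> Result<T, ()>
where
    F: FnOnce() -> T + panic::UnwindSafe,
{
    panic::catch_unwind(f).map_err(|_| ())
}
```

So `run_with_recovery(|| 42)` yields `Ok(42)`, while a closure that panics yields `Err(())` instead of crashing the caller.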

I can imagine, for certain use cases, a world where workers could implement a full or incremental state transfer similar to what Galera does when a node rejoins the cluster.

For example, imagine dataflows implementing a distributed policy/decision engine. Workers could advertise having the ability to evaluate a subset of policies.

With some (read: a lot of) effort, you could create a system that lets you dynamically add workers to the cluster and replicate the state required to evaluate policies based on load, perhaps dynamically scheduling/replicating policy data to nodes as you go.
