-
Notifications
You must be signed in to change notification settings - Fork 274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Docs] Clarify failure modes of multiple workers and clusters #319
Comments
The intended behavior is that everything panics. This has not always been the case, but there is code in place to intentionally propagate panics across worker and communication threads, and across the network. If one worker crashing leaves the system in a "hung" state that's a bug. There is no story for recovering partial data from live workers. It's a legit thing to think about. The general "philosophy" has been that fault-tolerance and failure recovery is for a higher-level framework to support (one that has more opinions on what constitutes "correct behavior"). The analogy here is that your OS is some degree less "fault tolerant" than the DB that lives on top of it, because the former is less clear on what it should do when something goes wrong. Same thing here (differential dataflow would be the more sane place to talk about building in fault tolerance, and recovery of state). I'll leave the issue open because I'm guessing the docs don't reflect these expectations. |
Thank you for clarifying! Completely agree that fault tolerance is outside the scope of this library, arguably it'd be pretty hard to put in differential-dataflow even (sane behavior is very dependent on your dataflows, I'd imagine). Can you override the "panic!" behavior from a higher level library? With that combined with a custom exchange operator, it seems like you can implement anything you want. I can imagine for certain use-cases a world where workers could implement a full / incremental state transfer similar to what Galera does when a node is rejoining the cluster. For example, imagine dataflows implementing a distributed policy/decision engine. Workers could advertise being having the ability to evaluate a subset of policies. With some (read: a lot) of effort, you could create a system that allows you to dynamically add workers to the cluster and replicate the state required to evaluate policies based on load. Perhaps dynamically scheduling/replicating policy data to nodes as you go. |
After reading through the docs I'm a little unclear on what happens should a worker fail (process/thread dies, network partition, etc).
Main concerns:
If this is documented and I've simply missed it, I apologize in advance :)
The text was updated successfully, but these errors were encountered: