Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failover problem in cluster mode #184

Closed
qiuxiafei opened this issue Jul 30, 2018 · 13 comments
Closed

Failover problem in cluster mode #184

qiuxiafei opened this issue Jul 30, 2018 · 13 comments

Comments

@qiuxiafei
Copy link
Contributor

qiuxiafei commented Jul 30, 2018

Hi, @frankmcsherry !

I've already setup timely as a long-running service to serve graph queries, each process holding a portion of graph data. But when some processes get down, workers and I/O thread in other processes seem simply ignore this event and keep running.
To address this problem, I have a simple workaround which shuts down all other processes by force and relaunches them again. This works but introduces extensive data loading work, including disk I/O, network traffic, etc.
It could be graceful if normal processes can be notified about the dead process and let their workers and I/O thread exit and gives the control back to user code. The user code then make a choice to exit too or wait for the dead process to be relaunched somewhere else. This may looks like:

loop {
    let guards = Timely::execute(/*...*/);
    let results = guards.join();
    if let Err(OtherProcessDown) = results {
        // wait for relaunched
    }
}
@frankmcsherry
Copy link
Member

Hello!

I think to start a good thing is to aim for clear behavior when there are crash failures, and I think the intended behavior is: if a worker crashes and any other worker attempts to send data to them, that worker will go down too. This is meant to effect some fate sharing, and I think it usually works. At least, I haven't had to manually shut down a distributed computation in quite a while.

What can happen is if your workers for whatever reason stop sending at a remote worker, perhaps they are effectively blocked on that worker waiting for it to send some progress data, then the system can hang. The reason here, I think, is that receiving from a closed socket is not fundamentally wrong, in that the remote worker may have completed successfully, and your worker just hasn't received all of the information about the clean exit yet (inbound from other workers).

Can you say more about why your workers are crashing? Is this timely code crashing, or user code? One of the reasons fault-tolerance hasn't been as painful for us as you might expect is that we see all the bugs and have to fix them; if you have some, let us know!

@frankmcsherry
Copy link
Member

Ooo, I forgot to say: we could certainly try and flesh out the "protocol", which currently starts with a worker's id and then just has random data until it closes, by adding some more messages about clean shutdown at least. This code is currently under a bit of work, but I'll try and make sure this is easy-ish to add.

@qiuxiafei
Copy link
Contributor Author

Hi!

I completely agree on your description about crash behavior, normal workers should get notified about crash events.

And as you've mentioned, normal workers may hang on a closed socket, which is possible. How about not just judge other workers' status by socket error implicitly and introduce something like heartbeat among workers?

We've been using timely for months, it's quite rare to see crashed caused by timely (very good ^_^). But in the system we're working on recently, there's a complex multi-version storage system which use unsafe raw pointer in rust. And we also have arbitrary user queries which may produce large amount of intermediate messages or state. And there may be User-Defained-Functions containing arbitrary user code which is beyond our control.
We are pushing the system to production, on one hand, we have to make effert to make our program bug-free (but usually impossible), and on the other hand, our system must have the capability to handle abnormal crashes. Both of these matter!

@frankmcsherry
Copy link
Member

A heartbeat sounds like a good idea. I think that this can be done without changing the system. If you create a new dataflow with just an input, and each worker ticks forward its timestamp periodically (you get to choose) this should create a message to send to each other worker, and a send to a closed socket should result in a panic.

If that doesn't work, it suggests something is wrong with the "sends to closed workers generate errors" implementation, and that is a bug we should fix. Is this something you can test and see if it resolves your issue?

With respect to returning to user code, I think we should be able to use the worker guards to do this, like in your example above. I haven't actually tried this though. Is it something that you know does not work? I think the .join() method on WorkerGuard should return a Vec<Result> from all of the worker threads, though perhaps the problem is that we are exploding the wrong threads (e.g. the communication threads, rather than worker threads).

@qiuxiafei
Copy link
Contributor Author

Yes, we've already had such a dataflow sending messages periodically, who's responsibility is to dispatch new dataflow description to all workers.
And we are now introducing an atomic state shared among workers and IO threads. It will be written by IO threads if there are errors and checked by workers in the stepping loops and checked by other IO threads too. This idea seems works and @bmmcq is working on a prototype. Maybe we can see some performance numbers soon.

@frankmcsherry
Copy link
Member

Just to check, you had such a dataflow and it did not result in worker shutdown when one of the workers failed? That would definitely be a bug (or a misunderstanding on my part).

The networking code is undergoing a bit of a revamp now, and should be a bit more coherent when it emerges. One of the problems is that one must distinguish between network shutdowns that are clean and responsible, and shutdowns that are sudden and wrong. A broken socket isn't fundamentally wrong unless i) it is a send socket and you still have data to send, or ii) it is a receive socket and the remote worker hasn't itself shut down.

Anyhow, the new networking code runs at the moment and we are just looking at some perf issues, so perhaps it will land relatively soon!

@qiuxiafei
Copy link
Contributor Author

Just to check, you had such a dataflow and it did not result in worker shutdown when one of the workers failed? That would definitely be a bug (or a misunderstanding on my part).

Maybe some misunderstanding here. What we have is just a dataflow broadcasting thing around. It's just a normal dataflow. Timely's current action on failed worker is as my first post, and it still holds:

But when some processes get down, workers and I/O thread in other processes seem simply ignore this event and keep running.

What we are trying to do is to let the IO threads broadcast network errors and stop workers. What we've missed is thant sometimes broken socket isn't wrong, because we've only consided service mode only.

@frankmcsherry
Copy link
Member

My current plan on this (over in the dataplane_mockup branch) is to have communications end with a hang-up message, so that receivers can determine if they received a clean disconnection or not. If they receive a clean disconnection the recv thread shuts down, and causes no problems. If they receive an unclean disconnection they can send a message to each of the workers, as if sending data but a different Result variant, and the workers can do whatever they like (probably panic, in a first version of this); they should probably find a way to cause their outgoing send thread to shutdown abnormally for fate-sharing purposes.

Both the send and recv threads use the same socket object (cloned) but I don't have a good sense for the partial failure models. E.g. is it possible that the send half of a socket can close locally without the recv half closing? The send thread panicking might remain a thing for a while to make sure to cover these cases.

@qiuxiafei
Copy link
Contributor Author

I did some experiments. Cloned TpcStreams can work independently. If the read half gets closed, the write half still works and vice versa.

@frankmcsherry
Copy link
Member

frankmcsherry commented Aug 13, 2018

Btw, the dataplane_mockup branch now has the "abort if not clean socket closure" detail put in. There are breaking changes for operator implementations (as part of the zero-copy interfaces), but if you are keen to give it a go at some point I could walk you through porting things (also noted in the PR, but perhaps not in sufficient detail).

@qiuxiafei
Copy link
Contributor Author

Thanks. I'll go through the PR.

@frankmcsherry
Copy link
Member

The dataplane_mockup branch should be landing soon, and has panic propagation between the network and worker threads. This should mean that if a worker panics it will cause a network shutdown, which will cause other receive threads to panic, which will cause all of their workers to panic.

It is a bit more awkward to get panic propagation within the process (sending to a panicking thread should panic, but just attempting to receive from it will not). We should be able to get that in by switching the Process allocator around, but it's not there yet (next step, let's say).

@frankmcsherry
Copy link
Member

#135 has landed in master and should address the issue of workers "hanging" in the case of a crash, at least in the case that networking is involved. There is still the case that it can happen when one uses only a single process (but multiple workers).

This doesn't fully solve your question about returning control to the main thread, which seems like it could be possible with careful checking of the worker guards. I have seen some aborts due to panics within panics, and this would also frustrate "returning control" rather than just exiting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants