Failover problem in cluster mode #184

qiuxiafei · 2018-07-30T06:44:44Z

I've already setup timely as a long-running service to serve graph queries, each process holding a portion of graph data. But when some processes get down, workers and I/O thread in other processes seem simply ignore this event and keep running.
To address this problem, I have a simple workaround which shuts down all other processes by force and relaunches them again. This works but introduces extensive data loading work, including disk I/O, network traffic, etc.
It could be graceful if normal processes can be notified about the dead process and let their workers and I/O thread exit and gives the control back to user code. The user code then make a choice to exit too or wait for the dead process to be relaunched somewhere else. This may looks like:

loop {
    let guards = Timely::execute(/*...*/);
    let results = guards.join();
    if let Err(OtherProcessDown) = results {
        // wait for relaunched
    }
}

frankmcsherry · 2018-07-31T08:07:31Z

Hello!

I think to start a good thing is to aim for clear behavior when there are crash failures, and I think the intended behavior is: if a worker crashes and any other worker attempts to send data to them, that worker will go down too. This is meant to effect some fate sharing, and I think it usually works. At least, I haven't had to manually shut down a distributed computation in quite a while.

What can happen is if your workers for whatever reason stop sending at a remote worker, perhaps they are effectively blocked on that worker waiting for it to send some progress data, then the system can hang. The reason here, I think, is that receiving from a closed socket is not fundamentally wrong, in that the remote worker may have completed successfully, and your worker just hasn't received all of the information about the clean exit yet (inbound from other workers).

Can you say more about why your workers are crashing? Is this timely code crashing, or user code? One of the reasons fault-tolerance hasn't been as painful for us as you might expect is that we see all the bugs and have to fix them; if you have some, let us know!

frankmcsherry · 2018-07-31T08:08:45Z

Ooo, I forgot to say: we could certainly try and flesh out the "protocol", which currently starts with a worker's id and then just has random data until it closes, by adding some more messages about clean shutdown at least. This code is currently under a bit of work, but I'll try and make sure this is easy-ish to add.

qiuxiafei · 2018-07-31T14:58:44Z

Hi!

I completely agree on your description about crash behavior, normal workers should get notified about crash events.

And as you've mentioned, normal workers may hang on a closed socket, which is possible. How about not just judge other workers' status by socket error implicitly and introduce something like heartbeat among workers?

We've been using timely for months, it's quite rare to see crashed caused by timely (very good ^_^). But in the system we're working on recently, there's a complex multi-version storage system which use unsafe raw pointer in rust. And we also have arbitrary user queries which may produce large amount of intermediate messages or state. And there may be User-Defained-Functions containing arbitrary user code which is beyond our control.
We are pushing the system to production, on one hand, we have to make effert to make our program bug-free (but usually impossible), and on the other hand, our system must have the capability to handle abnormal crashes. Both of these matter!

frankmcsherry · 2018-08-01T07:45:01Z

A heartbeat sounds like a good idea. I think that this can be done without changing the system. If you create a new dataflow with just an input, and each worker ticks forward its timestamp periodically (you get to choose) this should create a message to send to each other worker, and a send to a closed socket should result in a panic.

If that doesn't work, it suggests something is wrong with the "sends to closed workers generate errors" implementation, and that is a bug we should fix. Is this something you can test and see if it resolves your issue?

With respect to returning to user code, I think we should be able to use the worker guards to do this, like in your example above. I haven't actually tried this though. Is it something that you know does not work? I think the .join() method on WorkerGuard should return a Vec<Result> from all of the worker threads, though perhaps the problem is that we are exploding the wrong threads (e.g. the communication threads, rather than worker threads).

qiuxiafei · 2018-08-03T02:09:18Z

Yes, we've already had such a dataflow sending messages periodically, who's responsibility is to dispatch new dataflow description to all workers.
And we are now introducing an atomic state shared among workers and IO threads. It will be written by IO threads if there are errors and checked by workers in the stepping loops and checked by other IO threads too. This idea seems works and @bmmcq is working on a prototype. Maybe we can see some performance numbers soon.

frankmcsherry · 2018-08-03T05:34:45Z

Just to check, you had such a dataflow and it did not result in worker shutdown when one of the workers failed? That would definitely be a bug (or a misunderstanding on my part).

The networking code is undergoing a bit of a revamp now, and should be a bit more coherent when it emerges. One of the problems is that one must distinguish between network shutdowns that are clean and responsible, and shutdowns that are sudden and wrong. A broken socket isn't fundamentally wrong unless i) it is a send socket and you still have data to send, or ii) it is a receive socket and the remote worker hasn't itself shut down.

Anyhow, the new networking code runs at the moment and we are just looking at some perf issues, so perhaps it will land relatively soon!

qiuxiafei · 2018-08-03T09:17:22Z

Just to check, you had such a dataflow and it did not result in worker shutdown when one of the workers failed? That would definitely be a bug (or a misunderstanding on my part).

Maybe some misunderstanding here. What we have is just a dataflow broadcasting thing around. It's just a normal dataflow. Timely's current action on failed worker is as my first post, and it still holds:

But when some processes get down, workers and I/O thread in other processes seem simply ignore this event and keep running.

What we are trying to do is to let the IO threads broadcast network errors and stop workers. What we've missed is thant sometimes broken socket isn't wrong, because we've only consided service mode only.

frankmcsherry · 2018-08-06T13:56:15Z

My current plan on this (over in the dataplane_mockup branch) is to have communications end with a hang-up message, so that receivers can determine if they received a clean disconnection or not. If they receive a clean disconnection the recv thread shuts down, and causes no problems. If they receive an unclean disconnection they can send a message to each of the workers, as if sending data but a different Result variant, and the workers can do whatever they like (probably panic, in a first version of this); they should probably find a way to cause their outgoing send thread to shutdown abnormally for fate-sharing purposes.

Both the send and recv threads use the same socket object (cloned) but I don't have a good sense for the partial failure models. E.g. is it possible that the send half of a socket can close locally without the recv half closing? The send thread panicking might remain a thing for a while to make sure to cover these cases.

qiuxiafei · 2018-08-07T12:32:20Z

I did some experiments. Cloned TpcStreams can work independently. If the read half gets closed, the write half still works and vice versa.

frankmcsherry · 2018-08-13T14:10:31Z

Btw, the dataplane_mockup branch now has the "abort if not clean socket closure" detail put in. There are breaking changes for operator implementations (as part of the zero-copy interfaces), but if you are keen to give it a go at some point I could walk you through porting things (also noted in the PR, but perhaps not in sufficient detail).

qiuxiafei · 2018-08-17T02:27:28Z

Thanks. I'll go through the PR.

frankmcsherry · 2018-09-04T12:29:47Z

The dataplane_mockup branch should be landing soon, and has panic propagation between the network and worker threads. This should mean that if a worker panics it will cause a network shutdown, which will cause other receive threads to panic, which will cause all of their workers to panic.

It is a bit more awkward to get panic propagation within the process (sending to a panicking thread should panic, but just attempting to receive from it will not). We should be able to get that in by switching the Process allocator around, but it's not there yet (next step, let's say).

frankmcsherry · 2018-09-16T07:53:20Z

#135 has landed in master and should address the issue of workers "hanging" in the case of a crash, at least in the case that networking is involved. There is still the case that it can happen when one uses only a single process (but multiple workers).

This doesn't fully solve your question about returning control to the main thread, which seems like it could be possible with careful checking of the worker guards. I have seen some aborts due to panics within panics, and this would also frustrate "returning control" rather than just exiting.

frankmcsherry mentioned this issue Aug 2, 2018

To handle TCP broken pipe errors #185

Closed

qiuxiafei closed this as completed Aug 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failover problem in cluster mode #184

Failover problem in cluster mode #184

qiuxiafei commented Jul 30, 2018 •

edited

Loading

frankmcsherry commented Jul 31, 2018

frankmcsherry commented Jul 31, 2018

qiuxiafei commented Jul 31, 2018

frankmcsherry commented Aug 1, 2018

qiuxiafei commented Aug 3, 2018

frankmcsherry commented Aug 3, 2018

qiuxiafei commented Aug 3, 2018

frankmcsherry commented Aug 6, 2018

qiuxiafei commented Aug 7, 2018

frankmcsherry commented Aug 13, 2018 •

edited

Loading

qiuxiafei commented Aug 17, 2018

frankmcsherry commented Sep 4, 2018

frankmcsherry commented Sep 16, 2018

Failover problem in cluster mode #184

Failover problem in cluster mode #184

Comments

qiuxiafei commented Jul 30, 2018 • edited Loading

frankmcsherry commented Jul 31, 2018

frankmcsherry commented Jul 31, 2018

qiuxiafei commented Jul 31, 2018

frankmcsherry commented Aug 1, 2018

qiuxiafei commented Aug 3, 2018

frankmcsherry commented Aug 3, 2018

qiuxiafei commented Aug 3, 2018

frankmcsherry commented Aug 6, 2018

qiuxiafei commented Aug 7, 2018

frankmcsherry commented Aug 13, 2018 • edited Loading

qiuxiafei commented Aug 17, 2018

frankmcsherry commented Sep 4, 2018

frankmcsherry commented Sep 16, 2018

qiuxiafei commented Jul 30, 2018 •

edited

Loading

frankmcsherry commented Aug 13, 2018 •

edited

Loading