
Fix a couple of reliability problems in the first example #1

Closed

Conversation

matklad

@matklad matklad commented Apr 26, 2020

First, remove explicit Exit messages. They are unnecessary, as the
channel's closure is a perfectly fine exit message in itself. They are
harmful, because they add a separate code path (as you need to handle
both the channel's closure and the exit event). They are doubly harmful,
because these different code paths correspond to normal and abnormal
termination. Abnormal termination is never tested in practice, and
using a completely separate code path for it is a bad idea, because
that's how complex distributed systems fail:

https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf

This also just makes the code shorter, because we can write a for loop.

Second, make sure that we join the worker thread. That ensures we
don't leak the thread outside of the scope of the test, and that we also
propagate its panic, if any. Note that the current solution with an
explicit .join is still wrong, because the thread won't be joined on
panic, but handling that requires more code or a custom crate.

Third, unwrap all sends. A send only fails if the receiving side has
terminated abnormally. In this case, we should escalate the failure rather
than continue business as usual, lest we create a deadlock or resource
leak.

gterzian and others added 2 commits April 26, 2020 18:20
@gterzian
Owner

gterzian commented May 12, 2020

Thanks for this PR, and sorry for having closed it; that was accidental and, I think, a result of force-pushing to the branch.

First, remove explicit Exit messages.

I disagree. There are cases where you keep a sender around until shutdown. Real-world use case: in Servo, many components keep many senders around (see the constellation, for example), sometimes to themselves, in order to pass clones around to new components that are started from time to time. So in those situations, senders will be kept around until shutdown, and explicit exit messages are used instead.

I think code can make a difference between the normal "exit" path and the abnormal "the sender was dropped" path, and I think one can test for that too.

I actually made a point of including explicit exit messages in the examples, because I think relying on dropping channels only works in some cases, usually pretty small and focused pieces of code, and that it's not an approach that scales well to larger projects with more complicated workflows (so even though the example at hand is simple, I'd rather make a point by including an exit workflow that I think is a better idea to follow in general).

This also just makes the code shorter, because we can write a for loop.

I don't see that as a big advantage, because in the real world there will be more than one message variant anyway, so you'll have to match on those instead of using a loop on a single message variant.

Second, make sure that we join worker thread.

Maybe, although in this example I don't think it matters much. I think explicit exit workflows can actually ensure the thread isn't "leaked", and also I could have unwrapped the creation of the thread, but I don't care much about the failure case in this example.

I would only do it if the point of the exercise was to show how to handle it, and as you point out fully handling this would require some complexity that is out of the scope of this exercise.

Third, unwrap all sends.

Again, I don't think it matters much, and it's not something that you always want to do in the real world either.

we should escalate failure rather
than continue business as usual, lest we create a deadlock/resource
leak.

I think this will depend on the case at hand, and it's mostly outside of the scope of the test. And simply not unwrapping a send, in and of itself, isn't an anti-pattern in my opinion. The constellation in Servo is again an example of a component where we do not want to unwrap on failing sends (although some of those sends are actually over IPC, where failure on the other end is perhaps more recoverable from than if it were in the same process).

So whether unwrapping a send makes sense or not, I think depends on the actual use-case.

In conclusion, I think some of your changes might be totally valid in some cases, and in others they might not be. I don't agree that not following your suggestions amounts to an anti-pattern in and of itself, and I actually think that, in the context of the current examples, it would unnecessarily complicate matters or reduce the intended focus of the code example.

@matklad
Author

matklad commented May 12, 2020

I think we can agree to disagree :-) I haven't studied Servo's constellation in detail, but I am willing to bet that it uses channels wrong. On the one hand, this is a stupidly bold statement for me to make, as I don't know the use-case well. On the other hand, I have repeatedly had the experience of refactoring convoluted exit-message-based communication into streamlined drop-based communication, and this experience has convinced me there's some kind of categorical law of nature here (specifically, that observer is the dual of iterator :P): the drop-based shape of code is right. This seems to be correlated with the complexity of the code in question -- the more complicated the code is, the bigger the savings from getting rid of Exit messages. My gut feeling is that the "Servo needs exit messages because it is complicated" implication is wrong; I think it's more like "keeping a complex codebase simple is very hard, because complexity just accretes with time". Again, I might be wrong here, but my gut feeling is that it's Servo who is wrong :-)

Specifically, I think "senders will be kept around until shutdown, and instead explicit exit messages are used" is wrong. I believe a better solution is either of the two:

  • keep the sender as Ctx { sender: Sender<T> }, and rely on dropping Ctx to close the sender
  • keep the sender as Ctx { sender: Option<Sender<T>> }, which allows stopping/starting the sender at a more fine-grained level, and also makes "the peer might be shut down" explicit at the call site (i.e., the bug of sending further messages after termination is statically impossible)

I don't have enough free time on my hands to dig into the real Servo code-base, but I would like to look at a smaller toy example that has all the essential complexity.

@gterzian
Owner

gterzian commented May 12, 2020

I can agree that a drop-based exit workflow can be elegant, and I disagree with the statement that all use of explicit exit workflows is an anti-pattern in and of itself.

Having said that, clean shutdown of Servo still is a major challenge.

And I also think this challenge is not a result of people having somehow missed the potential of "just exit when senders drop". This approach is actually followed at times, when appropriate, see for example https://github.com/servo/servo/blob/c5617efff0f77ce79a7827668cf929def2d9337a/components/background_hang_monitor/background_hang_monitor.rs#L246

I think I agree with you that a system where dropping of senders creates a reliable shutdown signal that travels across components is probably optimal, and it's just that in Servo it sometimes doesn't work.

There are cases where senders are kept alive, because they need to be cloned and passed to newly spawned components.

I guess we could do something that would still allow relying on dropping for exiting; however, in these cases I think it might actually be more complicated than an actual exit message.

I can't evaluate your suggestion with those "contexts", although they appear to me as client-side concepts, whereas in Servo it's more like components sometimes keeping senders to themselves alive (not client components keeping those alive, although that might also be the case in some places), and of course you are welcome to take a look and provide more concrete suggestions.

@matklad
Author

matklad commented May 12, 2020

whereas in Servo it's more like components sometimes keeping senders to themselves alive

I think I've seen this pattern in channel code I was working with, and it indeed took me a while to untangle it. I am not sure what your case is exactly, but mine does sound pretty similar.

We have a component that handles messages. The messages can be either commands from the outside world, or responses from the workers spawned by the component itself. So the component works by draining a receiver, but it also holds a sender for this receiver, and it clones this sender when it spawns a new worker:

use std::thread;

use crossbeam_channel::{unbounded, Sender};

#[derive(Debug)]
enum ComponentMessage {
    FromOutside,
    FromWorker,
}

struct Component {
    sender: Sender<ComponentMessage>,
}

impl Component {
    fn start() -> (Sender<ComponentMessage>, thread::JoinHandle<()>) {
        let (sender, receiver) = unbounded();
        let thread = thread::spawn({
            let sender = sender.clone();
            move || {
                let component = Component {
                    sender: sender.clone(),
                };
                component.spawn_worker();
                for msg in receiver {
                    println!("{:?}", msg);
                }
            }
        });

        (sender, thread)
    }

    fn spawn_worker(&self) {
        let _thread = thread::spawn({
            let sender = self.sender.clone();
            move || {
                for _ in 0..10 {
                    sender.send(ComponentMessage::FromWorker).unwrap()
                }
            }
        });
    }
}

fn main() {
    let (sender, handle) = Component::start();
    sender.send(ComponentMessage::FromOutside).unwrap();
    drop(sender);
    println!("joining ...");
    handle.join().unwrap();
    println!("... done!"); // nope, will deadlock :-(
}

The problem I had with this system is that shutdown was tricky. Basically, the whole system is a giant feedback loop, because receiving a message from the main channel can indirectly push new messages into the receiver. Even if you close the incoming-from-the-outside part of the channel, you still have an internal cycle between the component and the workers, and this cycle can pump messages indefinitely. You can forcibly stop the feedback loop by just cutting the cycle at an arbitrary point, but that means that in-progress messages would be dropped on the floor, which might be OK, but won't be elegant.

After I realized that the core problem with shutdown is this "feedback loop" effect, I figured out a way to structure the code so that the loop does not arise in the first place. The idea is to use separate channels for communication with the outside world and for communication with the workers, and select! between the two. That way, you can notice that the outside world closed the channel, and then drain the pending messages correctly.

use std::thread;

use crossbeam_channel::{select, unbounded, Sender};

#[derive(Debug)]
struct FromOutside;

// Note, these are now private
#[derive(Debug)]
struct FromWorker;

#[derive(Debug)]
enum ComponentMessage {
    FromOutside(FromOutside),
    FromWorker(FromWorker),
}

struct Component {
    sender: Sender<FromWorker>,
}

impl Component {
    fn start() -> (Sender<FromOutside>, thread::JoinHandle<()>) {
        let (sender, receiver) = unbounded();
        let thread = thread::spawn({
            move || {
                let (worker_sender, worker_receiver) = unbounded();
                let component = Component {
                    sender: worker_sender,
                };

                component.spawn_worker();
                loop {
                    let msg = select! {
                        recv(worker_receiver) -> msg => {
                            ComponentMessage::FromWorker(msg.unwrap())
                        }
                        recv(receiver) -> msg => {
                            match msg {
                                Ok(msg) => ComponentMessage::FromOutside(msg),
                                Err(_) => {
                                    drop(receiver);
                                    break
                                },
                            }
                        }
                    };
                    println!("{:?}", msg);
                }
                drop(component);
                for msg in worker_receiver {
                    eprintln!("{:?}", msg)
                }
            }
        });

        (sender, thread)
    }

    fn spawn_worker(&self) {
        let _thread = thread::spawn({
            let sender = self.sender.clone();
            move || {
                for _ in 0..10 {
                    sender.send(FromWorker).unwrap()
                }
            }
        });
    }
}

fn main() {
    let (sender, handle) = Component::start();
    sender.send(FromOutside).unwrap();
    drop(sender);
    println!("joining ...");
    handle.join().unwrap();
    println!("... done!")
}

The drop-based cancellation here is a side effect; the core idea is that feedback loops are really bad for shutdown, and that you can refactor them away by splitting a common channel into two.

@gterzian
Owner

gterzian commented May 12, 2020

Yes, I think that's a good design, and as you know, you're not actually shutting down in response to the drop; rather, you're no longer handling further messages from "outside", but you're still draining the workers' channel.

I actually think this shows how business logic gets complicated pretty quickly, and I might actually be in favor of explicit exit messages for that reason alone.

For example, what if you have two exit modes, one in which you want to still drain and handle the worker messages, and one where you don't (maybe draining the workers almost always makes sense, but then you could make some other distinction between exit modes)? The way to model that, in my opinion, would be with two separate exit messages.

So while I think handling the Err(_) is elegant, it might also have a limit in terms of modelling richer business logic.


I think you're totally right that it's usually when a channel is shared between the outside and inside world that the sender is kept alive. I've just had a look in Servo, for example, and the ScriptThread keeps around a sender to itself at https://github.com/servo/servo/blob/44ac62ddda3652e0f6c79bb4dc77ef389645cf37/components/script/script_thread.rs#L579

And this is cloned and passed to various entities you could call "workers" of that component, for example at https://github.com/servo/servo/blob/44ac62ddda3652e0f6c79bb4dc77ef389645cf37/components/script/script_thread.rs#L2385

Actually the doc of the actual port sums it up nicely with:

/// The port on which the constellation and layout threads can communicate with the script thread.

So yes, the problem there is the "constellation and layout threads" part; it would be more consistent to have two separate channels for that, since it's only the constellation that can send the actual exit message.

And I think this is an argument for clearly separating concerns with different channels, not necessarily an argument for always using an Err(_) on a certain channel as an exit message. So yes, if the channels are clearly separated, then usually you'll be able to drop the one from the "outside" world and signal shutdown, and in my opinion you could still do it with an explicit exit message. The important thing is rather the separation of channels, not so much the specific exit signal.


There's another one at https://github.com/servo/servo/blob/44ac62ddda3652e0f6c79bb4dc77ef389645cf37/components/compositing/compositor.rs#L397

This time, instead of sending the Exit message, then setting the state to ShutdownState::ShuttingDown, and then waiting for the shutdown confirmation, I guess we could make self.constellation_chan an Option and set it to None, or model ShutdownState instead as

pub enum ShutdownState {
    NotShuttingDown(Sender<ConstellationMsg>),
    ShuttingDown,
    FinishedShuttingDown,
}

So that when we switch the state to ShuttingDown, the channel is dropped, and we can remove the explicit exit message.


However in both cases, are we really talking about "anti-patterns" when using the exit messages?

I'm not so sure. I might tend in favor of explicit messages, just to make a difference between "the compositor went away" and "we're starting to shut down".

Also, in your second example above, when receiving a worker message you always unwrap it, so clearly you don't see an Err(_) as some sort of shutdown signal to be handled. So I think that's also a limitation of the "exit by dropping" approach: usually there is one "master" channel in the select, and that's the one where a proper exit signal can come from. It's not like any Err(_) on any channel is always a clean shutdown signal. So instead of making a difference between channels, and having some logic in your select where in some cases an Err(_) is an exit signal and in others it isn't, why not just have an explicit exit message? Isn't that actually clearer?

To me, both approaches are really quite similar, and when having lots of channels on all sides I might favor explicit messages just to make it easier for anyone to understand. I don't think that's an anti-pattern in itself that necessarily leads to weird deadlock-type situations like you hinted at in your first example. I do agree that better separating the various channels is better in all cases (and that it usually makes it possible to rely on dropping channels when shutting down).


By the way, I do think joining might be a good idea to apply more systematically in Servo.

For example if you look at when a BHM is started at https://github.com/servo/servo/blob/44ac62ddda3652e0f6c79bb4dc77ef389645cf37/components/background_hang_monitor/background_hang_monitor.rs#L26

So this call actually returns an implementation of another trait, which the caller can store and later use to register stuff for monitoring.

The implementation of the trait contains the sender, and I guess it could also contain the join handle; then dropping it could drop the sender and then join on the worker thread.

We should do this more often, I think. It might solve some issues with some threads still running post-shutdown.

@matklad
Author

matklad commented May 12, 2020

I agree that the above examples are more about the structure of communication than about the shutdown mechanism, but I do think they are relevant. More on that later :-)

I don't buy this argument:

I actually think this shows how business logic gets complicated pretty quickly, and I might actually be in favor of explicit exit messages for that reason alone.

This same reasoning can be applied to the statement: "for all resources, it's better to provide an explicit fn close(self) cleanup function, rather than rely on Drop". It is true that explicit cleanup is more flexible, and it is also more obvious for the reader. Nonetheless, I think it's commonly accepted wisdom that implicit C++-style RAII-based resource management is more robust than explicit Java-style close calls. So I don't see how this argument shows that the default should be an explicit exit message.

With general resources, there are cases where explicit close is better, specifically if it is actually fn close(self) -> io::Result<()>, like, for example, flushing a buffered writer. However, this case is clearly an exception. I might be convinced that, in some very specific circumstances, we genuinely need an explicit Exit message, but I haven't seen concrete examples of this thus far.

I want to argue that drop-based cancellation is a much better default, on the grounds that it throws you into a pit of success. This is why my two examples above are relevant -- with exit messages, you can fix both of them. With drop, you can only write the second example; the first one just won't work.

Drop-based cancellation works only in cases where your communication structure is a tree, while exit messages work with arbitrary graphs. That means that, if you go with a drop-only solution, you guarantee that all components form a tree, which can be trivially torn down. With exit messages, you can create arbitrarily complex structures.

This is just the usual Rust magic, applied to communication. The default architecture of a GC'd application is "a sea of objects"; the default architecture of a Rust program is a "tree of ownership". An arbitrary graph of objects is certainly more powerful, and you can cook it right if you try hard enough. But most Rust programmers agree that restricting yourself to trees does not really reduce the necessary expressiveness most of the time, while simplifying the software considerably :-)

A specific example of the pit of success is how switching to drop just magically made the second step as good as the third one. This was a complete surprise to me; I hadn't planned for this, I just went to the third step after the second one and realized "oops, I've already done this".

Finally, I still maintain that drop has significant, meaningful benefits for abnormal shutdown. If you hit a ? or a panic, you won't send an Exit message, and stuff that waits for it might leak or deadlock. Like, in the first example, if we add an Exit message and don't send it, the worker and the component would be forever locked in a cycle, detached from the rest of the object graph. With drop, you get correct shutdown on errors for free. What's more, you unify the code paths for normal and abnormal termination, which gives you crash-only software, which is a good thing. The fact that the illegal state of sending another message after Shutdown is unrepresentable is another nice touch :-)

@matklad
Author

matklad commented May 12, 2020

Also, in your second example above, when receiving a worker messages you always unwrap it, so clearly you don't see an Err(_) as some sort of shutdown signal to be handled,

So, the .unwrap there is statically unreachable, because we have the corresponding receiver in the same stack frame. So I don't ignore termination here; it's just not possible. It's possible to reshape the code to fold both the pre- and post-shutdown cases into a single loop, but I prefer the structure with two loops, as it more clearly shows the termination phase, where we can't get new work but still have some pending jobs:

let (worker_sender, worker_receiver) = ...;
let mut worker_sender = Some(worker_sender);
let mut outside_receiver = Some(outside_receiver);
let mut worker_receiver = Some(worker_receiver);
while outside_receiver.is_some() || worker_receiver.is_some() {
  select! {
    recv(outside_receiver.as_ref().unwrap_or(&never())) -> msg => match msg {
      Ok(it) => { ... },
      // The outside world is done: stop taking commands, and drop our own
      // worker_sender clone so the worker channel can close once the
      // in-flight workers finish.
      Err(_) => { outside_receiver = None; worker_sender = None; continue; }
    }
    recv(worker_receiver.as_ref().unwrap_or(&never())) -> msg => match msg {
      Ok(it) => { ... },
      Err(_) => { worker_receiver = None; continue; }
    }
  }
}

@gterzian
Owner

gterzian commented May 13, 2020

Yes I think I can actually agree that your approach is probably better.

When I use exit signals, I might be doing something explicitly that can indeed be modeled implicitly using the channels themselves, and this indeed can prevent a whole class of errors in the business logic.

I tried to come up with an example where this would fail, by adding certain constraints to the example, yet so far it's actually surprising how the dropping channel makes it seemingly impossible to get it wrong.

I used to be a big fan of this drop-based approach, and then I somehow moved more towards explicit exit workflows, perhaps thinking those would scale better for complicated flows; however, the drop-based approach actually appears to accommodate arbitrarily complex workflows, and actually makes them more robust.

I'll be looking more at the drop-based approach when designing new workflows going forward. I'll also try to incorporate joining on the threads at shutdown, and I guess unwrapping all sends, when combined with the other two, actually makes a lot of sense too.

Thanks for persistently bringing this to my attention.

Not sure if you want to re-open this and rebase your changes so it can be merged? Otherwise I can just add a link to this discussion in the readme of the repo, so people can see the different approach. In any case, as I write further articles, they will probably highlight the approach you advocated here...

@matklad
Author

matklad commented May 13, 2020

Thanks for listening! I would be very curious to hear how these ideas work out in the context of Servo.

I've tried rebasing, but there were some non-trivial conflicts to fix, so I abandoned that plan :)
