Fix couple of reliability problems in the first example #1
Conversation
*First*, remove explicit `Exit` messages. They are unnecessary, as a channel's closure is a perfectly fine exit message in itself. They are harmful, because they add a separate code path (you need to handle both the channel's closure and the exit event). They are doubly harmful, because these different code paths correspond to normal and abnormal termination. Abnormal termination is never tested in practice, and using a completely separate code path for it is a bad idea, because that's how complex distributed systems fail: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf This also just makes the code shorter, because we can write a `for` loop.

*Second*, make sure that we join the worker thread. That makes sure that we don't leak the thread outside the scope of the test, and that we also propagate its panic, if any. Note that the current solution with an explicit `.join` is still wrong, because the thread won't be joined on panic, but handling that requires more code or a custom crate.

*Third*, unwrap all sends. A send only fails if the receiving side has terminated abnormally. In this case, we should escalate the failure rather than continue business as usual, lest we create a deadlock or a resource leak.
Thanks for this PR, and sorry for having closed it; that was accidental and, I think, a result of force-pushing to the branch.
I disagree. There are cases where you keep a sender around until shut-down. Real-world use case: in Servo, many components keep many senders around (see the constellation for example), sometimes to themselves, in order to pass clones around to new components that are started from time to time. So in those situations, senders will be kept around until shutdown, and explicit exit messages are used instead. I think code can make a difference between the normal "exit" path and the abnormal "the sender was dropped" path, and I think one can test for that too.

I actually made a point of including explicit exit messages in the examples, because I think relying on dropping channels only works in some cases, usually pretty small and focused pieces of code, and that it's not an approach that scales well to larger projects with more complicated workflows (so even though the example at hand is simple, I'd rather make a point by including an exit workflow that I think is a better idea to follow in general).
I don't see that as a big advantage, because in the real world there will be more than one message variant anyway, so you'll have to match on those instead of using a loop on a single message variant.
Maybe, although in this example I don't think it matters much. I think explicit exit workflows can actually ensure the thread isn't "leaked", and also I could have unwrapped the creation of the thread, but I don't care much about the failure case in this example. I would only do it if the point of the exercise was to show how to handle it, and as you point out fully handling this would require some complexity that is out of the scope of this exercise.
Again, I don't think it matters much, and it's not something that you always want to do in the real world either.
I think this will depend on the case at hand, and it's mostly outside the scope of the test. And simply not unwrapping a send, in and of itself, isn't an anti-pattern in my opinion. The constellation in Servo is again an example of a component where we do not want to unwrap on failing sends (although some of those sends are actually over IPC, where failure on the other end is perhaps more recoverable than if it was in the same process). So whether unwrapping a send makes sense or not, I think, depends on the actual use-case.

In conclusion, I think some of your changes might be totally valid in some cases, and in others they might not be. I don't agree that not following your suggestions amounts to anti-patterns in and of itself, and I actually think that, in the context of the current examples, it would unnecessarily complicate matters or reduce the intended focus of the code example.
I think we can agree to disagree :-) I haven't studied Servo's constellation in detail, but I am willing to bet that it uses channels wrong. On the one hand, this is a stupidly bold statement for me to make, as I don't know the use-case well. On the other hand, I have repeatedly had the experience of refactoring convoluted exit-message-based communication into streamlined drop-based communication, and this experience makes me convinced there's some kind of categorical law of nature here (specifically, that observer is the dual of iterator :P): that the drop-based shape of code is the right one. This seems to be correlated with the complexity of the code in question: the more complicated the code is, the bigger the savings from getting rid of Exit messages. My gut feeling is that the "Servo needs exit messages because it is complicated" implication is wrong; I think it's more like "keeping a complex codebase simple is very hard, because complexity just accretes with time". Again, I might be wrong here, but my gut feeling is that it's Servo who is wrong :-) Specifically, I think "senders will be kept around until shutdown, and instead explicit exit messages are used" is wrong. I believe a better solution is either of the two:
I don't have enough free time on my hands to dig into the real Servo code-base, but I would like to look at a smaller toy example which has all the essential complexity.
I can agree that a drop-based exit workflow can be elegant, and I disagree with the statement that all use of explicit exit workflows would be an anti-pattern in and of itself. Having said that, clean shutdown of Servo still is a major challenge. And I also think this challenge is not a result of people having somehow missed the potential of "just exit when senders drop". This approach is actually followed at times, when appropriate; see for example https://github.com/servo/servo/blob/c5617efff0f77ce79a7827668cf929def2d9337a/components/background_hang_monitor/background_hang_monitor.rs#L246

I think I agree with you that a system where the dropping of senders creates a reliable shutdown signal that travels across components is probably optimal, and it's just that in Servo it sometimes doesn't work. There are cases where senders are kept alive, because they need to be cloned and passed to newly spawned components. I guess we could do something that would still allow us to rely on dropping for exiting; however, in these cases I think it might actually be more complicated than an actual exit message. I can't evaluate your suggestion with those "contexts", although they appear to me as client-side concepts, whereas in Servo it's more like components sometimes keeping senders to themselves alive (not client components keeping those alive, although that might also be the case in some places), and of course you are welcome to take a look and provide more concrete suggestions.
I think I've seen this pattern in channel code I was working with, and it indeed took me a while to untangle it. I am not sure what your case is exactly, but mine does sound pretty similar. We have a component that handles messages. The messages can be either commands from the outside world, or responses from the workers spawned by the component itself. So, the component works by draining a receiver, but it also holds a sender for this receiver, and it clones this sender when it spawns a new worker:

```rust
use std::thread;

use crossbeam_channel::{unbounded, Sender};

#[derive(Debug)]
enum ComponentMessage {
    FromOutside,
    FromWorker,
}

struct Component {
    sender: Sender<ComponentMessage>,
}

impl Component {
    fn start() -> (Sender<ComponentMessage>, thread::JoinHandle<()>) {
        let (sender, receiver) = unbounded();
        let thread = thread::spawn({
            let sender = sender.clone();
            move || {
                let component = Component {
                    sender: sender.clone(),
                };
                component.spawn_worker();
                for msg in receiver {
                    println!("{:?}", msg);
                }
            }
        });
        (sender, thread)
    }

    fn spawn_worker(&self) {
        let _thread = thread::spawn({
            let sender = self.sender.clone();
            move || {
                for _ in 0..10 {
                    sender.send(ComponentMessage::FromWorker).unwrap()
                }
            }
        });
    }
}

fn main() {
    let (sender, handle) = Component::start();
    sender.send(ComponentMessage::FromOutside).unwrap();
    drop(sender);
    println!("joining ...");
    handle.join().unwrap();
    println!("... done!"); // nope, will deadlock :-(
}
```

The problem I had with this system is that shutdown was tricky. Basically, the whole system is a giant feedback loop, because receiving a message from the main channel can, indirectly, push new messages into the receiver. Even if you close the incoming-from-the-outside bit of the channel, you still have an internal cycle between the component and the workers, and this cycle can pump messages indefinitely.

You can forcibly stop the feedback loop by just cutting the cycle at an arbitrary point, but that means that in-progress messages would be dropped on the floor, which might be OK, but won't be elegant.

After I realized that the core problem with shutdown is this "feedback loop" effect, I figured out a way to structure the code so that the loop does not arise in the first place. The idea is to use separate channels for communication with the outside world and for communication with the workers:

```rust
use std::thread;

use crossbeam_channel::{select, unbounded, Sender};

#[derive(Debug)]
struct FromOutside;

// Note, these are now private
#[derive(Debug)]
struct FromWorker;

#[derive(Debug)]
enum ComponentMessage {
    FromOutside(FromOutside),
    FromWorker(FromWorker),
}

struct Component {
    sender: Sender<FromWorker>,
}

impl Component {
    fn start() -> (Sender<FromOutside>, thread::JoinHandle<()>) {
        let (sender, receiver) = unbounded();
        let thread = thread::spawn({
            move || {
                let (worker_sender, worker_receiver) = unbounded();
                let component = Component {
                    sender: worker_sender,
                };
                component.spawn_worker();
                loop {
                    let msg = select! {
                        recv(worker_receiver) -> msg => {
                            ComponentMessage::FromWorker(msg.unwrap())
                        }
                        recv(receiver) -> msg => {
                            match msg {
                                Ok(msg) => ComponentMessage::FromOutside(msg),
                                Err(_) => {
                                    drop(receiver);
                                    break
                                },
                            }
                        }
                    };
                    println!("{:?}", msg);
                }
                drop(component);
                for msg in worker_receiver {
                    eprintln!("{:?}", msg)
                }
            }
        });
        (sender, thread)
    }

    fn spawn_worker(&self) {
        let _thread = thread::spawn({
            let sender = self.sender.clone();
            move || {
                for _ in 0..10 {
                    sender.send(FromWorker).unwrap()
                }
            }
        });
    }
}

fn main() {
    let (sender, handle) = Component::start();
    sender.send(FromOutside).unwrap();
    drop(sender);
    println!("joining ...");
    handle.join().unwrap();
    println!("... done!")
}
```

The drop-based cancellation here is a side effect; the core idea is that feedback loops are really bad for shutdown, and that you can refactor them away by splitting a common channel into two.
Yes, I think that's a good design, and as you know, you're not actually shutting down in response to the drop; rather, you're no longer handling further messages from "outside", but you're still draining the workers channel. I actually think this shows how business logic gets complicated pretty quickly, and I might actually be in favor of explicit exit messages for that reason alone. For example, what if you have two exit modes, one in which you want to still drain and handle the worker messages, and one where you don't (maybe draining the workers almost always makes sense, but then you could have some other distinction between exit modes)? The way to model that, in my opinion, would be with two separate exit messages.

I think you're totally right that it's usually when a channel is shared between the outside and inside world that the sender is kept alive. I've just had a look in Servo, for example, and found a sender that is cloned and passed to various entities you could call "workers" of that component, for example at https://github.com/servo/servo/blob/44ac62ddda3652e0f6c79bb4dc77ef389645cf37/components/script/script_thread.rs#L2385 Actually, the doc of the actual port sums it up nicely.

So yes, the problem there is the "constellation and layout threads" part; it would be more consistent to have two separate channels for that, since it's only the constellation that can send the actual exit message. And I think this is an argument for clearly separating concerns with different channels, not necessarily an argument for always using an explicit exit message.

There's another one at https://github.com/servo/servo/blob/44ac62ddda3652e0f6c79bb4dc77ef389645cf37/components/compositing/compositor.rs#L397 This time, instead of sending the `Exit` message and then setting the state to `ShuttingDown`, the state itself could own the sender:

```rust
pub enum ShutdownState {
    NotShuttingDown(Sender<ConstellationMsg>),
    ShuttingDown,
    FinishedShuttingDown,
}
```

So that when we switch the state to `ShuttingDown`, the sender is dropped. However, in both cases, are we really talking about "anti-patterns" when using the exit messages? I'm not so sure; I might tend in favor of explicit messages, just to make a difference between "the compositor went away" and "we're starting to shut down". Also, in your second example above, when receiving a worker message you always unwrap it, so clearly you don't see an `Err` from a worker as something to handle.

To me, both approaches are really quite similar, and when having lots of channels on all sides I might favor explicit messages just to make it easier for anyone to understand. I don't think that's an anti-pattern in itself that necessarily leads to weird deadlock-type situations like you hinted at in your first example. I do agree that better separating the various channels is better in all cases (and that it does usually make it possible to enable dropping channels when shutting down).

By the way, I do think the joining might be a good idea to do more systematically in Servo. For example, look at how a BHM is started at https://github.com/servo/servo/blob/44ac62ddda3652e0f6c79bb4dc77ef389645cf37/components/background_hang_monitor/background_hang_monitor.rs#L26 This call actually returns another trait, one which the caller can store and later use to register stuff for monitoring. The implementation of the trait contains the sender, and I guess it could also contain the join handle, and then dropping it could drop the sender and then join the worker thread. We should do this more often, I think. It might solve some issues with some threads still running post shutdown.
I agree that the above examples are more about the structure of communication, rather than the shutdown mechanism, but I do think they are relevant. More on that later :-)

I don't buy this argument: the same reasoning can be applied to the statement "for all resources, it's better to provide explicit close". With general resources, there are cases where explicit close is better.

I want to argue that drop-based cancellation is a much better default, on the grounds that it throws you into a pit of success. This is why my two above examples are relevant: with exit messages, you can fix both of them. With drop, you can only write the second example; the first one just won't work. Drop-based cancellation works only in cases where your communication structure is a tree, while exit messages work with arbitrary graphs. That means that, if you go with a drop-only solution, you guarantee that all components form a tree, which can be trivially torn down. With exit messages, you can create arbitrarily complex structures.

This is just the usual Rust magic, applied to communication. The default architecture of a GC'd application is "a sea of objects"; the default architecture of a Rust program is a "tree of ownership". An arbitrary graph of objects is certainly more powerful, and you can cook it right if you try hard enough. But most Rust programmers agree that restricting yourself to trees does not really reduce necessary expressiveness most of the time, while simplifying the software considerably :-)

A specific example of the pit of success is how switching to drop just magically made the second step as good as the third one. This was a complete surprise to me; I hadn't planned for this, I just went to the third step after the second one and realized "oops, I've already done this".

Finally, I still maintain that drop has significant meaningful benefits for abnormal shutdown: if a thread panics, its senders and receivers are dropped automatically, so the shutdown signal still propagates, while an explicit exit message would simply never be sent.
So, the main loop could be structured like this (a sketch; the `...` parts are elided, and the receivers are wrapped in `Option` so that a disconnected arm can be disabled via crossbeam's `never()`):

```rust
let (worker_sender, worker_receiver) = ...;
let mut worker_sender = Some(worker_sender);
let mut outside_receiver = Some(&outside_receiver);
let mut worker_receiver = Some(&worker_receiver);
while outside_receiver.is_some() || worker_receiver.is_some() {
    select! {
        recv(outside_receiver.unwrap_or(&never())) -> msg => match msg {
            Ok(it) => { ... },
            Err(_) => { outside_receiver = None; worker_sender = None; continue; }
        }
        recv(worker_receiver.unwrap_or(&never())) -> msg => match msg {
            Ok(it) => { ... },
            Err(_) => { worker_receiver = None; continue; }
        }
    }
}
```
Yes, I think I can actually agree that your approach is probably better. When I use exit signals, I might be doing explicitly something that can indeed be modeled implicitly using the channels themselves, and this can prevent a whole class of errors in the business logic. I tried to come up with an example where this would fail, by adding certain constraints to the example, yet so far it's actually surprising how the dropping channel makes it seemingly impossible to get it wrong.

I used to be a big fan of this drop-based approach, and then I somehow moved more towards explicit exit workflows, perhaps thinking those would scale better for complicated flows; however, the drop-based approach actually appears to allow arbitrarily complex workflows, and actually makes them more robust. I'll be looking more at the drop-based approach when designing new workflows going forward. I'll also try to incorporate joining on the threads at shutdown, and I guess unwrapping all sends, when combined with the other two, actually makes a lot of sense too.

Thanks for persistently bringing this to my attention. Not sure if you want to re-open this and rebase your changes so it can be merged? Otherwise I can just add a link to this discussion in the readme of the repo, so people can see the different approach. In any case, as I write further articles, they will probably highlight the approach you advocated here...
Thanks for listening! I would be very curious to hear how these ideas work in the context of Servo. I've tried rebasing, but there were some non-trivial conflicts to fix, so I abandoned that plan :)