Fix halfway-stopping of listeners #348

Closed
wants to merge 1 commit

Conversation

juhlig
Contributor

juhlig commented Jun 25, 2024

Fixes #347.

The changes in this PR actually do two things:

  • by executing the stopping procedure in a separate process, a crash of the process calling ranch:stop_listener at an unfortunate time (namely, between the supervisor:terminate_child and supervisor:delete_child calls involved) no longer prevents the stopping procedure from running to completion (see the sketch below)
  • by separating the fetching of the transport and transport options from the actual stopping, a listener that was terminated but (for whatever reason) not deleted can now be fully stopped
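
For the first point, a minimal sketch of the idea (not the actual PR code; it assumes the listener sup child id {ranch_listener_sup, Ref} and leaves out the option and transport cleanup):

```erlang
%% Run the two supervisor calls in a separate, unlinked process and wait
%% for it to finish. If the caller crashes in between, the spawned
%% process still runs to completion, so the listener cannot be left
%% half-stopped.
stop_listener_isolated(Ref) ->
    {Pid, MRef} = spawn_monitor(fun() ->
        ok = supervisor:terminate_child(ranch_sup, {ranch_listener_sup, Ref}),
        _ = supervisor:delete_child(ranch_sup, {ranch_listener_sup, Ref})
    end),
    receive
        {'DOWN', MRef, process, Pid, normal} -> ok;
        {'DOWN', MRef, process, Pid, Reason} -> {error, Reason}
    end.
```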

There are no tests yet (and for the former case, I don't see how we could reliably create the scenario of the calling process crashing at just the right time). I want to hear what you think about this approach first before investing more time.

@zuiderkwast
Contributor

Hi Jan!

In #347, @JeppeTh outlined a test case as an Erlang shell session. It involves supervisor:terminate_child to simulate the problem and put ranch in the problematic state.
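
For reference, a rough reconstruction of that kind of session (the listener name, port and protocol module are made up here; the listener sup child id {ranch_listener_sup, Ref} is assumed):

```erlang
%% Start a listener normally.
1> {ok, _} = ranch:start_listener(my_ref, ranch_tcp,
       #{socket_opts => [{port, 5555}]}, echo_protocol, []).
%% Terminate the listener sup directly, without deleting it; this leaves
%% the "halfway-stopped" state this PR is about.
2> ok = supervisor:terminate_child(ranch_sup, {ranch_listener_sup, my_ref}).
%% With the fix, this call should now complete the cleanup.
3> ranch:stop_listener(my_ref).
```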

@juhlig
Contributor Author

juhlig commented Jun 25, 2024

> Hi Jan!

Hi Viktor, long time no see 😅

> In #347, @JeppeTh outlined a test case as an Erlang shell session. It involves supervisor:terminate_child to simulate the problem and put ranch in the problematic state.

Yes, that would be a blueprint for a test case for my second point, when a listener already is in the halfway-stopped state.

The hard-to-test part revolves around my first point, which should make it impossible to even put a listener in that state by "normal means", i.e. by crashing at unfortunate times =^^=

@juhlig
Contributor Author

juhlig commented Jun 27, 2024

@essen? WDYT of the approach I used and the PR in general?

@essen
Member

essen commented Jun 28, 2024

Why isn't this done in ranch_server instead?

@juhlig
Contributor Author

juhlig commented Jul 2, 2024

> Why isn't this done in ranch_server instead?

Sure, could be done, not a bad idea :)
It would be a first, though: up to now ranch_server only does some bookkeeping on the listeners, i.e. it doesn't really do anything itself.

I'll make a draft to show to you, maybe tomorrow.

@juhlig
Contributor Author

juhlig commented Jul 2, 2024

> I'll make a draft to show to you, maybe tomorrow.

... or maybe today =^^= WDYT?

@essen
Member

essen commented Jul 2, 2024

I would not move the stop logic to ranch_server but would instead add a way for it to execute a callback and return whatever the result is. That way we don't need to worry about the details in ranch_server and it can be reused in the future if necessary. But mostly to keep the stop code all in one place in ranch really.

Something like gen_server:call({rpc, M, F, A}). To be honest I wonder if erpc or something else can be used instead.
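
For illustration only (this is not the actual ranch_server API), the shape of such a call could look roughly like this:

```erlang
%% In ranch_server, a clause along these lines would apply the MFA in
%% the ranch_server process and reply with whatever it returns:
handle_call({rpc, M, F, A}, _From, State) ->
    {reply, apply(M, F, A), State};

%% Caller side, e.g. from ranch:stop_listener/1:
%%   gen_server:call(ranch_server, {rpc, Module, Function, Args}).
```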

@juhlig
Contributor Author

juhlig commented Jul 3, 2024

Ah, ok, so you mean leave the stopping code in the ranch module, but run it in the ranch_server process, instead of in a process spawned from the process calling stop_listener, correct?

The way you propose would enable anyone to run any code in the ranch_server process; I'm not sure that is a good thing, it gives me an uneasy feeling though 🤷‍♂️
At the least, it would enable crashing the ranch_server process (like {rpc, erlang, exit, [foo]}). We could wrap the apply in a try...catch to defend against that.
The given MFArgs could also stall ranch_server (like {rpc, timer, sleep, [infinity]}). We could run the given MFArgs in a spawned process to defend against that, but then we could just as well spawn from the process calling ranch:stop_listener?
(Back in the good old days of IRC, I remember you saying "if users want to shoot themselves in the foot we let them" or sth like that, so I'm just pointing it out.)
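
A sketch of the try...catch defence mentioned above (illustrative, building on the hypothetical clause from the previous comment):

```erlang
%% A crashing MFA now produces an error reply instead of taking
%% ranch_server down with it.
handle_call({rpc, M, F, A}, _From, State) ->
    Reply = try
        {ok, apply(M, F, A)}
    catch Class:Reason ->
        {error, {Class, Reason}}
    end,
    {reply, Reply, State};
```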

erpc, by which I think you mean specifically erpc:call, hm. In a nutshell it does a fancy spawn, which would make it a viable alternative to the ranch_server rpc thing you proposed.
On closer inspection, however, erpc:call takes a shortcut when called with the local node as the first argument: in that case it does a plain apply, running the given MFArgs in the context of the calling process. That leaves us with the same problem we have now: if the calling process exits at an awkward point (i.e. between the applied MFArgs calling terminate_child and delete_child), we end up in the same scenario of a halfway-stopped listener.
(I think this behavior can actually be called a bug; I'll probably open an issue with OTP about it later.)
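
A quick way to probe the behaviour on a given OTP release (it makes no assumption about the outcome):

```erlang
%% true if the MFA was applied in the calling process (the shortcut
%% described above), false if erpc ran it in another process.
Shortcut = (erpc:call(node(), erlang, self, []) =:= self()).
```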

@essen
Member

essen commented Jul 3, 2024

> Ah, ok, so you mean leave the stopping code in the ranch module, but run it in the ranch_server process, instead of in a process spawned from the process calling stop_listener, correct?

Yes.

> The way you propose would enable anyone to run any code in the ranch_server process; I'm not sure that is a good thing, it gives me an uneasy feeling though 🤷‍♂️

We can restrict the M and F to ranch and do_stop_listener or whatever the name is, for now.

@zuiderkwast
Contributor

@essen What's the point of this extra abstraction? To me it looks like over-engineering.

Why not just do the simple protection (or workaround, if you consider it an OTP bug) for the fact that supervisor:terminate_child and supervisor:delete_child can't be called together as an atomic unit?

@essen
Member

essen commented Jul 3, 2024

We need to do it in a process that isn't the caller; it can either be a throwaway process that we spawn specifically for this, or ranch_server, which already exists. Better the latter, as it's easy to find, trace, debug and whatever else, and we don't have to reimplement a call mechanism + monitors etc. like we would for spawn. It's not fire and forget; we have to wait for the listener to be stopped before returning.

Not sure which part is over engineering.

@essen
Member

essen commented Jul 3, 2024

If erpc applies its optimisation only for the infinity timeout, then it could be used? Maybe with a large timeout, larger than whatever the calls in stop_listener may take.

@zuiderkwast
Contributor

Thanks, makes sense.

@juhlig force-pushed the fix_half-stopped_listeners branch from 1c3b4f7 to de6d4b6 on July 4, 2024 07:45
@juhlig force-pushed the fix_half-stopped_listeners branch from de6d4b6 to a2c959e on July 4, 2024 08:35
@essen
Member

essen commented Jul 4, 2024

Looks good, thanks! I'm off for the next two weeks but I'll look into merging and releasing a new version when I get back.

@juhlig
Contributor Author

juhlig commented Jul 4, 2024

> Looks good, thanks! I'm off for the next two weeks but I'll look into merging and releasing a new version when I get back.

Have a nice vacation then 👋

@juhlig
Contributor Author

juhlig commented Oct 21, 2024

@essen what about this PR? 3+ months have passed 😅

@essen
Member

essen commented Nov 6, 2024

I have updated the CI, please force push to run CI again.

@juhlig force-pushed the fix_half-stopped_listeners branch 2 times, most recently from b90878a to 70e0dc4 on November 7, 2024 08:42
@juhlig
Contributor Author

juhlig commented Nov 7, 2024

@essen Hm. Looks like we have a race condition there now. It should be of little to no consequence in reality, but it messes up the tests.

With the rewrite of stop_listener, we removed the explicit call to ranch_server:cleanup_listener_opts, because the exit of the listener sup will be noticed by ranch_server via the monitor, and it will then do the cleanup automatically.

At the end of most tests, we call ranch:stop_listener and then verify the result with a call to ranch:get_port, which in essence tries to fetch the record from ranch_server and is expected to fail. However, if ranch:get_port is called before the 'DOWN' message from the monitor has reached ranch_server, the record is still there and the ranch:get_port call unexpectedly succeeds.

So, I could put back the explicit ranch_server:cleanup_listener_opts call; that would fix the tests immediately. But I'd rather not, and instead just let the monitor do its stuff. To fix up the tests for that, instead of the immediate check for non-existence of the record, we would have to put a mechanism in place that retries a few times (i.e. if the record is still there, wait ~100ms and try again; it should disappear soon), like the sketch below.
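
A sketch of such a retry helper (hypothetical, not part of the PR):

```erlang
%% Probe via ranch:get_port/1 until the lookup fails (the record is
%% gone) or we run out of attempts, waiting 100ms between attempts.
wait_for_cleanup(_Ref, 0) ->
    error(listener_opts_not_cleaned_up);
wait_for_cleanup(Ref, Retries) ->
    try ranch:get_port(Ref) of
        _Port ->
            timer:sleep(100),
            wait_for_cleanup(Ref, Retries - 1)
    catch
        _:_ ->
            ok
    end.
```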

WDYT?

@essen
Member

essen commented Nov 7, 2024

Can we make ranch:stop_listener not return until it has all been processed instead?

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

> Can we make ranch:stop_listener not return until it has all been processed instead?

Maybe... This could be done in 3 ways, as I see it:

  • call ranch_server:cleanup_listener_opts explicitly in the stop routine, as before (easiest)
  • put a check-and-retry loop, as described above, in the stop routine (kinda ugly)
  • let ranch_server do all the stopping, i.e. terminating and deleting the listener, followed by the option cleanup and the transport cleanup (kinda complex); the transport cleanup is a bit problematic: ranch_tcp and ranch_ssl only remove the socket file if it was a local socket, but custom transports may do more, during which time ranch_server would be blocked?

@essen
Member

essen commented Nov 7, 2024

I would do the following:

  • Update the tests so that we just stop the listener with nothing after
  • Create a test (if we don't have one already) that starts and stops the same listener in a loop (to make sure this late cleanup doesn't mess things up; if it does then we need stop_listener to wait before it returns)
  • Create a test that will stop the listener and check that the data is eventually gone (cleanup happened), perhaps with a wait limit of 5s (large enough for CI)

Then depending on what the tests tell us we may need to do more.
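
For the loop test in the second bullet, a rough sketch (the listener name, port and protocol module are illustrative):

```erlang
%% Start and stop the same listener many times in a row; if the late
%% cleanup races the next start, the {ok, _} match fails and so does
%% the test.
repeated_stop_start(_Config) ->
    lists:foreach(fun(_) ->
        {ok, _} = ranch:start_listener(loop_ref, ranch_tcp,
            #{socket_opts => [{port, 7777}]}, echo_protocol, []),
        ok = ranch:stop_listener(loop_ref)
    end, lists:seq(1, 100)).
```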

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

Looking at the code, the late cleanup is likely problematic in a tight stop/start loop.

  • stop_listener is called
  • the listener sup is terminated and deleted
  • start_listener is called
  • as the listener sup is gone, a new one will start; it will try to set new listener options in ranch_server, but since this is done via insert_new, the old ones (silently) remain
  • the 'DOWN' message caused by the listener sup exit reaches and is processed by ranch_server, which will remove the options
  • --> we have a running listener which is not present in ranch_server, or only partially present if the cleanup is intermixed with set_listener_sup and other such calls which happen at different points during listener startup

An explicit call to cleanup_listener_opts also contains a race condition where, if stop_listener and start_listener are called from different processes at about the same time, we could end up with a running listener but no options in ranch_server.

  • process A calls stop_listener
  • the listener sup is terminated and deleted
  • the 'DOWN' message reaches ranch_server which removes the options
  • process B calls start_listener
  • the listener is started, and the options are set in ranch_server
  • process A gets to the point where it calls cleanup_listener_opts as part of the stop routine
  • the options get removed from ranch_server
  • --> we have a running listener with no options in ranch_server

A check-wait-retry loop also contains a race condition where, if stop_listener and start_listener are called from different processes at around the same time, we could end up with the stop_listener call never returning.

  • process A calls stop_listener
  • the listener sup is terminated and deleted
  • it checks with get_port (or sth) to see if the options were removed; as the 'DOWN' message has not reached ranch_server yet, it waits for ~100ms before trying again
  • the 'DOWN' message reaches ranch_server which removes the options
  • process B calls start_listener
  • the listener is started, and the options are set in ranch_server
  • process A finishes waiting and checks again if the options were removed; they were, but then put back when process B started the listener again, so it again finds options and again waits, and then again and again etc
  • --> stop_listener never returns

🤷‍♂️

@essen
Member

essen commented Nov 7, 2024

It's OK that there are race conditions when start and stop are called by different processes. I don't think we should support more than "stop then start" from within the same process. It's up to the user to synchronize if they need more.

@essen
Member

essen commented Nov 7, 2024

If the explicit cleanup allows tight stop/start loops to work, then let's go with this.

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

I've just been thinking... if we explicitly call cleanup_listener_opts after stopping but before deleting the listener sup, we should be fine. As long as the listener sup is "there", another one with the same id can't be started, because of {error, already_present}. Am I missing something?
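
A simplified sketch of that ordering (assuming the child id {ranch_listener_sup, Ref}; transport cleanup and error handling omitted):

```erlang
%% While the terminated child is still registered under ranch_sup, a
%% concurrent start of a listener with the same Ref is rejected with
%% {error, already_present}, so the explicit cleanup cannot race it.
stop_sequence(Ref) ->
    ok = supervisor:terminate_child(ranch_sup, {ranch_listener_sup, Ref}),
    ranch_server:cleanup_listener_opts(Ref),
    _ = supervisor:delete_child(ranch_sup, {ranch_listener_sup, Ref}),
    ok.
```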

@essen
Member

essen commented Nov 7, 2024

Not sure if something can go wrong when stopping. But yes we could swap the order we do things if it makes things better.

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

Pushed a new commit with that.

I don't think anything much could go wrong. For one, we are running the stop procedure in an isolated process now, so whatever happens to the process calling stop_listener is already out of the picture. As for the rest, if something goes wrong in there, either the listener sup is not stopped at all (like, if supervisor:terminate_child somehow fails), or we are left with a stopped listener sup which was not removed (but the options will be cleaned up, either by the explicit call or via the monitor).

@essen
Member

essen commented Nov 7, 2024

The start/stop loop test I mentioned would also be good to have if we don't already have it.

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

> The start/stop loop test I mentioned would also be good to have if we don't already have it.

I'll have to see to that later =^^= Which means, probably next week if I can't get back to it today.

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

There :)

@essen
Member

essen commented Nov 8, 2024

I think it looks good. Please rebase and squash and we can merge. If there are hard macOS failures in CI when you do this it's because the runner image is being changed and this makes the cached OTP build invalid for that environment.

* if the process calling ranch:stop_listener crashes before finishing,
  the stopping procedure is still executed completely
* if a listener is terminated but not deleted, calling ranch:stop_listener
  removes the remnant
@juhlig
Contributor Author

juhlig commented Nov 9, 2024

Done 👍

@juhlig
Contributor Author

juhlig commented Nov 9, 2024

There is another failure now, macOS again, not sure what this one is about. Seems unrelated to changes in this PR, anyway 🤷‍♂️

@essen
Member

essen commented Nov 12, 2024

Merged, thanks!

Successfully merging this pull request may close these issues.

Restart issues found when upgrading 1.8 -> 2.1.0