Fix halfway-stopping of listeners #348

Closed
wants to merge 1 commit

Conversation

juhlig
Contributor

juhlig commented Jun 25, 2024

Fixes #347.

The changes in this PR actually do two things:

  • by executing the stopping procedure in a separate process, a crash of the process calling ranch:stop_listener at an unfortunate time (namely, between the supervisor:terminate_child and supervisor:delete_child calls involved) no longer prevents the stopping procedure from running to completion (see the sketch below)
  • by separating the fetching of the transport and transport options from the actual stopping, a listener that was terminated but (for whatever reason) not deleted can now be fully stopped
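
For the first point, a minimal sketch of the idea (not the actual PR code; it assumes the listener sup child id {ranch_listener_sup, Ref} and leaves out the option and transport cleanup):

```erlang
%% Run the two supervisor calls in a separate, unlinked process and wait
%% for it to finish. If the caller crashes in between, the spawned
%% process still runs to completion, so the listener cannot be left
%% half-stopped.
stop_listener_isolated(Ref) ->
    {Pid, MRef} = spawn_monitor(fun() ->
        ok = supervisor:terminate_child(ranch_sup, {ranch_listener_sup, Ref}),
        _ = supervisor:delete_child(ranch_sup, {ranch_listener_sup, Ref})
    end),
    receive
        {'DOWN', MRef, process, Pid, normal} -> ok;
        {'DOWN', MRef, process, Pid, Reason} -> {error, Reason}
    end.
```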

There are no tests yet (and for the former case, I don't see how we could reliably create the scenario of the calling process crashing at just the right time). I want to hear what you think about this approach first before investing more time.

@zuiderkwast
Contributor

Hi Jan!

In #347, @JeppeTh outlined a test case as an Erlang shell session. It involves supervisor:terminate_child to simulate the problem and put ranch in the problematic state.
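
For reference, a rough reconstruction of that kind of session (the listener name, port and protocol module are made up here; the listener sup child id {ranch_listener_sup, Ref} is assumed):

```erlang
%% Start a listener normally.
1> {ok, _} = ranch:start_listener(my_ref, ranch_tcp,
       #{socket_opts => [{port, 5555}]}, echo_protocol, []).
%% Terminate the listener sup directly, without deleting it; this leaves
%% the "halfway-stopped" state this PR is about.
2> ok = supervisor:terminate_child(ranch_sup, {ranch_listener_sup, my_ref}).
%% With the fix, this call should now complete the cleanup.
3> ranch:stop_listener(my_ref).
```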

@juhlig
Contributor Author

juhlig commented Jun 25, 2024

> Hi Jan!

Hi Viktor, long time no see 😅

> In #347, @JeppeTh outlined a test case as an Erlang shell session. It involves supervisor:terminate_child to simulate the problem and put ranch in the problematic state.

Yes, that would be a blueprint for a test case for my second point, when a listener already is in the halfway-stopped state.

The hard-to-test part revolves around my first point, which should make it impossible to even put a listener in that state by "normal means", i.e. by crashing at unfortunate times =^^=

@juhlig
Contributor Author

juhlig commented Jun 27, 2024

@essen? WDYT of the approach I used and the PR in general?

@essen
Member

essen commented Jun 28, 2024

Why isn't this done in ranch_server instead?

@juhlig
Contributor Author

juhlig commented Jul 2, 2024

> Why isn't this done in ranch_server instead?

Sure, could be done, not a bad idea :)
It would be a first, though: up to now ranch_server only does some bookkeeping on the listeners, i.e. it doesn't really do anything itself.

I'll make a draft to show to you, maybe tomorrow.

@juhlig
Contributor Author

juhlig commented Jul 2, 2024

> I'll make a draft to show to you, maybe tomorrow.

... or maybe today =^^= WDYT?

@essen
Member

essen commented Jul 2, 2024

I would not move the stop logic to ranch_server but would instead add a way for it to execute a callback and return whatever the result is. That way we don't need to worry about the details in ranch_server and it can be reused in the future if necessary. But mostly to keep the stop code all in one place in ranch really.

Something like gen_server:call({rpc, M, F, A}). To be honest I wonder if erpc or something else can be used instead.
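
For illustration only (this is not the actual ranch_server API), the shape of such a call could look roughly like this:

```erlang
%% In ranch_server, a clause along these lines would apply the MFA in
%% the ranch_server process and reply with whatever it returns:
handle_call({rpc, M, F, A}, _From, State) ->
    {reply, apply(M, F, A), State};

%% Caller side, e.g. from ranch:stop_listener/1:
%%   gen_server:call(ranch_server, {rpc, Module, Function, Args}).
```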

@juhlig
Contributor Author

juhlig commented Jul 3, 2024

Ah, ok, so you mean leave the stopping code in the ranch module, but run it in the ranch_server process, instead of in a process spawned from the process calling stop_listener, correct?

The way you propose would enable anyone to run any code in the ranch_server process; I'm not sure that is a good thing, it gives me an uneasy feeling though 🤷‍♂️
At the least, it would enable crashing the ranch_server process (like {rpc, erlang, exit, [foo]}). We could wrap the apply in a try...catch to defend against that.
The given MFArgs could also stall ranch_server (like {rpc, timer, sleep, [infinity]}). We could run the given MFArgs in a spawned process to defend against that, but then we could just as well spawn from the process calling ranch:stop_listener?
(Back in the good old days of IRC, I remember you saying "if users want to shoot themselves in the foot we let them" or sth like that, so I'm just pointing it out.)
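
A sketch of the try...catch defence mentioned above (illustrative, building on the hypothetical clause from the previous comment):

```erlang
%% A crashing MFA now produces an error reply instead of taking
%% ranch_server down with it.
handle_call({rpc, M, F, A}, _From, State) ->
    Reply = try
        {ok, apply(M, F, A)}
    catch Class:Reason ->
        {error, {Class, Reason}}
    end,
    {reply, Reply, State};
```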

erpc, by which I think you mean specifically erpc:call, hm. In a nutshell it does a fancy spawn, which would make it a viable alternative to the ranch_server rpc thing you proposed.
On closer inspection, however, erpc:call takes a shortcut when called with the local node as the first argument: in that case it does a plain apply, running the given MFArgs in the context of the calling process. That leaves us with the same problem we have now: if the calling process exits at an awkward point (i.e. between the applied MFArgs calling terminate_child and delete_child), we end up in the same scenario of a halfway-stopped listener.
(I think this behavior can actually be called a bug; I'll probably open an issue with OTP about it later.)
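
A quick way to probe the behaviour on a given OTP release (it makes no assumption about the outcome):

```erlang
%% true if the MFA was applied in the calling process (the shortcut
%% described above), false if erpc ran it in another process.
Shortcut = (erpc:call(node(), erlang, self, []) =:= self()).
```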

@essen
Member

essen commented Jul 3, 2024

> Ah, ok, so you mean leave the stopping code in the ranch module, but run it in the ranch_server process, instead of in a process spawned from the process calling stop_listener, correct?

Yes.

> The way you propose would enable anyone to run any code in the ranch_server process; I'm not sure that is a good thing, it gives me an uneasy feeling though 🤷‍♂️

We can restrict the M and F to ranch and do_stop_listener or whatever the name is, for now.

@zuiderkwast
Contributor

@essen What's the point of this extra abstraction? To me it looks like over-engineering.

Why not just do the simple protection (or workaround, if you consider it an OTP bug) for the fact that supervisor:terminate_child and supervisor:delete_child can't be called together as an atomic unit?

@essen
Member

essen commented Jul 3, 2024

We need to do it in a process that isn't the caller; it can either be a throwaway process that we spawn specifically for this, or ranch_server, which already exists. Better the latter, as it's easy to find, trace, debug and whatever else, and we don't have to reimplement a call mechanism + monitors etc. like we would for spawn. It's not fire and forget; we have to wait for the listener to be stopped before returning.

Not sure which part is over engineering.

@essen
Member

essen commented Jul 3, 2024

If erpc applies its optimisation only for the infinity timeout, then it could be used? Maybe with a large timeout, larger than whatever the calls in stop_listener may take.

@zuiderkwast
Contributor

Thanks, makes sense.

@juhlig force-pushed the fix_half-stopped_listeners branch from 1c3b4f7 to de6d4b6 on July 4, 2024 07:45
@juhlig force-pushed the fix_half-stopped_listeners branch from de6d4b6 to a2c959e on July 4, 2024 08:35
@essen
Member

essen commented Jul 4, 2024

Looks good, thanks! I'm off for the next two weeks but I'll look into merging and releasing a new version when I get back.

@juhlig
Contributor Author

juhlig commented Jul 4, 2024

> Looks good, thanks! I'm off for the next two weeks but I'll look into merging and releasing a new version when I get back.

Have a nice vacation then 👋

@juhlig
Contributor Author

juhlig commented Oct 21, 2024

@essen what about this PR? 3+ months have passed 😅

@essen
Member

essen commented Nov 6, 2024

I have updated the CI, please force push to run CI again.

@juhlig force-pushed the fix_half-stopped_listeners branch 2 times, most recently from b90878a to 70e0dc4 on November 7, 2024 08:42
@juhlig
Contributor Author

juhlig commented Nov 7, 2024

@essen Hm. Looks like we have a race condition there now. It should be of little to no consequence in reality, but it messes up the tests.

With the rewrite of stop_listener, we removed the explicit call to ranch_server:cleanup_listener_opts, because the exit of the listener sup will be noticed by ranch_server via the monitor, and it will then do the cleanup automatically.

At the end of most tests, we call ranch:stop_listener and then verify the result with a call to ranch:get_port, which in essence tries to fetch the record from ranch_server and is expected to fail. However, if ranch:get_port is called before the 'DOWN' message from the monitor has reached ranch_server, the record is still there and the ranch:get_port call unexpectedly succeeds.

So, I could put back the explicit ranch_server:cleanup_listener_opts call; that would fix the tests immediately. But I'd rather not, and instead just let the monitor do its stuff. To fix up the tests for that, instead of the immediate check for non-existence of the record, we would have to put a mechanism in place that retries a few times (i.e. if the record is still there, wait ~100ms and try again; it should disappear soon), like the sketch below.
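
A sketch of such a retry helper (hypothetical, not part of the PR):

```erlang
%% Probe via ranch:get_port/1 until the lookup fails (the record is
%% gone) or we run out of attempts, waiting 100ms between attempts.
wait_for_cleanup(_Ref, 0) ->
    error(listener_opts_not_cleaned_up);
wait_for_cleanup(Ref, Retries) ->
    try ranch:get_port(Ref) of
        _Port ->
            timer:sleep(100),
            wait_for_cleanup(Ref, Retries - 1)
    catch
        _:_ ->
            ok
    end.
```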

WDYT?

@essen
Member

essen commented Nov 7, 2024

Can we make ranch:stop_listener not return until it has all been processed instead?

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

> Can we make ranch:stop_listener not return until it has all been processed instead?

Maybe... This could be done in 3 ways, as I see it:

  • call ranch_server:cleanup_listener_opts explicitly in the stop routine, as before (easiest)
  • put a check-and-retry loop, as described above, in the stop routine (kinda ugly)
  • let ranch_server do all the stopping, i.e. terminating and deleting the listener, followed by the option cleanup and the transport cleanup (kinda complex); the transport cleanup is a bit problematic: ranch_tcp and ranch_ssl only remove the socket file if it was a local socket, but custom transports may do more, during which time ranch_server would be blocked?

@essen
Member

essen commented Nov 7, 2024

I would do the following:

  • Update the tests so that we just stop the listener with nothing after
  • Create a test (if we don't have one already) that starts and stops the same listener in a loop (to make sure this late cleanup doesn't mess things up; if it does then we need stop_listener to wait before it returns)
  • Create a test that will stop the listener and check that the data is eventually gone (cleanup happened), perhaps with a wait limit of 5s (large enough for CI)

Then depending on what the tests tell us we may need to do more.
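
For the loop test in the second bullet, a rough sketch (the listener name, port and protocol module are illustrative):

```erlang
%% Start and stop the same listener many times in a row; if the late
%% cleanup races the next start, the {ok, _} match fails and so does
%% the test.
repeated_stop_start(_Config) ->
    lists:foreach(fun(_) ->
        {ok, _} = ranch:start_listener(loop_ref, ranch_tcp,
            #{socket_opts => [{port, 7777}]}, echo_protocol, []),
        ok = ranch:stop_listener(loop_ref)
    end, lists:seq(1, 100)).
```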

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

Looking at the code, the late cleanup is likely problematic in a tight stop/start loop.

  • stop_listener is called
  • the listener sup is terminated and deleted
  • start_listener is called
  • as the listener sup is gone, a new one will start; it will try to set new listener options in ranch_server, but since this is done via insert_new, the old ones (silently) remain
  • the 'DOWN' message caused by the listener sup exit reaches and is processed by ranch_server, which will remove the options
  • --> we have a running listener which is not present in ranch_server, or only partially present if the cleanup is intermixed with set_listener_sup and other such calls which happen at different points during listener startup

An explicit call to cleanup_listener_opts also contains a race condition where, if stop_listener and start_listener are called from different processes at about the same time, we could end up with a running listener but no options in ranch_server.

  • process A calls stop_listener
  • the listener sup is terminated and deleted
  • the 'DOWN' message reaches ranch_server which removes the options
  • process B calls start_listener
  • the listener is started, and the options are set in ranch_server
  • process A gets to the point where it calls cleanup_listener_opts as part of the stop routine
  • the options get removed from ranch_server
  • --> we have a running listener with no options in ranch_server

A check-wait-retry loop also contains a race condition where, if stop_listener and start_listener are called from different processes at around the same time, we could end up with the stop_listener call never returning.

  • process A calls stop_listener
  • the listener sup is terminated and deleted
  • it checks with get_port (or sth) to see if the options were removed; as the 'DOWN' message has not reached ranch_server yet, it waits for ~100ms before trying again
  • the 'DOWN' message reaches ranch_server which removes the options
  • process B calls start_listener
  • the listener is started, and the options are set in ranch_server
  • process A finishes waiting and checks again if the options were removed; they were, but then put back when process B started the listener again, so it again finds options and again waits, and then again and again etc
  • --> stop_listener never returns

🤷‍♂️

@essen
Member

essen commented Nov 7, 2024

It's OK that there are race conditions when start and stop are called by different processes. I don't think we should support more than "stop then start" from within the same process. It's up to the user to synchronize if they need more.

@essen
Member

essen commented Nov 7, 2024

If the explicit cleanup allows tight stop/start loops to work, then let's go with this.

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

I've just been thinking... if we explicitly call cleanup_listener_opts after stopping but before deleting the listener sup, we should be fine. As long as the listener sup is "there", another one with the same id can't be started, because of {error, already_present}. Am I missing something?
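
A simplified sketch of that ordering (assuming the child id {ranch_listener_sup, Ref}; transport cleanup and error handling omitted):

```erlang
%% While the terminated child is still registered under ranch_sup, a
%% concurrent start of a listener with the same Ref is rejected with
%% {error, already_present}, so the explicit cleanup cannot race it.
stop_sequence(Ref) ->
    ok = supervisor:terminate_child(ranch_sup, {ranch_listener_sup, Ref}),
    ranch_server:cleanup_listener_opts(Ref),
    _ = supervisor:delete_child(ranch_sup, {ranch_listener_sup, Ref}),
    ok.
```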

@essen
Member

essen commented Nov 7, 2024

Not sure if something can go wrong when stopping. But yes we could swap the order we do things if it makes things better.

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

Pushed a new commit with that.

I don't think anything much could go wrong. For one, we are running the stop procedure in an isolated process now, so whatever happens to the process calling stop_listener is already out of the picture. As for the rest, if something goes wrong in there, either the listener sup is not stopped at all (like, if supervisor:terminate_child somehow fails), or we are left with a stopped listener sup which was not removed (but the options will be cleaned up, either by the explicit call or via the monitor).

@essen
Member

essen commented Nov 7, 2024

The start/stop loop test I mentioned would also be good to have if we don't already have it.

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

> The start/stop loop test I mentioned would also be good to have if we don't already have it.

I'll have to see to that later =^^= Which means, probably next week if I can't get back to it today.

@juhlig
Contributor Author

juhlig commented Nov 7, 2024

There :)

@essen
Member

essen commented Nov 8, 2024

I think it looks good. Please rebase and squash and we can merge. If there are hard macOS failures in CI when you do this it's because the runner image is being changed and this makes the cached OTP build invalid for that environment.

* if the process calling ranch:stop_listener crashes before finishing,
  the stopping procedure is still executed completely
* if a listener is terminated but not deleted, calling ranch:stop_listener
  removes the remnant
@juhlig
Contributor Author

juhlig commented Nov 9, 2024

Done 👍

@juhlig
Contributor Author

juhlig commented Nov 9, 2024

There is another failure now, macOS again, not sure what this one is about. Seems unrelated to changes in this PR, anyway 🤷‍♂️

@essen
Member

essen commented Nov 12, 2024

Merged, thanks!

Successfully merging this pull request may close these issues.

Restart issues found when upgrading 1.8 -> 2.1.0