Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dream server process consumes 100% CPU after websocket client is closed #230

Open
barko opened this issue Jun 16, 2022 · 14 comments
Open

Dream server process consumes 100% CPU after websocket client is closed #230

barko opened this issue Jun 16, 2022 · 14 comments

Comments

@barko
Copy link

barko commented Jun 16, 2022

I'm experimenting with websockets in Dream.

With this code, I've observed problematic behavior on version 1.0.0~alpha4 (and OCaml 4.14.0, Ubuntu 20.04), installed via opam. Initially, the program behaves as expected, with integers printed to the browser's JS console. I expect that once the browser tab is closed, Dream.send would raise an exception. Instead, the server process starts consuming 100% CPU.

@novemberkilo
Copy link

We tried out using websockets on ocaml-ci with 1.0.0~alpha5 and are running into the same issue.

@aantron
Copy link
Owner

aantron commented Apr 7, 2023

I will take a look shortly, thank you!

@aantron aantron added this to the alpha6 milestone Apr 17, 2023
@aantron aantron self-assigned this Apr 25, 2023
@aantron
Copy link
Owner

aantron commented Apr 26, 2023

This appears to be an issue upstream in Websocket/af and/or Gluten. I used GDB to get a backtrace and added two prints before calls to the deepest functions that were in Dream, Websocket/af, or Gluten.

In Dream's src/http/shared/websocket.ml:

diff --git a/src/http/shared/websocket.ml b/src/http/shared/websocket.ml
index 358d577..eec9b6e 100644
--- a/src/http/shared/websocket.ml
+++ b/src/http/shared/websocket.ml
@@ -196,6 +196,7 @@ let websocket_handler stream socket =
         if !closed then
           close !close_code
         else begin
+          prerr_endline "Dream websocket.ml: calling Websocketaf.Wsd.schedule";
           Websocketaf.Wsd.schedule socket ~kind buffer ~off:offset ~len:length;
           bytes_since_flush := !bytes_since_flush + length;
           if !bytes_since_flush >= 4096 then

This print is triggered a finite number of times before entering the infinite loop.

In src/vendor/gluten/lwt/gluten_lwt.ml:

diff --git a/lwt/gluten_lwt.ml b/lwt/gluten_lwt.ml
index 661b50a..6037f9a 100644
--- a/lwt/gluten_lwt.ml
+++ b/lwt/gluten_lwt.ml
@@ -39,6 +39,8 @@ open Lwt.Infix
 module Buffer = Gluten.Buffer
 include Gluten_lwt_intf
 
+let counter = ref 0
+
 module IO_loop = struct
   let start :
       type t fd.
@@ -85,6 +87,8 @@ module IO_loop = struct
       let rec write_loop_step () =
         match Runtime.next_write_operation t with
         | `Write io_vectors ->
+          Printf.eprintf "Gluten_lwt: writev_io_vectors: %i\n%!" !counter;
+          incr counter;
           writev io_vectors >>= fun result ->
           Runtime.report_write_result t result;
           write_loop_step ()

This second print in Gluten is triggered indefinitely, running up the counter. The output looks as below. The two requests are me opening a browser tab to http://localhost:8080 and then it creating a WebSocket. I close the tab manually around Gluten_lwt: writev_io_vectors: 5, but these messages continue indefinitely, counting up, without any more prints from the call site in Dream. The stack trace continues to show the same call site in Dream's websocket.ml, so this appears to be an infinite loop inside Websocket/af or Gluten.

@anmonteiro, would you be able to comment on this?

$ dune exec ./www.exe

Entering directory '/home/antron/code/dream'
Done: 99% (847/848, 1 left) (jobs: 1)  
26.04.23 12:15:43.486                       Running at http://localhost:8080
26.04.23 12:15:43.486                       Type Ctrl+C to stop
26.04.23 12:15:45.378    dream.logger  INFO REQ 1 GET / 127.0.0.1:47496
26.04.23 12:15:45.378    dream.logger  INFO REQ 1 200 in 105 μs
Gluten_lwt: writev_io_vectors: 0
26.04.23 12:15:45.717    dream.logger  INFO REQ 2 GET /websocket 127.0.0.1:47510
26.04.23 12:15:45.717    dream.logger  INFO REQ 2 101 in 73 μs
Gluten_lwt: writev_io_vectors: 1
Dream websocket.ml: calling Websocketaf.Wsd.schedule
Gluten_lwt: writev_io_vectors: 2
Dream websocket.ml: calling Websocketaf.Wsd.schedule
Gluten_lwt: writev_io_vectors: 3
Dream websocket.ml: calling Websocketaf.Wsd.schedule
Gluten_lwt: writev_io_vectors: 4
Dream websocket.ml: calling Websocketaf.Wsd.schedule
Gluten_lwt: writev_io_vectors: 5
Gluten_lwt: writev_io_vectors: 6
Gluten_lwt: writev_io_vectors: 7
Gluten_lwt: writev_io_vectors: 8
Gluten_lwt: writev_io_vectors: 9
...

@aantron
Copy link
Owner

aantron commented Dec 1, 2023

FYI this issue and related issues in websocket/af are also apparently the cause of the playground repeatedly stopping (it hangs with 100% CPU, backtraces and printfs show that it is in websocket/af). I'm debugging it gradually.

@hansole
Copy link

hansole commented Dec 10, 2023

I ran into the same problem. If I close the websocket from the server, it seems to work fine. But if I open a websocket, and then move away from the page so that the browser closes it from the client side, the server process starts consuming 100% CPU after about 20 seconds. I don't have to use (send on) the socket for this to happen.

Output from testing is here:
dream_webockets.out.txt
First it runs for 30 seconds and closes the websocket from the server. The second time it runs into a loop.
It shows some additional trace printing related to closing.
dreamAndFaraday.diff.txt
The program used is:
websocket.eml.ml.txt

Since the logs show that read() in websocket.ml get a close notification when client closes the connection, I tried having a receive 'monitor' started when the client connects. When it get a 'None', it calls to close the socket. This seems to prevent the situation that it runs into the CPU issue.

I think I can use it as a workaround for now, but I hope a proper fix for this bug will be found.

@aantron
Copy link
Owner

aantron commented Dec 12, 2023

Thank you @hansole!

@hansole
Copy link

hansole commented Dec 17, 2023

The read trick only works when the lost connection get signaled by the OS. If the server is running on a remote node and the net between them goes down, it still runs into the 100% CPU loop.

I have been investigating some more, and it seems to be stuck in a loop where it tries to flush some write buffers while the underlying connection is closed, so it can not make any progress.

https://github.com/anmonteiro/gluten/blob/166e1e917710e1e43b04d33a368b6701a9f8b1f5/lwt/gluten_lwt.ml#L94-L99
https://github.com/inhabitedtype/faraday/blob/c66fc020fdc2327556c36fc1e912524b427d7da8/lib/faraday.ml#L420-L434
https://github.com/inhabitedtype/websocketaf/blob/d2460e57adc248b8e6cb520222da3da96f0efb05/lib/wsd.ml#L61

I don't have complete understanding of the interaction between the different layers involved, so I might be wrong. But to me, it seems some state is missing here. If close get called from the server, it sets the state to "closed" and then it flushes buffers and exit the write loop and closes the underlying connection. If the socket get closed from the OS, it gets returned that it is closed.

https://github.com/inhabitedtype/faraday/blob/c66fc020fdc2327556c36fc1e912524b427d7da8/lwt_unix/faraday_lwt_unix.ml#L18

It still tries to flush the buffers which has no effect, and it ends up in a loop. To me it seems like the call to close from the server application should not set state to "closed", but "closing". Once the buffers are flushed, then it can move the state to "closed" and terminate the loop. It it get closed by the OS, then the state should move directly to "closed", and it should not make any attempt to empty buffers.

This is my understanding of the problem, and I might be wrong on how this actually works.

@hansole
Copy link

hansole commented Dec 21, 2023

The following change to gluten prevents it from going into the loop.
gluten_lwt.ml.diff.txt
It's probably not the best or the correct way, and it should probably not "lie", notifying that the read has exited. But it seems to work OKish, so it will have to do for now.

It still seems to be a problem if the network, between server and client. is cut before calling close. Even when calling it from the application, calling Dream.close_websocket. The socket is still being held by the application, and netstat report it as established.

@aantron. is the ping-pong protocol supported from websocket? It seems to be possible to call it on stream, but not on websocket.

Dream is using "vendor" for some libraries. I was not able to figure out how this "vendor" patches are specified and fetched, I have tried to find documentation for this, but no luck. Do you know where to find documentation?

@aantron
Copy link
Owner

aantron commented Dec 21, 2023

@hansole Thanks for continuing to look into this!

Ping and pong are supportd by the Dream API and by the latest forked websocket/af. The versions used by Dream are git submodules listed here. For websocket/af, Dream is using @anmonteiro's fork, but I patched it in a minimal way to rename the modules from Websocketaf to Dream_websocketaf so that when you install Dream, it doesn't have an internal module Websocketaf that might conflict with module Websocketaf coming from package websocketaf if you also install that in your opam switch. The submodule links point to the patches in my forks of @anmonteiro's forks. These patches are the only things my forks are for.

@aantron
Copy link
Owner

aantron commented Dec 21, 2023

The way I typically debug Dream together with these upstream libraries is that after git clone https://github.com/aantron/dream.git --recursive, I go to src/vendor/websocketaf and directly edit the code. The code in that directory is another git repo (a clone of my websocket/af fork), but Dream compiles against it directly.

@aantron
Copy link
Owner

aantron commented Dec 21, 2023

The part that fetches these patches is the --recursive flag during clone.

@aantron
Copy link
Owner

aantron commented Dec 21, 2023

And it's briefly documented in Contributing in the README, though not with this much detail.

@anmonteiro
Copy link

this should have been fixed in anmonteiro/httpun-ws#73.

There are other fixes in httpun-ws recently, too, which may be of interest to dream, including #214

@mebsout
Copy link

mebsout commented Dec 12, 2024

I still experience this issue (e.g. using the repro example provided by #230 (comment), using websocat or wscat as a client) using dream 1.0.0~alpha8 and httpun-ws 0.2.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants