Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cohttp server crash when client dies mid-transfer #511

Closed
hcarty opened this issue Nov 1, 2016 · 18 comments · Fixed by #669
Closed

cohttp server crash when client dies mid-transfer #511

hcarty opened this issue Nov 1, 2016 · 18 comments · Fixed by #669
Labels

Comments

@hcarty
Copy link
Contributor

hcarty commented Nov 1, 2016

The following server will crash (usually) if the wrk process is canceled with a ctrl-c. I've seen at least EPIPE, ECONNRESET, EPROTOTYPE errors while testing this.

This may be related to #503

Using OCaml 4.02.3, cohttp 0.21.0. Tested on 64bit OSX and Linux with libev/conf-libev installed.

Server code:

module Server = Cohttp_lwt_unix.Server

let callback _conn _req body =
  let%lwt body = Cohttp_lwt_body.to_string body in
  Server.respond_string ~status:`OK ~body ()

let server port =
  Server.create ~mode:(`TCP (`Port port)) (Server.make ~callback ())

let () =
  Lwt_engine.set (new Lwt_engine.libev);
  Lwt_main.run (server (int_of_string Sys.argv.(1)))

Client using wrk:

wrk -s put.lua 'http://127.0.0.1:8080/'

The put.lua script creates a blob of bytes to send as a PUT request to the server.

put.lua:

local Chars = {}
for Loop = 0, 255 do
   Chars[Loop+1] = string.char(Loop)
end
local String = table.concat(Chars)

local Built = {['.'] = Chars}

local AddLookup = function(CharSet)
   local Substitute = string.gsub(String, '[^'..CharSet..']', '')
   local Lookup = {}
   for Loop = 1, string.len(Substitute) do
       Lookup[Loop] = string.sub(Substitute, Loop, Loop)
   end
   Built[CharSet] = Lookup

   return Lookup
end

function string.random(Length, CharSet)
   -- Length (number)
   -- CharSet (string, optional); e.g. %l%d for lower case letters and digits

   local CharSet = CharSet or '.'

   if CharSet == '' then
      return ''
   else
      local Result = {}
      local Lookup = Built[CharSet] or AddLookup(CharSet)
      local Range = table.getn(Lookup)

      for Loop = 1,Length do
         Result[Loop] = Lookup[math.random(1, Range)]
      end

      return table.concat(Result)
   end
end

wrk.method = "PUT"
wrk.body   = string.random(1000000, "%l%d")
wrk.headers["Content-Type"] = "text/plain"
@hcarty
Copy link
Contributor Author

hcarty commented Nov 1, 2016

Also, if I wrap Lwt_main.run in a try/with the exception doesn't seem to get captured. I end up with output like this:

Fatal error: exception Unix.Unix_error(Unix.EPROTOTYPE, "write", "")
Raised at file "map.ml", line 117, characters 16-25
Called from file "src/conv.ml", line 215, characters 10-37

instead of the expected:

Exceptional!
module Server = Cohttp_lwt_unix.Server

let callback _conn _req body =
  let%lwt body = Cohttp_lwt_body.to_string body in
  Server.respond_string ~status:`OK ~body ()

let server port =
  Server.create ~mode:(`TCP (`Port port)) (Server.make ~callback ())

let () =
  Lwt_engine.set (new Lwt_engine.libev);
  try
    Lwt_main.run (server (int_of_string Sys.argv.(1)))
  with
  | Unix.Unix_error (_, _, _) -> prerr_endline "Exceptional!"

@hcarty
Copy link
Contributor Author

hcarty commented Nov 2, 2016

And I should have checked, but cohttp-server-lwt shows the same crash when testing against wrk 'http://127.0.0.1/some_local_file'
Worse, the failure will happen at least some of the time if the call to wrk is simply allowed to run through to completion.

@hcarty
Copy link
Contributor Author

hcarty commented Nov 2, 2016

Backtrace from cohttp_server_lwt.native from master:

Fatal error: exception Unix.Unix_error(Unix.EPIPE, "write", "")
Raised by primitive operation at file "src/unix/lwt_bytes.ml", line 144, characters 43-86
Called from file "src/unix/lwt_unix.ml", line 551, characters 17-28

The lwt_bytes.ml trace is in Lwt_bytes.write and lwt_unix.ml is Lwt_unix.wrap_syscall.

Lwt_bytes.write may be coming from a channel originally from Lwt_io.of_fd, called from inside of conduit.

@hcarty
Copy link
Contributor Author

hcarty commented Nov 2, 2016

So it looks like the crash is coming from Lwt's async exception hook. If I set:

Lwt.async_exception_hook := ignore

before the call to Lwt_main.run then the server doesn't crash.

@samoht
Copy link
Member

samoht commented Nov 4, 2016

I've just got the same error. Having a non catchable exception is indeed a bit strange and I had no idea what happened... Thanks for the fix/workaround :-)

samoht added a commit to samoht/irmin that referenced this issue Nov 4, 2016
Something changed either in lwt and or cohttp which triggers this.
See mirage/ocaml-cohttp#511 for the
workaround, which seems to be working fine (at least the tests now
pass).
@hcarty
Copy link
Contributor Author

hcarty commented Nov 4, 2016

@samoht You're quite welcome, I'm glad it's useful! There are still errors, particularly when larger bodies are submitted. But it doesn't crash the server and does not (seem to) leak resources.

A possibly more helpful, definitely more verbose async exception handler that I've been using since writing this:

Lwt.async_exception_hook := (function
  | Unix.Unix_error (error, func, arg) ->
    Logs.warn (fun m ->
      m  "Client connection error %s: %s(%S)"
        (Unix.error_message error) func arg
    )
  | exn -> Logs.err (fun m -> m "Unhandled exception: %a" Fmt.exn exn)
);

@talex5
Copy link
Contributor

talex5 commented Nov 4, 2016

I suspect the culprit is here, in Conduit_lwt_unix.serve:

https://github.com/mirage/ocaml-conduit/blob/5b71d27358604faf55d0620288a6e709e9e7cd72/lib/conduit_lwt_unix.ml#L350

Cohttp handles exceptions from the user's callback function, but not errors writing the response after the user callback returns successfully:

callback-failed

@SGrondin
Copy link
Contributor

SGrondin commented Nov 4, 2016

Please correct me if I'm wrong:

  • The exception should not bubble up to the callback, because the callback is already done executing, it happens after the callback in the lifecycle. This is fine.
  • The exception should not bubble up to the Server.create call, because a lost connection shouldn't break the entire server. This is fine.
  • Other frameworks in other languages address this problem by letting users register a connection error handler on the server itself. I think this is the right solution. The problem is that this handler doesn't exist in cohttp, therefore the library passes it to the global Lwt handler. It's normal and expected that a call to Lwt_main.run crash the entire Lwt engine if there is no registered global Lwt exception handler.

@talex5
Copy link
Contributor

talex5 commented Nov 5, 2016

I'm not sure what a user-supplied connection error handler would do. The client of a web-server is always free to close the connection at any point, so this really isn't an error.

I think the real problem here is that cohttp's IO type doesn't distinguish between client/network errors (which are expected, and should be handled by the caller) and bugs (which can be raised as exceptions). So, I think all functions in IO should return a result type. Including IO.read, whose documentation currently says It returns an empty string if EOF or some other error condition occurs!

The server loop would handle errors from IO.write or flush by logging at debug level, but otherwise ignoring them.

@talex5
Copy link
Contributor

talex5 commented Nov 5, 2016

Also, conduit's error handling is a little unusual. Instead of using Lwt.async, it does:

    let c = callback (flow_of_fd client peeraddr) ic oc in
    let events = match timeout with
      |None -> [c]
      |Some t -> [c; (Lwt_unix.sleep (float_of_int t)) ] in
    let _ = Lwt.pick events >>= fun () -> close (ic,oc) in

Since this swallows all errors, it then wraps callback to call the exception hook manually:

  let callback flow ic oc = Lwt.catch
    (fun () -> callback flow ic oc)
    (fun exn -> !Lwt.async_exception_hook exn; Lwt.return_unit)
  in

This makes tracing it a bit harder than necessary, because Lwt doesn't report the callback handler as an async thread. It also suppresses errors from e.g. close.

@hcarty
Copy link
Contributor Author

hcarty commented Nov 7, 2016

@talex5 I have a small, local patch to conduit that I made while trying to track this issue down. All it does is wrap the let _ = Lwt.pick ... in Lwt.async instead. Do you think this is worth passing upstream?

@talex5
Copy link
Contributor

talex5 commented Nov 7, 2016

I can't speak for the cohttp maintainers, but I think it would be good to clean things up here. Probably the whole of the new-connection-handling code should run in async (not just the pick), and it should probably catch and log all exceptions from a connection.

callback then wouldn't need to be wrapped. But then you have to check the error handling in other places that callback could end up, so it might be a bit of work.

@avsm
Copy link
Member

avsm commented Nov 7, 2016

I can't speak for the cohttp maintainers, but I think it would be good to clean things up here. Probably the whole of the new-connection-handling code should run in async (not just the pick), and it should probably catch and log all exceptions from a connection.
callback then wouldn't need to be wrapped. But then you have to check the error handling in other places that callback could end up, so it might be a bit of work.

This sounds sensible to me. That code was written a very long time ago, and our understanding of Lwt error handling idioms has moved on since then...

@SGrondin
Copy link
Contributor

SGrondin commented Nov 7, 2016

I'm not sure what a user-supplied connection error handler would do.

In my experience, in those languages, users will set 2 handlers: on connection close and error. They act as a catch or a finally block, so you can do some custom logging, state handling (not super useful in OCaml), etc.

I'm not saying this is the way to go, I'm presenting it as an alternative. If there's no error handler, then it would go to the async_exception_hook. I know I'd personally use it to log it all properly.

@darioteixeira
Copy link

I came across this bug in a fun way: simply running nmap on a network where a bunch of cohttp-based servers were running caused them all to crash. Applying @hcarty's suggested temporary fix (setting Lwt.async_exception_hook := ignore before Lwt_main.run) seems to get around the problem for now...

pirbo added a commit to darklang/dark that referenced this issue Sep 22, 2017
@hannesm
Copy link
Member

hannesm commented Sep 23, 2018

came across this as well, any update on a more permanent fix than ignoring all exceptions?

@avsm
Copy link
Member

avsm commented Sep 23, 2018

The permanent fix is for someone to contribute a PR that modernises the interface as @talex5 suggests above: #511 (comment)

Contributions welcome to that effect!

@talex5
Copy link
Contributor

talex5 commented Jul 16, 2019

Server.create now takes an ?on_exn argument that can be used to log errors instead of crashing.

For a quick fix, we could just change the default exception handler to one that logs. Cohttp already catches exceptions from the user callback separately, so anything that gets here is going to be an error writing the response. We could log Unix.Unix_error at info or debug level and other exceptions as errors.

However, that only fixes the Unix backend. To fix e.g. Mirage too, it might be better to change the IO signature to return network errors using result. However, this would likely be a large API change for Cohttp.

IO says that an IO function "returns an empty string if EOF or some other error condition occurs on the input channel". However, the Mirage and lwt-unix implementations both appear to raise exceptions on error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants