Reader is not woken up in case of immediate header flush with zero response body size #162

Closed
Lupus opened this issue Oct 27, 2019 · 8 comments

Lupus commented Oct 27, 2019

The first test in the snippet below demonstrates the problem, while the second test passes just fine:

  let test_immediate_flush_on_zero_length_body () =
    let reader_woken_up = ref false in
    let continue_response = ref (fun () -> ()) in
    let error_handler ?request:_ _ = assert false in
    let response = Response.create `No_content in
    let request_handler reqd =
      continue_response := (fun () ->
        let resp_body = Reqd.respond_with_streaming
          ~flush_headers_immediately:true reqd response
        in
        Body.close_writer resp_body;
      );
    in
    let t = create ~error_handler request_handler in
    reader_ready t;
    yield_writer t (fun () -> write_response t ~body:"" response);
    read_string t "HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    yield_reader t (fun () -> reader_woken_up := true);
    !continue_response ();
    writer_yielded t;
    Alcotest.(check bool) "Reader woken up"
      true !reader_woken_up;
  ;;

  let test_length_body_no_immediate_flush () =
    let reader_woken_up = ref false in
    let continue_response = ref (fun () -> ()) in
    let error_handler ?request:_ _ = assert false in
    let response = Response.create `No_content in
    let request_handler reqd =
      continue_response := (fun () ->
        let resp_body = Reqd.respond_with_streaming
          ~flush_headers_immediately:false reqd response
        in
        Body.close_writer resp_body;
      );
    in
    let t = create ~error_handler request_handler in
    reader_ready t;
    writer_yielded t;
    read_string t "HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    yield_reader t (fun () -> reader_woken_up := true);
    !continue_response ();
    write_response t ~body:"" response;
    writer_yielded t;
    Alcotest.(check bool) "Reader woken up"
      true !reader_woken_up;
  ;;

Reproduces with 0.6.5 and 0.6.4. A subsequent request on a keep-alive connection gets stuck.

Lupus commented Nov 1, 2019

Looks like this also happens with non-empty replies. What we observe in the real service is that it gets stuck at the second request on a keep-alive connection when there is nginx in front with slice caching enabled (nginx sends multiple range requests over keep-alive connections).

Lupus commented Nov 1, 2019

I was unable to get a repro test case with data being sent, and digging further narrowed this down to a bug in the service itself (an Lwt_stream.t was never closed, which resulted in the Body.t never being closed either). Sorry for the noise; the symptoms were very similar to this issue.
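
For reference, a minimal sketch of the pattern that was at fault, assuming the service copies an Lwt_stream.t into the streaming response body (the helper name and wiring here are hypothetical): httpaf only treats the response as complete once Body.close_writer is called, so the copy has to close the writer when the stream ends.

    (* Hypothetical helper: copy an Lwt_stream.t into a streaming response body.
       The crucial step is closing the writer once the stream is exhausted;
       without it the connection stalls on the next keep-alive request. *)
    let stream_to_body (stream : string Lwt_stream.t) (body : [`write] Httpaf.Body.t) =
      let open Lwt.Infix in
      Lwt_stream.iter (fun chunk -> Httpaf.Body.write_string body chunk) stream
      >|= fun () -> Httpaf.Body.close_writer body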

Lupus commented Dec 15, 2019

@seliopou have you had a chance to look at the test cases? Do they make sense?

dpatti commented Jan 13, 2020

I'm taking a look at this now (sorry for the delay!) and I think I have my head wrapped around what's happening. That said, it's hard to tell whether this is an issue with the tests or not. For starters, I changed your first test slightly to look like this:

    reader_ready t;
    yield_writer t (fun () -> writer_woken_up := true);
    read_string t "HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    yield_reader t (fun () -> reader_woken_up := true);
    !continue_response ();
    Alcotest.(check bool) "Writer woken up"
      true !writer_woken_up;
    write_response t ~body:"" response;
    writer_yielded t;
    Alcotest.(check bool) "Reader woken up"
      true !reader_woken_up;

and it worked just fine. Note that I pulled the call to write_response out of the yield_writer callback, but I still made sure the callback was firing. So it's a re-entrancy issue of sorts. This is what I think the call stack looks like in your original test, roughly:

    !continue_response ()
    respond_with_streaming
    done_waiting when_done_waiting
    <yield_writer callback>
    write_response

Note that we haven't called Body.close_writer before checking the write result and flushing the data, but this isn't strictly a problem. The test continues to close the body, and we get to writer_yielded, which does two things through next_write_operation: advance_request_queue_if_necessary and flush_response_body. These two have a co-dependency on each other, which makes this tricky: in this situation we need to advance the request queue in order to wake up the reader, but we can't advance the request queue until the final close for the body is flushed. You can see this in action by duplicating the final writer_yielded t; line before the assertion; the extra call puts the reqd in the right state, and the test passes fine.
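
Concretely, the tail of the first test with that extra call would look like this (a sketch of the change just described, not a proposed fix):

    !continue_response ();
    writer_yielded t;
    writer_yielded t;  (* extra call puts the reqd in the right state *)
    Alcotest.(check bool) "Reader woken up"
      true !reader_woken_up;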

Part of the problem here is that we are updating state in our next_(read|write)_operation functions. We actually have a branch where we've been working on fixing that. I think it'd probably be good for me to pick it up again, since I think it helps with a lot of these hard-to-debug re-entrancy problems.

Out of curiosity, are you using your own driver as opposed to the provided lwt/async drivers? Or did you just boil an observed bug down to a simple test?

Lupus commented Jan 16, 2020

We do use a custom Lwt driver, as we needed TLS support. This test case is a distilled repro of what we observed in a real service.

Actually, @anmonteiro has fixed this issue in his fork (but unfortunately I have not tried it, so I can't confirm the resolution on a real-world service). I'm not that strong on httpaf internals, but it's probably worth joining efforts and merging the fix into httpaf if possible.

Lupus commented Apr 1, 2020

Hey @dpatti @seliopou, we've observed this issue again in one of our to-be-production services, and the ~flush_headers_immediately:false workaround does not work when there are several zero-length requests on a keep-alive connection. Applying anmonteiro/httpun#34 fixes not only the tests I posted in this issue but also the real-world service that exhibits the bug.

Are there any concerns regarding upstreaming the fix?

seliopou commented

@Lupus did #181 fix this?

Lupus commented May 12, 2020

We'll need to test our services against the latest master to see if it's stable. I'll update once we have a time slot for this. You can close this issue if you're certain that #181 fixes it; I'll reopen if we run into it again after an upgrade.
