Reader is not woken up in case of immediate header flush with zero response body size #162

Closed
Lupus opened this issue Oct 27, 2019 · 8 comments

Lupus commented Oct 27, 2019

The first test in the snippet below demonstrates the problem, while the second test passes just fine:

  let test_immediate_flush_on_zero_length_body () =
    let reader_woken_up = ref false in
    let continue_response = ref (fun () -> ()) in
    let error_handler ?request:_ _ = assert false in
    let response = Response.create `No_content in
    let request_handler reqd =
      continue_response := (fun () ->
        let resp_body = Reqd.respond_with_streaming
          ~flush_headers_immediately:true reqd response
        in
        Body.close_writer resp_body;
      );
    in
    let t = create ~error_handler request_handler in
    reader_ready t;
    yield_writer t (fun () -> write_response t ~body:"" response);
    read_string t "HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    yield_reader t (fun () -> reader_woken_up := true);
    !continue_response ();
    writer_yielded t;
    Alcotest.(check bool) "Reader woken up"
      true !reader_woken_up;
  ;;

  let test_length_body_no_immediate_flush () =
    let reader_woken_up = ref false in
    let continue_response = ref (fun () -> ()) in
    let error_handler ?request:_ _ = assert false in
    let response = Response.create `No_content in
    let request_handler reqd =
      continue_response := (fun () ->
        let resp_body = Reqd.respond_with_streaming
          ~flush_headers_immediately:false reqd response
        in
        Body.close_writer resp_body;
      );
    in
    let t = create ~error_handler request_handler in
    reader_ready t;
    writer_yielded t;
    read_string t "HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    yield_reader t (fun () -> reader_woken_up := true);
    !continue_response ();
    write_response t ~body:"" response;
    writer_yielded t;
    Alcotest.(check bool) "Reader woken up"
      true !reader_woken_up;
  ;;

Reproduces with 0.6.5 and 0.6.4. A subsequent request on a keep-alive connection gets stuck.

Lupus commented Nov 1, 2019

Looks like this also happens with non-empty replies. What we observe in the real service is that it gets stuck at the second request on a keep-alive connection when there is nginx in front with slice caching enabled (nginx sends multiple range requests over keep-alive connections).

Lupus commented Nov 1, 2019

I was unable to get a repro test case with data being sent, and digging further narrowed this down to a bug in the service itself (an Lwt_stream.t was never closed, which resulted in the Body.t never being closed either). Sorry for the noise; the symptoms were very similar to this issue.
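
For reference, a minimal sketch of the pattern that was at fault, assuming the service copies an Lwt_stream.t into the streaming response body (the helper name and wiring here are hypothetical): httpaf only treats the response as complete once Body.close_writer is called, so the copy has to close the writer when the stream ends.

    (* Hypothetical helper: copy an Lwt_stream.t into a streaming response body.
       The crucial step is closing the writer once the stream is exhausted;
       without it the connection stalls on the next keep-alive request. *)
    let stream_to_body (stream : string Lwt_stream.t) (body : [`write] Httpaf.Body.t) =
      let open Lwt.Infix in
      Lwt_stream.iter (fun chunk -> Httpaf.Body.write_string body chunk) stream
      >|= fun () -> Httpaf.Body.close_writer body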

Lupus commented Dec 15, 2019

@seliopou have you had a chance to look at the test cases? Do they make sense?

dpatti commented Jan 13, 2020

I'm taking a look at this now (sorry for the delay!) and I think I have my head wrapped around what's happening. That said, it's hard to tell whether this is an issue with the tests or not. For starters, I changed your first test slightly to look like this:

    reader_ready t;
    yield_writer t (fun () -> writer_woken_up := true);
    read_string t "HEAD / HTTP/1.1\r\nHost: example.com\r\n\r\n";
    yield_reader t (fun () -> reader_woken_up := true);
    !continue_response ();
    Alcotest.(check bool) "Writer woken up"
      true !writer_woken_up;
    write_response t ~body:"" response;
    writer_yielded t;
    Alcotest.(check bool) "Reader woken up"
      true !reader_woken_up;

and it worked just fine. Note that I pulled the call to write_response out of the yield_writer callback, but I still made sure the callback was firing. So it's a re-entrancy issue of sorts. This is what I think the call stack looks like in your original test, roughly:

    !continue_response ()
    respond_with_streaming
    done_waiting when_done_waiting
    <yield_writer callback>
    write_response

Note that we haven't called Body.close_writer before checking the write result and flushing the data, but this isn't strictly a problem. The test continues to close the body, and we get to writer_yielded, which does two things through next_write_operation: advance_request_queue_if_necessary and flush_response_body. These two have a co-dependency on each other, which makes this tricky: in this situation we need to advance the request queue in order to wake up the reader, but we can't advance the request queue until the final close for the body is flushed. You can see this in action by duplicating the final writer_yielded t; line before the assertion; the extra call puts the reqd in the right state, and the test passes fine.
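
Concretely, the tail of the first test with that extra call would look like this (a sketch of the change just described, not a proposed fix):

    !continue_response ();
    writer_yielded t;
    writer_yielded t;  (* extra call puts the reqd in the right state *)
    Alcotest.(check bool) "Reader woken up"
      true !reader_woken_up;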

Part of the problem here is that we are updating state in our next_(read|write)_operation functions. We actually have a branch where we've been working on fixing that. I think it'd probably be good for me to pick it up again, since I think it helps with a lot of these hard-to-debug re-entrancy problems.

Out of curiosity, are you using your own driver as opposed to the provided lwt/async drivers? Or did you just boil an observed bug down to a simple test?

Lupus commented Jan 16, 2020

We do use a custom Lwt driver, as we needed TLS support. This test case is a distilled repro of what we observed in a real service.

Actually, @anmonteiro has fixed this issue in his fork (but unfortunately I have not tried it, so I can't confirm the resolution on a real-world service). I'm not that strong on httpaf internals, but it's probably worth joining efforts and merging the fix into httpaf if possible.

Lupus commented Apr 1, 2020

Hey @dpatti @seliopou, we've observed this issue again in one of our to-be-production services, and the ~flush_headers_immediately:false workaround does not work when there are several zero-length requests on a keep-alive connection. Applying anmonteiro/httpun#34 fixes not only the tests I posted in this issue but also the real-world service that exhibits the bug.

Are there any concerns regarding upstreaming the fix?

seliopou commented

@Lupus did #181 fix this?

Lupus commented May 12, 2020

We'll need to test our services against the latest master to see if it's stable. I'll update once we have a time slot for this. You can close this issue if you're certain that #181 fixes it; I'll reopen if we run into it again after an upgrade.
