Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrency issue with server-side header encoding #45

Open
psalin opened this issue Jan 17, 2021 · 25 comments
Open

Concurrency issue with server-side header encoding #45

psalin opened this issue Jan 17, 2021 · 25 comments

Comments

@psalin
Copy link
Collaborator

psalin commented Jan 17, 2021

When handling concurrent requests, the server may send out an HTTP/2 HEADERS frame that refers to headers in the hpack dynamic table that are not yet allocated at the client side. This causes the client to fail to decode the headers and in case of the grpcbox client it causes it to crash. This issue can be reproduced by running the unary concurrency test of PR #41.

The issue here can best be seen from a PCAP dump of the traffic generated by the test. In the attached dump, the first HEADERS sent by the server occurs in packet 324 (stream ID 101). Here the HTTP/2 Header Block Fragment contains the headers in long format as its the first time they are sent. The client decodes 3 dynamic headers and now its hpack dynamic table contains values at positions 62, 63 and 64.

Next in packet 325 the server sends a HEADERS for stream 127. This message has a Header Block Fragment of 88c2c1c0, of which c2,c1 and c0 are referring to the dynamic table positions 66, 65, and 64. Since the client has not received any headers yet for position 66 and 65, it fails to decode the headers. The message sent by the server is invalid.

The same error occurs in packet 327 for the next stream 117.

Finally in packet 328, the server sends the trailers for the original stream 101, these 2 are also in the long format and once the client has decoded them it will have values at positions 62-66. At this point the invalid packets (325 and 327) sent earlier match the dynamic headers they are referring to in a correct way. This likely means that the order that the server has encoded the messages is not guaranteed to be the order they are sent out in. If they were, packet 328 with the trailers would have been sent out before 325 and 327.

fail.pcap.gz

@tsloughter
Copy link
Owner

Pretty sure I understand the issue but still not the solution. I need to look over the http2 spec. To make sure I understand, the issue seems to be: we encode headers for stream N, creating an updated hpack context which we use as the hpack context when sending headers on stream N+2 resulting in it using table positions, but I can't imagine there is any requirement about sending or handling of frames based on their stream ids -- maybe there is and that is what is missing, but I would expect, what with all the dynamics of streams and frame windows that it is supposed to not fall apart if stream N+2's frames are handled before stream N's. So then the hpack context would have to be per-stream, but that isn't correct... so I'm a bit stumped, hopefully digging through the spec will make it clear.

@tsloughter
Copy link
Owner

Also, based on the fact that the pcap shows the headers as <undefined> in wireshark means we can conclude it is the fault of the server using indexes unknown to the client and not a potential bug in the client not updating its index properly?

@tsloughter
Copy link
Owner

Hm, reading into the spec and a spot I could see being buggy is all header frames must be sent without interleaving any other frames to other streams. I don't know that that is what is happening, I don't think the pcap data is pointing to that, but a guess.

@tsloughter
Copy link
Owner

Meh, I am not seeing how this would happen on the server side. It sends all headers without any possible interleaving in h2_connection, grpcbox calls h2_connect:send_headers from grpcbox_stream. I'm less sure about trailers and the client, but the issue appears to come up in what the server sends... so I'm lost at the moment.

Have you come up with any ideas?

@psalin
Copy link
Collaborator Author

psalin commented Jan 23, 2021

This is a server side issue in this case. The trailers are sent by the server. The order is as in the sample unary request here: https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md#example.

I took some traces from the code now just before hpack:encode() was called in both h2_connection:send_headers_() and h2_connection:handle_event(_, {send_trailers, ...}). I needed to store the trace in an ets table and print it afterwards, since printing when hitting the functions messes the timings so that the issue does not reproduce.
Anyway, for a sample run, in the code the ordering was as follows:

 {1,send_headers,81},
 {1,send_trailers,81},
 {1,send_headers,5},
 {1,send_headers,7},
 {1,send_trailers,5},
 {1,send_trailers,7},
 {1,send_headers,11},
 {1,send_trailers,11},

But in the resulting test.pcap.gz, the order was:

Packet 318: HEADERS[81],
Packet 321: HEADERS[5],
Packet 322: HEADERS[81], (trailers)
Packet 323: HEADERS[7],

So to me it looks like trailers might be encoded before headers of another stream, but still sent out after them. The hpack encoding is per HTTP/2 connection and not per stream as you said. I think in order to fulfil the protocol, HEADERS (regardless of stream) need to be sent out in the order that the hpack encoding has been done, otherwise the dynamic header tables cannot stay in sync between client and server.

I am not very familiar yet with the execution flow inside the server, so not currently sure how the trailers get overtaken by headers from other streams in the code.

@tsloughter
Copy link
Owner

I must have misunderstood, you think the issue is related to the trailers and headers sharing some values? As in, the trailers of stream N are encoded, updating the table, and before being sent the headers of stream N+2 are encoded with the updated table, and they share some values so indices are used in the encoded header frames instead of their full values?

@tsloughter
Copy link
Owner

I do see trailers are sent in h2_stream_set (in two places...) while headers are sent in h2_connection. Maybe that is how the order ultimately gets mixed up. I still struggle to read parts of chatterbox :)

@tsloughter
Copy link
Owner

After trying to trace the related chatterbox modules I just added some log messages and see it happening (I'm running your test) but seems so odd...

Trailers are sent with the second set using the updated dynamic table:

*** User 2021-01-23 08:02:00.976 ***
Stream: 1 SEND TRAILERS [{{frame_header,30,1,5,1},
                          {headers,undefined,
                                   <<64,11,103,114,112,99,45,115,116,97,116,
                                     117,115,1,48,64,12,103,114,112,99,45,
                                     109,101,115,115,97,103,101,0>>}}]


*** User 2021-01-23 08:02:00.976 ***
Stream: 3 SEND TRAILERS [{{frame_header,2,1,5,3},
                          {headers,undefined,<<"¿¾">>}}]

Then later the headers for stream 1 and 3 are decoded and 2 are undefined, but they are decoded in the h2_connection process meaning the stream 3 decode should be using the table updated when decoding stream 1:

*** User 2021-01-23 08:02:01.320 ***
Stream: 1 HEADERS [{<<":status">>,<<"200">>},
                   {<<"user-agent">>,<<"grpc-erlang/0.1.0">>},
                   {<<"content-type">>,<<"application/grpc+proto">>},
                   {<<"grpc-encoding">>,<<"identity">>}]


*** User 2021-01-23 08:02:01.320 ***
Stream: 3 HEADERS [{<<":status">>,<<"200">>},
                   undefined,undefined,
                   {<<"user-agent">>,<<"grpc-erlang/0.1.0">>}]

@tsloughter
Copy link
Owner

I'm logging the dynamic table too now and it is clearly updated. The table when failing to decode the trailers is:

{hpack_context,
  {dynamic_table,
    [{62,
     <<"grpc-encoding">>,
     <<"identity">>},
     {63,
     <<"content-type">>,
     <<"application/grpc+proto">>},
     {64,
     <<"user-agent">>,
     <<"grpc-erlang/0.1.0">>}],
4096,
178},
4096}

@tsloughter
Copy link
Owner

Oh,you were saying it was getting values 65 and 66, which don't exist. And now I see:

{hpack_context,
 {dynamic_table,
  [{62,
    <<"grpc-message">>,
    <<>>},
   {63,
    <<"grpc-status">>,
    <<"0">>},
   {64,
    <<"grpc-encoding">>,
    <<"identity">>},
   {65,
    <<"content-type">>,
    <<"application/grpc+proto">>},
   {66,
    <<"user-agent">>,
    <<"grpc-erlang/0.1.0">>}],
  4096,
  266},
 4096}

Is the table after the trailers have been decoded. I was mixing up headers and trailers.

So the table has the proper entries for the headers sent as indices, its that their indices differ between the server and client. Probably what you already tried to explain, hehe, but now I see it :)

@tsloughter
Copy link
Owner

Yea, ok, so it encodes the trailers, updating the encoding context in the connection process. But it only enqueues the trailers for sending, it doesn't send them immediately. This must be why the encoding context for the next set of headers has been updated like trailers had been sent but they hadn't necessarily. The headers are sent inline and not by enqueuing them.

I guess one or the other has to change. Either trailers need to be sent immediately, inline when encoded, or headers need to be enqueued to the stream set.

Not sure which is the right way to go. I would have assumed moving it all to being enqueued but headers have some special rule about sending them all with nothing else in between, if I read the spec correctly... though maybe that is still achievable with the stream_set enqueuing method.

@tsloughter
Copy link
Owner

Worse, the trailers are sent by casting a message to the stream process, so even enqueuing likely still means it is a race. So the only option (with the current design at least) is to send them in the connection process.

@tsloughter
Copy link
Owner

Not so easy since there could still be some body left to send when send_trailers is called... seems this will require more refactoring than I'd hoped.

@tsloughter
Copy link
Owner

Considering just removing index updates for trailers for now... It is the simplest solution and the optimization can return when I do a refactoring soon.

@tsloughter
Copy link
Owner

@psalin let me know if this is acceptable tsloughter/chatterbox#4 and I'll merge/release it and grpcbox.

@psalin
Copy link
Collaborator Author

psalin commented Jan 24, 2021

I don't think it can be fixed like that, it breaks HTTP/2 interoperability. The client will still decode the trailers and add them to the hpack context and that changes the indexes. And since the indexes work so that the newest one is always put to the first position and the existing ones are moved back, then the indexes sent by the server will refer to the wrong headers. Even if that would be changed not to happen in grpcbox it would still break interoperability with any other GRPC/HTTP2 client.

@psalin
Copy link
Collaborator Author

psalin commented Jan 24, 2021

Both send_headers and send_trailers are part of the h2_connection gen_statem so at least at that point the order is guaranteed. For headers sock:send() is called directly so they should go out in order. For trailers h2_stream_set:send_what_we_can is() called, which then I guess means they might not be sent yet since we see headers pass them?

@tsloughter
Copy link
Owner

Ugh, so I figured it'd work since chatterbox will always send the whole header, so they aren't looked up in the table by any implementation. But yea, if the index of a header is changed by the insertion of the trailer then it'll break when chatterbox later sends a header by index.. grrr

I really don't think there is a way to fix this in the current design without some big hacks. I guess I can try to figure out just how bad those hacks would be.

@psalin
Copy link
Collaborator Author

psalin commented Jan 25, 2021

Yes, seems this might be an issue that cannot be fixed with a few lines. Basically sending of the trailers and headers which are both HTTP2 HEADERS frames need to be handled in the same place or at least synced somehow so that the order stays. I will try to look at the code too when I can find time.

The good thing from grpcbox point of view as that it seems to send the same trailers always (at least if there no errors) so once the trailers have been successfully received by the client once, then all subsequent messages will just reuse the indexes and it seems to work well for the remainder of the HTTP connection. Seems the issue would mostly happen if there is multiple concurrent requests right at the start of the HTTP connection.

@tsloughter
Copy link
Owner

Yea, I think it may be fixable with not too much work but it will come at the cost of performance, I just don't know how much. Basically I'm trying to change it to call the connection process to encode and send the trailers where it would normally be sending the trailers on the socket directly. So instead of encoding the trailers and then adding to frames to be sent it adds them as they are (not as frames, so no encoding) and waits to encode until it is sending them, which has to be done in the connection process to update the encode context.

@tsloughter
Copy link
Owner

Seems I got it. All grpcbox and chatterbox tests passing. It is a little ridiculous -- I added an actually_send_trailers function, since send_trailers meant to set what trailers to send once a streams body is done and a function that would actually encode and send the trailers (called by the stream once its body is done) had to be added.

I haven't traced out all the stream state possibilities but I think it is fairly simple since trailers means the stream is closing (I hope!). Will open a PR this morning.

@tsloughter
Copy link
Owner

Newest attempt: tsloughter/chatterbox#5

@psalin
Copy link
Collaborator Author

psalin commented Jan 26, 2021

This looks nice, just deferring header encoding until its about to send. So I guess performance-wise the main difference is that there is an extra step and that trailers are now stored in the stream set as unencoded, I guess that shouldn't have any big impact on performance?

@tsloughter
Copy link
Owner

Yea, I don't think it should have much of an impact on performance.

@psalin
Copy link
Collaborator Author

psalin commented Jan 29, 2021

Sounds good and looks all good to me!

As an unrelated side note, could you add the tag v0.6.0 for ctx? Seems the version 0.6.0 is available in hex but not tagged in the repo and I think we'd need that tag in order to clear licensing on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants