Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for read interest when channel is unwritable while HttpClient.send(Mono) is used #2864

Merged
merged 14 commits into from
Sep 15, 2023

Conversation

pderop
Copy link
Contributor

@pderop pderop commented Jul 25, 2023

Motivation:

When using HttpClient.send with a Mono publisher and plain POST HTTP/1.1, a problem may occur if the channel becomes unwritable during request flushing. If the remote server sends an early response (e.g., 400 bad request) and closes the connection without reading the request post data, the client might fail with a "Connection prematurely closed BEFORE response" error instead of reporting the actual early server response (e.g., 400 bad request).

This issue arises because, in the case of a Mono publisher and HTTP/1.1 plain, the channel read interest is not enabled when the channel becomes unwritable, preventing the client from reading the response and leading to "Connection prematurely closed BEFORE response" errors.

For example, in GH #2825, the user has provided a reproducer example, where the server (port 8000) sends a 400 bad request to the client which is still writing a large POST HTTP/1.1 request data (the client is often blocked while writing because of TCP flow control). In the following wireshark, the frame 602 is the 400 bad request returned to the user. Notice that the 400 bad request has been sent in two frames (in frame 598, the client first gets the 400 bad request headers in chunk, followed by the body, and in frame 602, we get the last zero-length chunk header that is delimiting the end of the message):

Screenshot 2023-07-26 at 21 15 56

Then in frame 603, the server receives the TCP/ACK highlighted with a [TCP Window Full] info, meaning that the client is using the full capacity of the TCP flow, limited by the server's receive window which is full.

Finally, a TCP/RST is then sent to the client in frame 767, but the client has not requested for read() interest, so it misses the 400 bad request, and then aborts with Connection prematurely closed BEFORE response errors:

Screenshot 2023-07-26 at 21 16 21

By requesting for read interests while writing the big POST data request, we can alleviate the issue and be able to report the 400 bad request to the user instead of misleading Connection prematurely closed BEFORE response

Caution: This PR may alleviate the issue, but sometimes, it may not avoid the problem because TCP/RST is not graceful and some packets may not be handled by the client. For example, if the 400 bad request is sent in chunked encoding and is segmented in 3 frames (1st with the headers, 2nd with the the body, and the 3rd with the last zero-length chunk), then the last 3rd frame with the last chunk EOF may be missed, even if we see it from wireshark on the client machine ...). In this case, we may get a Connection prematurely closed DURING response error on the client. This we sometimes won't be able to avoid it (because of the nature of the TCP/RST).

For example, with curl, we usually always get the 400 bad request, but if we manage to let the server send the 400 bad request in three frames (the headers, the body, and the last chunk), then sometimes curls will fail like this:

curl -v --header "Expect:"  -H "Content-Type: text/plain" -d @sample_request.json http://xxxx.xxxx.xxxx.xxxx:8000/payload-size
> POST /large-payload HTTP/1.1
> Host: xxxx.xxxx.xxxx.xxxx:8000
> User-Agent: curl/7.88.1
> Accept: */*
> Content-Type: text/plain
> Content-Length: 1561039
> 
* Recv failure: Connection reset by peer
* Closing connection 0
curl: (56) Recv failure: Connection reset by peer

But most of the time, curl is able to get the 400 bad request response:

In the context of Tomcat (used by the reproducer, the 400 bad request is sent in two frames: first for the headers+body, second one for the last chunk header.

Modifications:

Requesting for read interest when HttpClient.send method is used with a Mono publisher seems to fix the issue, and we can now get the early 400 bad request server response instead of the unexpected Connection prematurely closed BEFORE response error.

Now, read interest could be systematically enabled, like this in the HttpOperations.send(Publisher) method:

	public NettyOutbound send(Publisher<? extends ByteBuf> source) {
		if (!channel().isActive()) {
			return then(Mono.error(AbortedException.beforeSend()));
		}
		if (source instanceof Mono) {
			return new PostHeadersNettyOutbound(((Mono<ByteBuf>) source)
					.flatMap(b -> {
						if (markSentHeaderAndBody(b)) {
							HttpMessage msg = prepareHttpMessage(b);

							try {
								afterMarkSentHeaders();
							}
							catch (RuntimeException e) {
								ReferenceCountUtil.release(b);
								return Mono.error(e);
							}
->                                                      channel.read();
							return FutureMono.from(channel().writeAndFlush(msg));
						}
						return FutureMono.from(channel().writeAndFlush(b));
					})
					.doOnDiscard(ByteBuf.class, ByteBuf::release), this, null);
		}
		return super.send(source);
	}

Instead of doing a read() systematically, this PR uses a different approach and only requests for channel read() if the channel becomes unwritable: The ChannelOperationsHandler is now overriding the channelWritabilityChanged callback, which delegates to any registered ChannelOperation (there is also an empty default onWritabilityChanged method that has been added in ChannelOperations). So, the onWritabilityChanged is only implemented by the HttpClientOperations class, and when the channel becomes unwritable, it will then invoke channel().read(), but only if the request has been sent using HttpClient.send(Mono<ByteBuf>), and plain HTT/1.1 is used. I do not think it is needed to do the same when HTTPS, H2C, HTTP2, or WebSocket are used.

In order to detect if HttpClient.send(Mono<ByteBuf>) has been used, the PR relies on the HttpOperations.hasSentBody() new method. Please check the important javadoc done on top of HttpClientOperations.onWritabilityChanged()

There is a last issue: in AbstractHttpClientMetricsHandler, since now it's possible that we receive an early response before the corresponding request has been fully flushed, care must be taken, because once channelRead() receives a full response, it will call recordRead(), and after it will call reset(),. So that's a problem if the promise listener of the write method completes later, because it will then call recordWrite() but at this point, all class fields will have already been cleared by the reset() previously called by channelRead().
To handle such problem: some sequence numbers are now used in order to let the channelRead() detect if the full write has not yet completed when a full response is received. In this case channelRead() will call itself recordWrite() on behalf of the write method, and then recordRead() is called, and then reset().

Added two tests:

Fixes #2825

@pderop pderop added the type/bug A general bug label Jul 25, 2023
@pderop pderop added this to the 1.0.35 milestone Jul 25, 2023
@pderop pderop marked this pull request as draft July 25, 2023 19:45
@pderop pderop force-pushed the 1.0.x-gh-2825 branch 4 times, most recently from 8275e76 to 168eef3 Compare July 27, 2023 08:45
@pderop
Copy link
Contributor Author

pderop commented Jul 27, 2023

I don't think the windows test failure is related to this PR.

@pderop pderop marked this pull request as ready for review July 27, 2023 09:56
@pderop
Copy link
Contributor Author

pderop commented Jul 27, 2023

@violetagg ,

When you will be available (this is not urgent), can you take a look ?
thanks.

@pderop
Copy link
Contributor Author

pderop commented Sep 1, 2023

I have updated the HttpClientTest in order to also test HTTP/1.1 plain protocol.

@pderop
Copy link
Contributor Author

pderop commented Sep 1, 2023

@violetagg ,

the checks are failing on windows , let me verify that before reviewing ...

@pderop pderop marked this pull request as draft September 1, 2023 15:53
@pderop pderop force-pushed the 1.0.x-gh-2825 branch 5 times, most recently from 490ef0b to 72e565b Compare September 4, 2023 11:19
@pderop
Copy link
Contributor Author

pderop commented Sep 4, 2023

rebased on top of 1.0.x, in order to pick up #2892

@pderop pderop modified the milestones: 1.0.36, 1.0.37 Sep 4, 2023
@pderop pderop force-pushed the 1.0.x-gh-2825 branch 2 times, most recently from ca06e10 to 5957147 Compare September 9, 2023 09:25
@pderop
Copy link
Contributor Author

pderop commented Sep 11, 2023

After some research, it appears that the tests from this PR that were unstable so far were using a reactor netty http server. But this is a different use case than the reproducer project provided from #2825, which was based on Tomcat.

When a request is aborted, Tomcat continues to read request body bytes up to two mega bytes once a final response (400) is sent, and this allows the reactor netty client to have more times to be able to see the 400 bad request (using the patch from this PR). See maxSwallowSize in Tomcat documentation.

Now, when using reactor netty http server, the issue is that the connection is closed right after the 400 bad request is sent, and on localhost, the tests may be unstable because TCP/RST may be sent to the client, which then may miss the 400 bad request.

I have removed all reactor netty server based tests, and I have only left the Tomcat test that is using HTTP/1.1 plain, and the tests are now stable.

Turning this PR to ready for review (in the latest checks, I don't think that the errors from the Windows matrix are related).

@pderop pderop marked this pull request as ready for review September 11, 2023 15:42
@violetagg
Copy link
Member

Please fix the checkstyle warning

> Task :reactor-netty-http:checkstyleMain
/home/runner/work/reactor-netty/reactor-netty/reactor-netty-http/src/test/java/reactor/netty/http/client/HttpClientWithTomcatTest.java:75: warning: [UnnecessaryParentheses] These grouping parentheses are unnecessary; it is unlikely the code will be misinterpreted without them
	private static final byte[] PAYLOAD = String.join("", Collections.nCopies((TomcatServer.PAYLOAD_MAX) + (1024 * 1024), "X"))
	                                                                          ^
    (see https://errorprone.info/bugpattern/UnnecessaryParentheses)
  Did you mean 'private static final byte[] PAYLOAD = String.join("", Collections.nCopies(TomcatServer.PAYLOAD_MAX + (1024 * 1024), "X"))'?

@pderop
Copy link
Contributor Author

pderop commented Sep 14, 2023

Fixed checkstyle warning in 69c0b14

@pderop
Copy link
Contributor Author

pderop commented Sep 14, 2023

I have applied your feedbacks (thanks), can you check ?

@pderop
Copy link
Contributor Author

pderop commented Sep 15, 2023

@violetagg , thanks for the review.

@pderop pderop merged commit 50b24b3 into reactor:1.0.x Sep 15, 2023
@pderop pderop deleted the 1.0.x-gh-2825 branch September 15, 2023 06:45
pderop added a commit that referenced this pull request Sep 15, 2023
pderop added a commit that referenced this pull request Sep 15, 2023
pderop added a commit that referenced this pull request Sep 22, 2023
Fixed flaky test in HttpClientWithTomcatTest.testIssue2825.

To reproduce the issue, we need Tomcat to close the connection while the client is still writing. This flakiness occurs because Tomcat closes the connection without reading all remaining data. Depending on the unread data’s size, it may result in TCP sending a TCP/RST instead of a FIN. When the client receives TCP/RST, some or all unread data may be dropped.

So, the socket send buffer size in HttpClient has been reduced, which eliminated the flakiness of the test and most of TCP/RST. Additionally, returning a 400 bad request without chunk encoding reduces the chance of losing data, as it sends only one TCP segment (compared to two segments with chunk encoding). These workarounds seem to fix the instability of the test, and if the patch is disabled, the PrematureCloseException reliably reoccurs with the test. I also removed the retries, the tests are running in around 1,5-2 seconds.

The test for the case when HttpClient sends the request using Flux has been removed, because it seems unstable, and maybe it's a different problem, which must be addressed in a different issue.

Related to #2864 #2825
@violetagg violetagg changed the title Request for read interest when channel is unwritable while HttpClient.send(Mono) is used Request for read interest when channel is unwritable while HttpClient.send(Mono) is used Sep 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug A general bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants