connection: propagate I/O errors to network filters. #13941
Conversation
Signed-off-by: Piotr Sikora <[email protected]>
This is my attempt at a fix for #13939. It's still a work in progress, so errors are caught only in the raw_buffer_socket and there are no new tests, but I'd like to get feedback on the general direction.
stream_info_.setUpstreamTransportFailureReason(kUpstreamConnectionTerminationDetails);
stream_info_.setConnectionTerminationDetails(kUpstreamConnectionTerminationDetails);
stream_info_.setResponseFlag(StreamInfo::ResponseFlag::UpstreamConnectionTermination);
}
Those values are set on the upstream connection's stream info, and they are never propagated to the downstream connection's stream info or access logs. I'm not sure what the best way to solve that is.
Could the high level request object query the connection as part of the termination process to determine if termination was graceful or abrupt?
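For illustration, here is a minimal sketch of that shape, with invented class and function names; it mirrors the transport failure reason this PR already records on the stream info, but it is not a quote of any existing Envoy API.

```cpp
#include <string>
#include <utility>

// Sketch: the connection records why the transport closed, and the request
// layer asks for that reason during termination instead of the transport
// socket writing stream info directly.
class Connection {
public:
  const std::string& transportFailureReason() const { return transport_failure_reason_; }
  void setTransportFailureReason(std::string reason) {
    transport_failure_reason_ = std::move(reason);
  }

private:
  std::string transport_failure_reason_;
};

// Empty reason => graceful close; non-empty => the transport reported an error.
bool terminatedAbruptly(const Connection& connection) {
  return !connection.transportFailureReason().empty();
}
```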
I'm in favor of distinguishing between closing due to an error and closing gracefully.
For the transport socket signaling, one idea is to have another PostIoAction, but I'm not entirely sure that will make the interface any cleaner.
Should this change also include a new ConnectionEvent to send to the network filters?
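To make that question concrete, the second idea could look roughly like the sketch below; the added value and its name are purely hypothetical, and the existing values are quoted from memory.

```cpp
// Existing connection events plus a hypothetical error variant that network
// filters could distinguish from a clean remote close.
enum class ConnectionEvent {
  RemoteClose,
  LocalClose,
  Connected,
  // Hypothetical addition: the peer went away because of an I/O error.
  RemoteCloseError,
};
```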
Signed-off-by: Piotr Sikora <[email protected]>
Something like the updated commit?
The event is propagated to network filters via [...]; right now, there is no way to dispatch [...].
(Note: a few tests are failing because of unmocked expectations for the newly set response flags, etc., so please ignore them for now.)
Signed-off-by: Piotr Sikora <[email protected]>
Very interesting PR. I guess my feedback boils down to: Does it make sense to have both graceful and error close PostIoActions? A majority of the Close PostIoActions are clearly errors. The rest may also be errors.
@@ -176,12 +176,12 @@ Network::IoResult TsiSocket::doRead(Buffer::Instance& buffer) {
}

end_stream_read_ = result.end_stream_read_;
read_error_ = result.action_ == Network::PostIoAction::Close;
read_error_ = result.action_ == Network::PostIoAction::CloseError;
Should this be:
result.action_ != Network::PostIoAction::KeepOpen;
or should we consider renaming read_error_ to got_close_error_?
I guess one thing to keep in mind is that as of this PR RawBufferSocket can only return KeepOpen or CloseError from doRead. Should we consider all Close PostIoActions as errors?
KeepOpen
KeepOpen,
// Close the connection because of an error.
CloseError,
It seems that a majority of Close cases are being renamed CloseError.
Should we also rename Close to CloseGraceful or similar as a way to improve our chances of catching all cases that need to be adjusted?
Done.
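Presumably the enum then ends up along these lines; this is a sketch of the end state implied by the rename, not a quote of the actual diff.

```cpp
enum class PostIoAction {
  // Keep the connection open.
  KeepOpen,
  // Close the connection after a clean shutdown.
  CloseGraceful,
  // Close the connection because of an error.
  CloseError,
};
```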
} else {
stream_info_.setUpstreamTransportFailureReason(kUpstreamConnectionTerminationDetails);
stream_info_.setConnectionTerminationDetails(kUpstreamConnectionTerminationDetails);
stream_info_.setResponseFlag(StreamInfo::ResponseFlag::UpstreamConnectionTermination);
nit: delegate stats update to virtual method so you can avoid the dynamic cast.
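For reference, the pattern being suggested is roughly the following; the class names are made up here, and the logging call is only a stand-in for the stream-info updates.

```cpp
#include <iostream>
#include <string>

// The base class exposes a virtual hook with a no-op default, so the call
// site can report the error without a dynamic_cast to the concrete type.
class TransportCallbacks {
public:
  virtual ~TransportCallbacks() = default;
  virtual void onTransportError(const std::string& /*details*/) {}
};

class UpstreamCallbacks : public TransportCallbacks {
public:
  void onTransportError(const std::string& details) override {
    // Stand-in for stream_info_.setConnectionTerminationDetails(details), etc.
    std::cout << "upstream termination: " << details << "\n";
  }
};

// Works for any TransportCallbacks; no dynamic_cast at the call site.
void reportError(TransportCallbacks& callbacks) {
  callbacks.onTransportError("connection terminated");
}
```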
stream_info_.setResponseFlag(StreamInfo::ResponseFlag::UpstreamConnectionTermination);
}
// Force "end_stream" so that filters can process this error.
result.end_stream_read_ = true;
I usually associate end_stream_read_ with graceful termination. Is it appropriate to set this? Do we need alternate signaling to notify filters about abrupt terminations, via either additional arguments to the push pipeline or a new virtual method that is called on abrupt termination?
I agree; I think of end_stream as always a graceful operation. I think a different signal for error is more appropriate.
The issue is that there is no different signal right now, and if we don't set this, then network filters won't be executed at all.
Note that this PR is meant to be a temporary fix until we have a proper solution for #13940, but that's going to require a lot more changes and time.
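To make the alternative concrete, a dedicated error signal could take roughly the shape below; the method name and its placement on the filter interface are invented here and are not part of this PR.

```cpp
// Hypothetical extension of a network read filter interface: a separate
// callback for abrupt termination, with a no-op default so existing filters
// keep compiling, while end_stream stays reserved for graceful EOF.
class ReadFilter {
public:
  virtual ~ReadFilter() = default;
  // Existing-style data callback (buffer parameter elided for brevity).
  virtual void onData(bool end_stream) = 0;
  // Hypothetical: called once when the transport reports an I/O error.
  virtual void onTransportError() {}
};
```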
This seems like a massive hack. How much effort would it be to implement the real fix?
That depends on what you consider a real fix. I'm not too familiar with that part of the codebase, but if we want to propagate upstream connection events to network filters, then I imagine it's quite a lot of changes, so I think it would be better if one of the maintainers could work on this.
I don't know what to suggest. It would be good to figure out how to do a catch-all cleanup of WASM resources on abnormal connection termination. Could WASM detect connection termination based on an L4 filter that does the relevant signaling on destruction?
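A rough illustration of that last idea follows; the names and the cleanup hook are invented, and this is not the Proxy-Wasm API.

```cpp
#include <functional>
#include <utility>

// A minimal L4-style filter whose destructor is the catch-all: onClose()
// covers ordinary close events, and the destructor fires the same cleanup if
// the connection is torn down without any event being delivered.
class WasmCleanupFilter {
public:
  explicit WasmCleanupFilter(std::function<void()> cleanup) : cleanup_(std::move(cleanup)) {}
  ~WasmCleanupFilter() {
    if (!done_) {
      cleanup_();
    }
  }
  void onClose() {
    if (!done_) {
      cleanup_();
      done_ = true;
    }
  }

private:
  std::function<void()> cleanup_;
  bool done_{false};
};
```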
FWIW, we have a workaround in #13836 that works for Wasm.
@@ -110,7 +110,7 @@ Network::IoResult SslSocket::doRead(Buffer::Instance& read_buffer) {
if (info_->state() != Ssl::SocketState::HandshakeComplete &&
info_->state() != Ssl::SocketState::ShutdownSent) {
PostIoAction action = doHandshake();
if (action == PostIoAction::Close || info_->state() != Ssl::SocketState::HandshakeComplete) {
if (action != PostIoAction::KeepOpen || info_->state() != Ssl::SocketState::HandshakeComplete) {
Also change post actions in NotReadySslSocket to CloseError
Done.
@@ -145,7 +145,7 @@ Network::PostIoAction TsiSocket::doHandshakeNextDone(NextResultPtr&& next_result
if (read_error_ || (!handshake_complete_ && end_stream_read_)) {
ENVOY_CONN_LOG(debug, "TSI: Handshake failed: end of stream without enough data",
callbacks_->connection());
return Network::PostIoAction::Close;
return Network::PostIoAction::CloseError;
There are 2 remaining uses of Network::PostIoAction::Close in this file, where a read returns 0 during the handshake. Would it make sense to also consider those cases CloseError?
I changed one to CloseError and one to CloseGraceful (when the certificate validation failed, since that's not an I/O error).
I guess the question is whether it is worth having the extra complexity that comes from having 2 close types when, out of the 3 remaining uses, 2 of them are handshake errors. I guess it's fine to use CloseGraceful for the 3 remaining cases.
@@ -242,7 +242,7 @@ Network::PostIoAction SslHandshakerImpl::doHandshake() {
return PostIoAction::KeepOpen;
default:
handshake_callbacks_->onFailure();
return PostIoAction::Close;
return PostIoAction::CloseError;
Is the PostIoAction::Close on line 233 above a graceful or an error close?
Graceful.
Signed-off-by: Piotr Sikora <[email protected]>
Signed-off-by: Piotr Sikora <[email protected]>
@@ -2699,7 +2699,7 @@ TEST_P(SslSocketTest, ShutdownWithoutCloseNotify) {
Buffer::OwnedImpl data("hello");
server_connection->write(data, false);
EXPECT_EQ(data.length(), 0);
// Calling close(FlushWrite) in onEvent() callback results in PostIoAction::Close,
// Calling close(FlushWrite) in onEvent() callback results in PostIoAction::CloseGraceful,
Seeing some failures in //test/extensions/transport_sockets/tls:ssl_socket_test, examples:
2020-11-14T01:05:04.6013231Z Uninteresting mock function call - taking default action specified at:
2020-11-14T01:05:04.6013980Z test/mocks/stream_info/mocks.cc:30:
2020-11-14T01:05:04.6014503Z Function call: setConnectionTerminationDetails("downstream connection was terminated")
2020-11-14T01:05:04.6024240Z unknown file: Failure
2020-11-14T01:05:04.6025476Z Uninteresting mock function call - taking default action specified at:
2020-11-14T01:05:04.6026043Z test/mocks/stream_info/mocks.cc:24:
2020-11-14T01:05:04.6026487Z Function call: setResponseFlag(16384)
Yeah, see my previous comment: I didn't add mocks for them, since it's a lot of extra noise in the PR for code that is still being discussed. Also, I'm not sure if all 3 values (connection termination details, upstream transport failure reason, and response flags) should be set.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
Signed-off-by: Piotr Sikora <[email protected]>