Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connection: propagate I/O errors to network filters. #13941

Closed

Conversation

PiotrSikora
Copy link
Contributor

Signed-off-by: Piotr Sikora [email protected]

Copy link
Contributor Author

@PiotrSikora PiotrSikora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my attempt at a fix for #13939. It's still work-in-progress, so errors are caught only in the raw_buffer_socket and there are no new tests, but I'd like to get feedback on the general direction.

cc @alyssawilk @antoniovicente @ggreenway @kyessenov

stream_info_.setUpstreamTransportFailureReason(kUpstreamConnectionTerminationDetails);
stream_info_.setConnectionTerminationDetails(kUpstreamConnectionTerminationDetails);
stream_info_.setResponseFlag(StreamInfo::ResponseFlag::UpstreamConnectionTermination);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those values are set on the upstream connection's stream info, and they are never propagated to downstream connection's stream info or access logs. I'm not sure what's the best way to solve that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the high level request object query the connection as part of the termination process to determine if termination was graceful or abrupt?

Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of distinguishing between closing due to error, or graceful.

For the transport socket signaling, one idea is to have another PostIoAction. But I'm not entirely sure that will make the interface any cleaner.

Should this change also include a new ConnectionEvent to send to the network filters?

@antoniovicente antoniovicente self-assigned this Nov 9, 2020
@PiotrSikora
Copy link
Contributor Author

I'm in favor of distinguishing between closing due to error, or graceful.

For the transport socket signaling, one idea is to have another PostIoAction. But I'm not entirely sure that will make the interface any cleaner.

Something like the updated commit?

Should this change also include a new ConnectionEvent to send to the network filters?

The event is propagated to network filters via end_stream=true, and network filters can (well, once I figure out how to approach the downstream/upstream stream info issue - any thoughts?) learn about the error via streamInfo()->getResponseFlags().

Right now, there is no way to dispatch ConnectionEvent about upstream connection to network filters, and that's much bigger change than this (see: #13940), so I'd prefer not to mix them together to avoid unnecessary delays.

@PiotrSikora
Copy link
Contributor Author

(Note: a few tests are failing because unmocked expectations for set response flags, etc., so please ignore them for now)

Signed-off-by: Piotr Sikora <[email protected]>
@PiotrSikora PiotrSikora marked this pull request as ready for review November 10, 2020 02:48
@yanavlasov yanavlasov self-requested a review November 10, 2020 21:04
@yanavlasov yanavlasov self-assigned this Nov 10, 2020
Copy link
Contributor

@antoniovicente antoniovicente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting PR. I guess my feedback boils down to: Does it make sense to have both graceful and error close PostIoActions? A majority of the Close PostIoActions are clearly errors. The rest may also be errors.

@@ -176,12 +176,12 @@ Network::IoResult TsiSocket::doRead(Buffer::Instance& buffer) {
}

end_stream_read_ = result.end_stream_read_;
read_error_ = result.action_ == Network::PostIoAction::Close;
read_error_ = result.action_ == Network::PostIoAction::CloseError;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be:
result.action_ != Network::PostIoAction::KeepOpen;

or consider renaming read_error_ to got_close_error_

I guess one thing to keep in mind is that as of this PR RawBufferSocket can only return KeepOpen or CloseError from doRead. Should we consider all Close PostIoActions as errors?

KeepOpen
KeepOpen,
// Close the connection because of an error.
CloseError,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that a majority of Close cases are being renamed CloseError.

Should we also rename Close to CloseGraceful or similar as a way to improve our chances of catching all cases that need to be adjusted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

include/envoy/network/post_io_action.h Show resolved Hide resolved
stream_info_.setUpstreamTransportFailureReason(kUpstreamConnectionTerminationDetails);
stream_info_.setConnectionTerminationDetails(kUpstreamConnectionTerminationDetails);
stream_info_.setResponseFlag(StreamInfo::ResponseFlag::UpstreamConnectionTermination);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the high level request object query the connection as part of the termination process to determine if termination was graceful or abrupt?

} else {
stream_info_.setUpstreamTransportFailureReason(kUpstreamConnectionTerminationDetails);
stream_info_.setConnectionTerminationDetails(kUpstreamConnectionTerminationDetails);
stream_info_.setResponseFlag(StreamInfo::ResponseFlag::UpstreamConnectionTermination);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: delegate stats update to virtual method so you can avoid the dynamic cast.

stream_info_.setResponseFlag(StreamInfo::ResponseFlag::UpstreamConnectionTermination);
}
// Force "end_stream" so that filters can process this error.
result.end_stream_read_ = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually associate end_stream_read_ with graceful termination. Is it appropriate to set this? Do we need alternate signaling to notify filters about abrupt terminations via either additional arguments to push pipeline or a new virtual method which is called on abrupt termination.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; I think of end_stream as always a graceful operation. I think a different signal for error is more appropriate.

Copy link
Contributor Author

@PiotrSikora PiotrSikora Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that there is no different signal right now, and if we don't set this, then network filters won't be executed at all.

Note that this PR is meant to be a temporary fix until we have a proper solution for #13940, but that's going to require a lot more changes and time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a massive hack. How much effort would it be to implement the real fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on what do you consider a real fix. I'm not too familiar with that part of the codebase, but if we want to propagate upstream connection events to network filters, then I imagine it's quite a lot of changes, so I think it would be better if one of maintainers could work on this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggreenway

I don't know what to suggest. It would be good to figure out how to do a catch-all cleanup of WASM resources on abnormal connection termination. Could WASM detect connection termination based on an L4 filter that does the relevant signaling on destruction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, we have a workaround in #13836 that works for Wasm.

@@ -110,7 +110,7 @@ Network::IoResult SslSocket::doRead(Buffer::Instance& read_buffer) {
if (info_->state() != Ssl::SocketState::HandshakeComplete &&
info_->state() != Ssl::SocketState::ShutdownSent) {
PostIoAction action = doHandshake();
if (action == PostIoAction::Close || info_->state() != Ssl::SocketState::HandshakeComplete) {
if (action != PostIoAction::KeepOpen || info_->state() != Ssl::SocketState::HandshakeComplete) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also change post actions in NotReadySslSocket to CloseError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -145,7 +145,7 @@ Network::PostIoAction TsiSocket::doHandshakeNextDone(NextResultPtr&& next_result
if (read_error_ || (!handshake_complete_ && end_stream_read_)) {
ENVOY_CONN_LOG(debug, "TSI: Handshake failed: end of stream without enough data",
callbacks_->connection());
return Network::PostIoAction::Close;
return Network::PostIoAction::CloseError;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 remaining uses of Network::PostIoAction::Close; in this file when finding a read 0 during handshakes. Would it make sense to also consider those cases CloseError?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed one to CloseError and one to CloseGraceful (when the certificate validation failed, since that's not and an I/O error).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggreenway

I guess the question is wherever or it is worth having the extra complexity that comes from having 2 close types when out of the 3 remaining uses 2 of them are handshake errors. I guess it's fine to use CloseGraceful for the 3 remaining cases.

@@ -242,7 +242,7 @@ Network::PostIoAction SslHandshakerImpl::doHandshake() {
return PostIoAction::KeepOpen;
default:
handshake_callbacks_->onFailure();
return PostIoAction::Close;
return PostIoAction::CloseError;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the PostIoAction::Close on line 233 above a graceful or error close?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graceful.

@@ -2699,7 +2699,7 @@ TEST_P(SslSocketTest, ShutdownWithoutCloseNotify) {
Buffer::OwnedImpl data("hello");
server_connection->write(data, false);
EXPECT_EQ(data.length(), 0);
// Calling close(FlushWrite) in onEvent() callback results in PostIoAction::Close,
// Calling close(FlushWrite) in onEvent() callback results in PostIoAction::CloseGraceful,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing some failures in //test/extensions/transport_sockets/tls:ssl_socket_test, examples:

2020-11-14T01:05:04.6013231Z Uninteresting mock function call - taking default action specified at:
2020-11-14T01:05:04.6013980Z test/mocks/stream_info/mocks.cc:30:
2020-11-14T01:05:04.6014503Z Function call: setConnectionTerminationDetails("downstream connection was terminated")

2020-11-14T01:05:04.6024240Z unknown file: Failure
2020-11-14T01:05:04.6025476Z Uninteresting mock function call - taking default action specified at:
2020-11-14T01:05:04.6026043Z test/mocks/stream_info/mocks.cc:24:
2020-11-14T01:05:04.6026487Z Function call: setResponseFlag(16384)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, see my previous comment - I didn't add mocks for them, since it's a lot of extra noise in PR for code that was still being discussed. Also, I'm not sure if all 3 values (connection termination details, upstream transport socket failure and response flags) should be set.

stream_info_.setResponseFlag(StreamInfo::ResponseFlag::UpstreamConnectionTermination);
}
// Force "end_stream" so that filters can process this error.
result.end_stream_read_ = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggreenway

I don't know what to suggest. It would be good to figure out how to do a catch-all cleanup of WASM resources on abnormal connection termination. Could WASM detect connection termination based on an L4 filter that does the relevant signaling on destruction?

@@ -145,7 +145,7 @@ Network::PostIoAction TsiSocket::doHandshakeNextDone(NextResultPtr&& next_result
if (read_error_ || (!handshake_complete_ && end_stream_read_)) {
ENVOY_CONN_LOG(debug, "TSI: Handshake failed: end of stream without enough data",
callbacks_->connection());
return Network::PostIoAction::Close;
return Network::PostIoAction::CloseError;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggreenway

I guess the question is wherever or it is worth having the extra complexity that comes from having 2 close types when out of the 3 remaining uses 2 of them are handshake errors. I guess it's fine to use CloseGraceful for the 3 remaining cases.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 17, 2020
@github-actions
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Dec 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale stalebot believes this issue/PR has not been touched recently
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants