Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use QuicheMemSlice constructor with releasor. #31167

Merged
merged 10 commits into from
Dec 12, 2023

Conversation

steveWang
Copy link
Contributor

@steveWang steveWang commented Dec 4, 2023

The existing model of forwarding to the quiche::MemSliceImpl constructor is an abstraction violation and makes it hard to swap out the underlying MemSlice platform implementation. As such, we've recently added a new constructor to QuicheMemSlice (and the Impl API) that allows for an arbitrary custom releasor.

This PR uses the new constructor, guarded behind the runtime flag envoy.reloadable_features.quiche_use_mem_slice_releasor_api (disabled by default).

Since we're storing a unique_ptr in the capture list (allowable since C++17), we can't store this in a std::function which requires copyability. As such, we use absl::AnyInvocable.

(I've marked this as medium risk since it interacts with Envoy's memory management model for QUIC streams and so deserves some scrutiny.)

Commit Message: Use QuicheMemSlice constructor with releasor.
Additional Description:
Risk Level: Medium
Testing: This is a refactor, so it should be covered by existing tests (test/common/buffer:buffer_test, test/common/quic:envoy_quic_{client,server}_stream_test)
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

The existing model of forwarding to the quiche::MemSliceImpl
constructor is an abstraction violation and makes it hard to swap out
the underlying MemSlice platform implementation. As such, we've
recently added a new constructor to QuicheMemSlice (and the Impl API)
that allows for an arbitrary custom releasor.

This commit is a little wonky in that it only really uses a lambda
releasor for its capture list to manage object lifetime, and so the
lambda body is intentionally empty.

Since we're storing a unique_ptr in the capture list (allowable since
C++17), we can't store this in a std::function which requires
copyability. As such, we use absl::AnyInvocable.

Signed-off-by: Steve Wang <[email protected]>
@steveWang steveWang marked this pull request as draft December 4, 2023 18:08
The custom deleter's lifetime is bound to the BufferFragmentImpl, not
the

Apparently when I provided the sample Envoy implementation, I had
inadvertently reverted the buffer_impl changes to use
absl::AnyInvocable, which were necessary to make this compile...

Signed-off-by: Steve Wang <[email protected]>
@steveWang steveWang marked this pull request as ready for review December 4, 2023 20:36
@steveWang
Copy link
Contributor Author

There is one failing test (test/integration/quic_protocol_integration_test) which seems to be a segfault when closing QUIC connections. Given that the focus of this PR involves QUIC connection lifetime management, I would not be surprised if it's a breakage that I need to track down.

@soulxu
Copy link
Member

soulxu commented Dec 5, 2023

/assign @danzh2010

@danzh2010
Copy link
Contributor

There is one failing test (test/integration/quic_protocol_integration_test) which seems to be a segfault when closing QUIC connections. Given that the focus of this PR involves QUIC connection lifetime management, I would not be surprised if it's a breakage that I need to track down.

The crash stack might be a test setup issue irrelevant to this PR, but I couldn't repro it locally under opt build with ASAN in a clean branch. If you are able to repro and need help to look into the debug log, I'd be happy to help.

@steveWang
Copy link
Contributor Author

steveWang commented Dec 5, 2023

I couldn't reproduce it running locally in this branch with --compilation_mode=opt, but I did look at the stack trace in the test logs (quoted below for convenience). It didn't obviously seem related to this PR, but it still makes me nervous.

stack_decode.py didn't seem to consistently identify line numbers, even when building with --config=-gmlt to preserve debug symbols (after the first two lines, it started claiming Envoy code was coming from things like <vector>).

[critical][backtrace] [./source/server/backtrace.h:96] #1: Envoy::Http::CodecClient::onEvent() [0x15c35cf]
[critical][backtrace] [./source/server/backtrace.h:96] #2: Envoy::Network::ConnectionImplBase::raiseConnectionEvent() [0x19c595e]
[critical][backtrace] [./source/server/backtrace.h:96] #3: Envoy::Quic::QuicFilterManagerConnectionImpl::onConnectionCloseEvent() [0x1867dd8]
[critical][backtrace] [./source/server/backtrace.h:96] #4: Envoy::Quic::EnvoyQuicClientSession::OnConnectionClosed() [0x17d8f35]
[critical][backtrace] [./source/server/backtrace.h:96] #5: quic::QuicConnection::TearDownLocalConnectionState() [0x18d377f]
[critical][backtrace] [./source/server/backtrace.h:96] #6: quic::QuicConnection::TearDownLocalConnectionState() [0x18d52ee]
[critical][backtrace] [./source/server/backtrace.h:96] #7: Envoy::Quic::QuicFilterManagerConnectionImpl::closeConnectionImmediately() [0x186831b]
[critical][backtrace] [./source/server/backtrace.h:96] #8: Envoy::HttpIntegrationTest::cleanupUpstreamAndDownstream() [0xc94259]
[critical][backtrace] [./source/server/backtrace.h:96] #9: Envoy::HttpIntegrationTest::~HttpIntegrationTest() [0xc93e20]
[critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::DownstreamProtocolIntegrationTest_AddInvalidEncodedData_Test::~DownstreamProtocolIntegrationTest_AddInvalidEncodedData_Test() [0xc38602]
[critical][backtrace] [./source/server/backtrace.h:96] #11: testing::internal::HandleExceptionsInMethodIfSupported<>() [0x2081ecc]
[critical][backtrace] [./source/server/backtrace.h:96] #12: testing::TestInfo::Run() [0x2083014]
[critical][backtrace] [./source/server/backtrace.h:96] #13: testing::TestSuite::Run() [0x2083d69]
[critical][backtrace] [./source/server/backtrace.h:96] #14: testing::internal::UnitTestImpl::RunAllTests() [0x209129f]
[critical][backtrace] [./source/server/backtrace.h:96] #15: testing::internal::HandleExceptionsInMethodIfSupported<>() [0x2090d0c]
[critical][backtrace] [./source/server/backtrace.h:96] #16: testing::UnitTest::Run() [0x2090b8f]
[critical][backtrace] [./source/server/backtrace.h:96] #17: Envoy::TestRunner::runTests() [0x1956ed0]
[critical][backtrace] [./source/server/backtrace.h:96] #18: main [0x1955a9e]

@@ -147,7 +147,11 @@ void EnvoyQuicClientStream::encodeData(Buffer::Instance& data, bool end_stream)
// TODO(danzh): investigate the cost of allocating one buffer per slice.
// If it turns out to be expensive, add a new function to free data in the middle in buffer
// interface and re-design QuicheMemSliceImpl.
quic_slices.emplace_back(quiche::QuicheMemSlice::InPlace(), data, slice.len_);
auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change I would love a way to fallback in case it goes wrong. RUNTIME_GUARD isn't an option here because we want to compile out the old way internally. Would #ifdef SOME_PREPROCESSOR be acceptable? @alyssawilk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to compile out the old way overnight? Can we do the usual flip-wait-deprecate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flip-wait-deprecate takes 6 months, and preferably we don't want to wait for that long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah so for functional changes you need to wait 6 months for operators to update and test, but for "I want to guard this in case it causes crashes but really things look fine" changes you can remove at your own pace. Does that help any?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Dan mentions, we'd like to be able to change the compile-time dependencies (in particular so we no longer depend on a custom constructor in Envoy platform's QuicheMemSliceImpl).

I'm happy to leave the default behavior as-is and protect this new behavior behind a compile-time guard so everyone who doesn't care about swapping out this dependency is unaffected, but guarding it behind a runtime flag means we can't actually swap out the underlying implementation until the old way is removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how we get a fallback and keep the code in, so I was trying to offer a less painful timeline to get both in sequence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I don't think a runtime fallback is feasible for what I want to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RyanTheOptimist FYI. I think we want to use a runtime guard with an ifdef. Something like...

#ifdef USE_QUICHE_MEM_SLICE_RELEASOR_API_EXCLUSIVELY
if (!RUNTIME_GUARD(use_quiche_mem_slice_releasor)) {
  quic_slices.emplace_back(quiche::QuicheMemSlice::InPlace(), data, slice.len_);
} else {
#endif
  auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>();
  // ...
#ifdef USE_QUICHE_MEM_SLICE_RELEASOR_API_EXCLUSIVELY
}
#endif

(Rough pseudocode. I'll send a commit along these lines later today.)

We want to be able to compile out the existing API. Once we've put
some mileage on this implementation, we can set this to be default
true for other Envoy users.

Signed-off-by: Steve Wang <[email protected]>
Copy link

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

🐱

Caused by: #31167 was synchronize by steveWang.

see: more, trace.

We'll manage this with a short-term patch for our internal import.

Signed-off-by: Steve Wang <[email protected]>
Copy link
Contributor

@RyanTheOptimist RyanTheOptimist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please mention the runtime guard in the PR description.

Did we get to the bottom of the test failures that were seen earlier?

single_slice_buffer->move(data, slice.len_);
quic_slices.emplace_back(
reinterpret_cast<char*>(single_slice_buffer->frontSlice().mem_), slice.len_,
[single_slice_buffer = std::move(single_slice_buffer)](const char*) {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we add a comment here to explain what's going on. At first glance, it looks like we move the data into a new unique_ptr, and then pass a pointer to the data as the first argument to this function and then the local variable goes out of scope and deletes the data. But of course, thats' not what happens because in the lambda capture, we std::move() that unique ptr into a captured variable which lives until that invokable goes out of scope when the slice is destructed. Do I have that right?

In other words, the only purpose of the callback is to capture the unique_ptr, not actually to do anything when executed (The action happens when the callback is destroyed)? We could also explicitly clear single_slice_buffer in the body of the callback if we wanted to be a bit more explicit, though I'm not sure how useful that is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think clearing that explicitly is inherently more readable and makes it easier to reason about the lifetime (since then it's tied to the invocation of the releasor, as opposed to the destructor). That's not a bad idea.

@soulxu
Copy link
Member

soulxu commented Dec 8, 2023

/assign @RyanTheOptimist

@steveWang
Copy link
Contributor Author

nit: Please mention the runtime guard in the PR description.

Did we get to the bottom of the test failures that were seen earlier?

PR description updated.

I never got to the bottom of the test failure, as I could never reproduce it, and the stack trace didn't point to buffer management being the culprit. I see a few other failures:

  • Coverage for source/common/quic went down. I'll add a test that explicitly enables the runtime guard.
  • TSAN is failing for some LDS interaction with QUIC early data, with QUICHE_BUG failure: !one_rtt_keys_available(). I can't see any reason that would be related.
  • ASAN is complaining about a use-after-free of IntegrationStreamDecoder (resetStream called after the decoder was cleaned up). I'm fairly confident this failure isn't related, since this PR only touches buffer management.

This aligns lifetime better with the intent of the releasor API.

Signed-off-by: Steve Wang <[email protected]>
These are duplicated from other tests.

(If it's preferable, I can instead parameterize the entire test suite,
but that seemed more intrusive.)

Signed-off-by: Steve Wang <[email protected]>
RyanTheOptimist
RyanTheOptimist previously approved these changes Dec 8, 2023
@RyanTheOptimist RyanTheOptimist enabled auto-merge (squash) December 8, 2023 16:30
@steveWang
Copy link
Contributor Author

But only under gcc? Bizarre.

@RyanTheOptimist
Copy link
Contributor

But only under gcc? Bizarre.

Also windows. :/ Wonder if there are some compiler shenanigans going on here

This may explain why we only see failing tests on client streams, and
not on the server streams.

I haven't been able to reproduce the failures locally but am still
looking into them.

Signed-off-by: Steve Wang <[email protected]>
@steveWang
Copy link
Contributor Author

I looked at compiler support for init-captures in lambdas, but that came about in C++14 and should have been supported by both gcc and msvc for 5+ years (and should certainly be in the versions that Envoy uses).

Anyways, I'm pushing a new commit that should cause envoy_quic_server_stream_test to start failing in the same way, if this is reproducible.

auto-merge was automatically disabled December 8, 2023 21:22

Head branch was pushed to by a user without write access

auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>();
single_slice_buffer->move(data, slice.len_);
quic_slices.emplace_back(
reinterpret_cast<char*>(single_slice_buffer->frontSlice().mem_), slice.len_,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bazel -c opt --config=gcc seems to complain about nullptr here. I guess this may be due to an issue with function parameter evaluation order?

It's likely safer to just use slice.mem_ here. (That may or may not be the segfault / crash I'm seeing.)

If the lambda is created before we access frontSlice().mem_, then this
may result in a nullptr dereference. This seems to be
implementation-defined.

Signed-off-by: Steve Wang <[email protected]>
@RyanTheOptimist
Copy link
Contributor

/wait

@steveWang
Copy link
Contributor Author

steveWang commented Dec 11, 2023

Failed TSAN test: UpstreamProtocols/DownstreamProtocolIntegrationTest.ManyLargeRequestHeadersAccepted/IPv4_Http2Downstream_Http3UpstreamHttpParserNghttp2NoDeferredProcessingLegacy

I don't expect this to be related, since the new logic should be runtime-guard-disabled in most tests.

The other previously-failing presubmits are now passing (gcc, windows, coverage), so I'm inclined to guess this is a flaky test.

@RyanTheOptimist RyanTheOptimist merged commit 83eaa8a into envoyproxy:main Dec 12, 2023
53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants