Use QuicheMemSlice constructor with releasor. #31167

steveWang · 2023-12-04T16:43:15Z

The existing model of forwarding to the quiche::MemSliceImpl constructor is an abstraction violation and makes it hard to swap out the underlying MemSlice platform implementation. As such, we've recently added a new constructor to QuicheMemSlice (and the Impl API) that allows for an arbitrary custom releasor.

This PR uses the new constructor, guarded behind the runtime flag envoy.reloadable_features.quiche_use_mem_slice_releasor_api (disabled by default).

Since we're storing a unique_ptr in the capture list (allowable since C++14, via init-capture), we can't store this in a std::function, which requires copyability. As such, we use absl::AnyInvocable.
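
For readers skimming the thread, here is a minimal sketch of why the closure cannot go into a std::function. It builds with only the standard library, so `MoveOnlyReleasor` is a hypothetical stand-in for absl::AnyInvocable (it is not Abseil's implementation), and `makeOwningLambda` and `wrapperInvokesOwningLambda` are illustrative names that do not appear in the PR.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for absl::AnyInvocable<void(const char*)>: a minimal
// move-only, type-erased callable. This is NOT the Abseil implementation.
class MoveOnlyReleasor {
public:
  template <typename F>
  explicit MoveOnlyReleasor(F f) : impl_(std::make_unique<Model<F>>(std::move(f))) {}

  void operator()(const char* data) {
    assert(impl_ != nullptr);
    impl_->call(data);
  }

private:
  struct Concept {
    virtual ~Concept() = default;
    virtual void call(const char* data) = 0;
  };
  template <typename F> struct Model : Concept {
    explicit Model(F f) : f_(std::move(f)) {}
    void call(const char* data) override { f_(data); }
    F f_;
  };
  std::unique_ptr<Concept> impl_;
};

// Builds a releasor-shaped lambda whose capture list owns a heap object.
inline auto makeOwningLambda(int value) {
  return [owned = std::make_unique<int>(value)](const char*) {};
}
using OwningLambda = decltype(makeOwningLambda(0));

// The unique_ptr init-capture makes the closure type move-only. std::function
// requires a CopyConstructible target, so it cannot hold this closure.
static_assert(!std::is_copy_constructible<OwningLambda>::value, "closure is move-only");
static_assert(std::is_move_constructible<OwningLambda>::value, "closure can be moved");

// A move-only wrapper can store and invoke such a closure.
inline bool wrapperInvokesOwningLambda() {
  bool called = false;
  auto owned = std::make_unique<int>(7);
  MoveOnlyReleasor releasor(
      [owned = std::move(owned), &called](const char*) { called = (*owned == 7); });
  releasor(nullptr);
  return called;
}
```

Calling `wrapperInvokesOwningLambda()` exercises the wrapper end to end; the two `static_assert`s carry the actual argument about copyability.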

(I've marked this as medium risk since it interacts with Envoy's memory management model for QUIC streams and so deserves some scrutiny.)

Commit Message: Use QuicheMemSlice constructor with releasor.
Additional Description:
Risk Level: Medium
Testing: This is a refactor, so it should be covered by existing tests (test/common/buffer:buffer_test, test/common/quic:envoy_quic_{client,server}_stream_test)
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

The existing model of forwarding to the quiche::MemSliceImpl constructor is an abstraction violation and makes it hard to swap out the underlying MemSlice platform implementation. As such, we've recently added a new constructor to QuicheMemSlice (and the Impl API) that allows for an arbitrary custom releasor. This commit is a little wonky in that it only really uses a lambda releasor for its capture list to manage object lifetime, and so the lambda body is intentionally empty. Since we're storing a unique_ptr in the capture list (allowable since C++14, via init-capture), we can't store this in a std::function, which requires copyability. As such, we use absl::AnyInvocable. Signed-off-by: Steve Wang <[email protected]>

Signed-off-by: Steve Wang <[email protected]>

The custom deleter's lifetime is bound to the BufferFragmentImpl, not the Apparently when I provided the sample Envoy implementation, I had inadvertently reverted the buffer_impl changes to use absl::AnyInvocable, which were necessary to make this compile... Signed-off-by: Steve Wang <[email protected]>

steveWang · 2023-12-04T21:13:02Z

There is one failing test (test/integration/quic_protocol_integration_test) which seems to be a segfault when closing QUIC connections. Given that the focus of this PR involves QUIC connection lifetime management, I would not be surprised if it's a breakage that I need to track down.

soulxu · 2023-12-05T07:15:55Z

/assign @danzh2010

danzh2010 · 2023-12-05T23:33:31Z

> There is one failing test (test/integration/quic_protocol_integration_test) which seems to be a segfault when closing QUIC connections. Given that the focus of this PR involves QUIC connection lifetime management, I would not be surprised if it's a breakage that I need to track down.

The crash stack might be a test setup issue unrelated to this PR, but I couldn't repro it locally under an opt build with ASAN in a clean branch. If you are able to repro and need help looking into the debug log, I'd be happy to help.

steveWang · 2023-12-05T23:47:33Z

I couldn't reproduce it running locally in this branch with --compilation_mode=opt, but I did look at the stack trace in the test logs (quoted below for convenience). It didn't obviously seem related to this PR, but it still makes me nervous.

stack_decode.py didn't seem to consistently identify line numbers, even when building with --config=-gmlt to preserve debug symbols (after the first two lines, it started claiming Envoy code was coming from things like <vector>).

[critical][backtrace] [./source/server/backtrace.h:96] #1: Envoy::Http::CodecClient::onEvent() [0x15c35cf]
[critical][backtrace] [./source/server/backtrace.h:96] #2: Envoy::Network::ConnectionImplBase::raiseConnectionEvent() [0x19c595e]
[critical][backtrace] [./source/server/backtrace.h:96] #3: Envoy::Quic::QuicFilterManagerConnectionImpl::onConnectionCloseEvent() [0x1867dd8]
[critical][backtrace] [./source/server/backtrace.h:96] #4: Envoy::Quic::EnvoyQuicClientSession::OnConnectionClosed() [0x17d8f35]
[critical][backtrace] [./source/server/backtrace.h:96] #5: quic::QuicConnection::TearDownLocalConnectionState() [0x18d377f]
[critical][backtrace] [./source/server/backtrace.h:96] #6: quic::QuicConnection::TearDownLocalConnectionState() [0x18d52ee]
[critical][backtrace] [./source/server/backtrace.h:96] #7: Envoy::Quic::QuicFilterManagerConnectionImpl::closeConnectionImmediately() [0x186831b]
[critical][backtrace] [./source/server/backtrace.h:96] #8: Envoy::HttpIntegrationTest::cleanupUpstreamAndDownstream() [0xc94259]
[critical][backtrace] [./source/server/backtrace.h:96] #9: Envoy::HttpIntegrationTest::~HttpIntegrationTest() [0xc93e20]
[critical][backtrace] [./source/server/backtrace.h:96] #10: Envoy::DownstreamProtocolIntegrationTest_AddInvalidEncodedData_Test::~DownstreamProtocolIntegrationTest_AddInvalidEncodedData_Test() [0xc38602]
[critical][backtrace] [./source/server/backtrace.h:96] #11: testing::internal::HandleExceptionsInMethodIfSupported<>() [0x2081ecc]
[critical][backtrace] [./source/server/backtrace.h:96] #12: testing::TestInfo::Run() [0x2083014]
[critical][backtrace] [./source/server/backtrace.h:96] #13: testing::TestSuite::Run() [0x2083d69]
[critical][backtrace] [./source/server/backtrace.h:96] #14: testing::internal::UnitTestImpl::RunAllTests() [0x209129f]
[critical][backtrace] [./source/server/backtrace.h:96] #15: testing::internal::HandleExceptionsInMethodIfSupported<>() [0x2090d0c]
[critical][backtrace] [./source/server/backtrace.h:96] #16: testing::UnitTest::Run() [0x2090b8f]
[critical][backtrace] [./source/server/backtrace.h:96] #17: Envoy::TestRunner::runTests() [0x1956ed0]
[critical][backtrace] [./source/server/backtrace.h:96] #18: main [0x1955a9e]

danzh2010 · 2023-12-06T00:05:21Z

source/common/quic/envoy_quic_client_stream.cc

@@ -147,7 +147,11 @@ void EnvoyQuicClientStream::encodeData(Buffer::Instance& data, bool end_stream)
      // TODO(danzh): investigate the cost of allocating one buffer per slice.
      // If it turns out to be expensive, add a new function to free data in the middle in buffer
      // interface and re-design QuicheMemSliceImpl.
-      quic_slices.emplace_back(quiche::QuicheMemSlice::InPlace(), data, slice.len_);
+      auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>();


For this change, I would love a way to fall back in case it goes wrong. RUNTIME_GUARD isn't an option here because we want to compile out the old way internally. Would #ifdef SOME_PREPROCESSOR be acceptable? @alyssawilk

Do we need to compile out the old way overnight? Can we do the usual flip-wait-deprecate?

Flip-wait-deprecate takes 6 months, and we'd prefer not to wait that long.

Ah, so for functional changes you need to wait 6 months for operators to update and test, but for "I want to guard this in case it causes crashes but really things look fine" changes you can remove at your own pace. Does that help any?

As Dan mentions, we'd like to be able to change the compile-time dependencies (in particular so we no longer depend on a custom constructor in Envoy platform's QuicheMemSliceImpl).

I'm happy to leave the default behavior as-is and protect this new behavior behind a compile-time guard so everyone who doesn't care about swapping out this dependency is unaffected, but guarding it behind a runtime flag means we can't actually swap out the underlying implementation until the old way is removed.

I'm not sure how we get a fallback and keep the code in, so I was trying to offer a less painful timeline to get both in sequence.

Right. I don't think a runtime fallback is feasible for what I want to do.

@RyanTheOptimist FYI. I think we want to use a runtime guard with an ifdef. Something like...

#ifndef USE_QUICHE_MEM_SLICE_RELEASOR_API_EXCLUSIVELY
if (!RUNTIME_GUARD(use_quiche_mem_slice_releasor)) {
  // Old path: only present when the build has not opted into the releasor API exclusively.
  quic_slices.emplace_back(quiche::QuicheMemSlice::InPlace(), data, slice.len_);
} else {
#endif
  auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>();
  // ...
#ifndef USE_QUICHE_MEM_SLICE_RELEASOR_API_EXCLUSIVELY
}
#endif

(Rough pseudocode. I'll send a commit along these lines later today.)
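
To make the proposal concrete, here is a self-contained sketch of combining a compile-time opt-out with a runtime flag. Only the macro name comes from the comment above; `releasorFlag()` and `chooseSlicePath()` are hypothetical stand-ins for a runtime feature lookup and for the two slice-construction paths, not Envoy APIs. Note that the ifdef guard was later dropped from this PR in favor of an internal patch.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a runtime feature lookup; not Envoy's Runtime API.
// Returns a reference so a test can flip the flag.
inline bool& releasorFlag() {
  static bool enabled = false;
  return enabled;
}

// Reports which construction path would be taken.
inline std::string chooseSlicePath() {
#ifndef USE_QUICHE_MEM_SLICE_RELEASOR_API_EXCLUSIVELY
  // The legacy path is compiled in only when the build has not opted out of it,
  // and even then it is taken only while the runtime flag is off.
  if (!releasorFlag()) {
    return "in_place_constructor";
  }
#endif
  // Builds that define the macro always end up here, regardless of the flag.
  return "releasor_constructor";
}
```

With the macro undefined, flipping `releasorFlag()` switches between the two paths at runtime; defining the macro removes the legacy branch (and its dependency) from the binary entirely, which is the property a pure runtime guard cannot provide.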

We want to be able to compile out the existing API. Once we've put some mileage on this implementation, we can set this to be default true for other Envoy users. Signed-off-by: Steve Wang <[email protected]>

repokitteh-read-only · 2023-12-06T20:30:41Z

CC @envoyproxy/runtime-guard-changes: FYI only for changes made to (source/common/runtime/runtime_features.cc).

🐱

Caused by: #31167 was synchronize by steveWang.


We'll manage this with a short-term patch for our internal import. Signed-off-by: Steve Wang <[email protected]>

RyanTheOptimist

nit: Please mention the runtime guard in the PR description.

Did we get to the bottom of the test failures that were seen earlier?

RyanTheOptimist · 2023-12-07T16:04:31Z

source/common/quic/envoy_quic_client_stream.cc

+        single_slice_buffer->move(data, slice.len_);
+        quic_slices.emplace_back(
+            reinterpret_cast<char*>(single_slice_buffer->frontSlice().mem_), slice.len_,
+            [single_slice_buffer = std::move(single_slice_buffer)](const char*) {});


nit: Can we add a comment here to explain what's going on? At first glance, it looks like we move the data into a new unique_ptr, pass a pointer to the data as the first argument to this function, and then the local variable goes out of scope and deletes the data. But of course, that's not what happens, because in the lambda capture we std::move() that unique_ptr into a captured variable, which lives until that invocable goes out of scope when the slice is destructed. Do I have that right?

In other words, the only purpose of the callback is to capture the unique_ptr, not actually to do anything when executed (the action happens when the callback is destroyed)? We could also explicitly clear single_slice_buffer in the body of the callback if we wanted to be a bit more explicit, though I'm not sure how useful that is.

I think clearing that explicitly is inherently more readable and makes it easier to reason about the lifetime (since then it's tied to the invocation of the releasor, as opposed to the destructor). That's not a bad idea.
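
The difference between the two variants discussed here can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `HypotheticalSlice` and `TrackedBuffer` are invented types (the former is not quiche::QuicheMemSlice), and the slice deliberately exposes releasor invocation separately from releasor destruction so the two lifetimes can be told apart.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Payload that records its own destruction so the two variants can be compared.
struct TrackedBuffer {
  explicit TrackedBuffer(bool* destroyed) : destroyed_(destroyed) {}
  ~TrackedBuffer() { *destroyed_ = true; }
  char bytes_[4] = {'q', 'u', 'i', 'c'};
  bool* destroyed_;
};

// Hypothetical slice type. It stores a move-only releasor; the releasor object
// itself is destroyed only when the slice is destroyed.
class HypotheticalSlice {
public:
  template <typename Releasor>
  HypotheticalSlice(const char* data, std::size_t len, Releasor releasor)
      : data_(data), len_(len),
        releasor_(std::make_unique<Holder<Releasor>>(std::move(releasor))) {}

  void invokeReleasor() {
    assert(releasor_ != nullptr);
    releasor_->run(data_);
  }
  const char* data() const { return data_; }
  std::size_t length() const { return len_; }

private:
  struct HolderBase {
    virtual ~HolderBase() = default;
    virtual void run(const char* data) = 0;
  };
  template <typename Releasor> struct Holder : HolderBase {
    explicit Holder(Releasor releasor) : releasor_(std::move(releasor)) {}
    void run(const char* data) override { releasor_(data); }
    Releasor releasor_;
  };

  const char* data_;
  std::size_t len_;
  std::unique_ptr<HolderBase> releasor_;
};

// Variant 1: empty body. Invoking the releasor frees nothing; the buffer
// lives until the callback object (and its capture) is destroyed.
inline bool freedAfterInvokeWithEmptyBody() {
  bool destroyed = false;
  auto buffer = std::make_unique<TrackedBuffer>(&destroyed);
  const char* data = buffer->bytes_; // read before the move below
  HypotheticalSlice slice(data, 4, [buffer = std::move(buffer)](const char*) {});
  slice.invokeReleasor();
  return destroyed;
}

// Variant 2: explicit reset. The buffer is freed at the moment of invocation.
// `mutable` is needed because the body modifies the captured unique_ptr.
inline bool freedAfterInvokeWithExplicitReset() {
  bool destroyed = false;
  auto buffer = std::make_unique<TrackedBuffer>(&destroyed);
  const char* data = buffer->bytes_;
  HypotheticalSlice slice(
      data, 4, [buffer = std::move(buffer)](const char*) mutable { buffer.reset(); });
  slice.invokeReleasor();
  return destroyed;
}
```

In this sketch the first function returns false and the second returns true, which is the readability argument in a nutshell: with the explicit reset, the release is tied to the releasor call rather than to the destruction of the callback.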

soulxu · 2023-12-08T01:16:40Z

/assign @RyanTheOptimist

steveWang · 2023-12-08T15:30:15Z

> nit: Please mention the runtime guard in the PR description.

> Did we get to the bottom of the test failures that were seen earlier?

PR description updated.

I never got to the bottom of the test failure, as I could never reproduce it, and the stack trace didn't point to buffer management being the culprit. I see a few other failures:

- Coverage for source/common/quic went down. I'll add a test that explicitly enables the runtime guard.
- TSAN is failing for some LDS interaction with QUIC early data, with QUICHE_BUG failure: !one_rtt_keys_available(). I can't see any reason that would be related.
- ASAN is complaining about a use-after-free of IntegrationStreamDecoder (resetStream called after the decoder was cleaned up). I'm fairly confident this failure isn't related, since this PR only touches buffer management.

This aligns lifetime better with the intent of the releasor API. Signed-off-by: Steve Wang <[email protected]>

These are duplicated from other tests. (If it's preferable, I can instead parameterize the entire test suite, but that seemed more intrusive.) Signed-off-by: Steve Wang <[email protected]>

RyanTheOptimist · 2023-12-08T18:04:16Z

Looks like real test failures:

https://github.com/envoyproxy/envoy/actions/runs/7143515779/job/19455192451?pr=31167

https://dev.azure.com/cncf/envoy/_build/results?buildId=157293&view=logs&jobId=8bf29878-a4cc-50f7-4e84-2255e6fd4065&j=8bf29878-a4cc-50f7-4e84-2255e6fd4065&t=150a3762-112b-5713-0d28-af1eb6c2fbf4

steveWang · 2023-12-08T20:44:17Z

But only under gcc? Bizarre.

RyanTheOptimist · 2023-12-08T20:57:11Z

> But only under gcc? Bizarre.

Also Windows. :/ Wonder if there are some compiler shenanigans going on here.

This may explain why we only see failing tests on client streams, and not on the server streams. I haven't been able to reproduce the failures locally but am still looking into them. Signed-off-by: Steve Wang <[email protected]>

steveWang · 2023-12-08T21:18:39Z

I looked at compiler support for init-captures in lambdas, but that came about in C++14 and should have been supported by both gcc and msvc for 5+ years (and should certainly be in the versions that Envoy uses).

Anyways, I'm pushing a new commit that should cause envoy_quic_server_stream_test to start failing in the same way, if this is reproducible.

Signed-off-by: Steve Wang <[email protected]>

steveWang · 2023-12-11T17:27:56Z

source/common/quic/envoy_quic_client_stream.cc

+        auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>();
+        single_slice_buffer->move(data, slice.len_);
+        quic_slices.emplace_back(
+            reinterpret_cast<char*>(single_slice_buffer->frontSlice().mem_), slice.len_,


bazel -c opt --config=gcc seems to complain about nullptr here. I guess this may be due to an issue with function parameter evaluation order?

It's likely safer to just use slice.mem_ here. (That may or may not be the segfault / crash I'm seeing.)

If the lambda is created before we access frontSlice().mem_, then this may result in a nullptr dereference. The evaluation order of function arguments is unspecified in C++, so whether this happens depends on the compiler. Signed-off-by: Steve Wang <[email protected]>
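
A minimal sketch of the hazard and of the safer pattern, under stated assumptions: `HypotheticalSlice`, `makeSlice`, and `copyThroughSlice` are invented for illustration, and a `std::string` stands in for Buffer::OwnedImpl. Since the bug depends on an evaluation order the compiler is free to choose, the sketch shows the risky call only in a comment and runs the safe form.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical slice: a data pointer, a length, and a releasor that owns the storage.
template <typename Releasor> struct HypotheticalSlice {
  const char* data;
  std::size_t len;
  Releasor releasor;
};

template <typename Releasor>
HypotheticalSlice<Releasor> makeSlice(const char* data, std::size_t len, Releasor releasor) {
  assert(data != nullptr);
  return HypotheticalSlice<Releasor>{data, len, std::move(releasor)};
}

inline std::string copyThroughSlice(std::string payload) {
  auto buffer = std::make_unique<std::string>(std::move(payload));

  // Risky form (do not use):
  //   makeSlice(buffer->data(), buffer->size(),
  //             [buffer = std::move(buffer)](const char*) {});
  // The compiler may construct the lambda first, leaving `buffer` null by the
  // time buffer->data() is evaluated.

  // Safe form: read everything needed from `buffer` before the init-capture
  // moves it. The heap-allocated string itself does not move, so the pointer
  // stays valid for as long as the releasor owns it.
  const char* data = buffer->data();
  const std::size_t len = buffer->size();
  auto slice = makeSlice(data, len, [buffer = std::move(buffer)](const char*) {});
  return std::string(slice.data, slice.len);
}
```

Hoisting the pointer into a local plays the same role as switching to slice.mem_ in the comment above: the value no longer depends on an object that another argument of the same call moves from.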

RyanTheOptimist · 2023-12-11T17:40:15Z

/wait

steveWang · 2023-12-11T20:08:34Z

Failed TSAN test: UpstreamProtocols/DownstreamProtocolIntegrationTest.ManyLargeRequestHeadersAccepted/IPv4_Http2Downstream_Http3UpstreamHttpParserNghttp2NoDeferredProcessingLegacy

I don't expect this to be related, since the new logic should be runtime-guard-disabled in most tests.

The other previously-failing presubmits are now passing (gcc, windows, coverage), so I'm inclined to guess this is a flaky test.

steveWang added 2 commits December 4, 2023 11:28

Remove names for unused parameters.

4d47164

Signed-off-by: Steve Wang <[email protected]>

steveWang marked this pull request as draft December 4, 2023 18:08

steveWang marked this pull request as ready for review December 4, 2023 20:36

repokitteh-read-only bot assigned danzh2010 Dec 5, 2023

danzh2010 reviewed Dec 6, 2023


Add runtime and compile-time guard.

d9fa5b7

We want to be able to compile out the existing API. Once we've put some mileage on this implementation, we can set this to be default true for other Envoy users. Signed-off-by: Steve Wang <[email protected]>

Remove ifdef guard.

297eea0

We'll manage this with a short-term patch for our internal import. Signed-off-by: Steve Wang <[email protected]>

RyanTheOptimist reviewed Dec 7, 2023


repokitteh-read-only bot assigned RyanTheOptimist Dec 8, 2023

steveWang added 2 commits December 8, 2023 10:59

Clean up memslice storage on calling releasor.

19aa02f

This aligns lifetime better with the intent of the releasor API. Signed-off-by: Steve Wang <[email protected]>

Add tests enabling releasor API runtime guard.

f7863c4

These are duplicated from other tests. (If it's preferable, I can instead parameterize the entire test suite, but that seemed more intrusive.) Signed-off-by: Steve Wang <[email protected]>

RyanTheOptimist previously approved these changes Dec 8, 2023


danzh2010 approved these changes Dec 8, 2023


RyanTheOptimist enabled auto-merge (squash) December 8, 2023 16:30

Call encodeData in envoy_quic_server_stream_test.

9736fa4

This may explain why we only see failing tests on client streams, and not on the server streams. I haven't been able to reproduce the failures locally but am still looking into them. Signed-off-by: Steve Wang <[email protected]>

auto-merge was automatically disabled December 8, 2023 21:22
Head branch was pushed to by a user without write access

steveWang dismissed RyanTheOptimist’s stale review via 9736fa4 December 8, 2023 21:22

Merge branch 'main' into quiche-buffer

42a1222

Signed-off-by: Steve Wang <[email protected]>

steveWang commented Dec 11, 2023


Address use-after-move of single_slice_buffer.

44af392

If the lambda is created before we access frontSlice().mem_, then this may result in a nullptr dereference. The evaluation order of function arguments is unspecified in C++, so whether this happens depends on the compiler. Signed-off-by: Steve Wang <[email protected]>

repokitteh-read-only bot added waiting and removed waiting labels Dec 11, 2023

steveWang requested a review from RyanTheOptimist December 12, 2023 18:42

RyanTheOptimist approved these changes Dec 12, 2023


RyanTheOptimist merged commit 83eaa8a into envoyproxy:main Dec 12, 2023
53 checks passed

adisuissa mentioned this pull request Mar 6, 2024

Delete unused runtime flag. #32739

Merged
