-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use QuicheMemSlice constructor with releasor. #31167
Conversation
The existing model of forwarding to the quiche::MemSliceImpl constructor is an abstraction violation and makes it hard to swap out the underlying MemSlice platform implementation. As such, we've recently added a new constructor to QuicheMemSlice (and the Impl API) that allows for an arbitrary custom releasor. This commit is a little wonky in that it only really uses a lambda releasor for its capture list to manage object lifetime, and so the lambda body is intentionally empty. Since we're storing a unique_ptr in the capture list (allowable since C++17), we can't store this in a std::function which requires copyability. As such, we use absl::AnyInvocable. Signed-off-by: Steve Wang <[email protected]>
Signed-off-by: Steve Wang <[email protected]>
The custom deleter's lifetime is bound to the BufferFragmentImpl, not the Apparently when I provided the sample Envoy implementation, I had inadvertently reverted the buffer_impl changes to use absl::AnyInvocable, which were necessary to make this compile... Signed-off-by: Steve Wang <[email protected]>
There is one failing test (test/integration/quic_protocol_integration_test) which seems to be a segfault when closing QUIC connections. Given that the focus of this PR involves QUIC connection lifetime management, I would not be surprised if it's a breakage that I need to track down. |
/assign @danzh2010 |
The crash stack might be a test setup issue irrelevant to this PR, but I couldn't repro it locally under opt build with ASAN in a clean branch. If you are able to repro and need help to look into the debug log, I'd be happy to help. |
I couldn't reproduce it running locally in this branch with --compilation_mode=opt, but I did look at the stack trace in the test logs (quoted below for convenience). It didn't obviously seem related to this PR, but it still makes me nervous. stack_decode.py didn't seem to consistently identify line numbers, even when building with --config=-gmlt to preserve debug symbols (after the first two lines, it started claiming Envoy code was coming from things like
|
@@ -147,7 +147,11 @@ void EnvoyQuicClientStream::encodeData(Buffer::Instance& data, bool end_stream) | |||
// TODO(danzh): investigate the cost of allocating one buffer per slice. | |||
// If it turns out to be expensive, add a new function to free data in the middle in buffer | |||
// interface and re-design QuicheMemSliceImpl. | |||
quic_slices.emplace_back(quiche::QuicheMemSlice::InPlace(), data, slice.len_); | |||
auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change I would love a way to fallback in case it goes wrong. RUNTIME_GUARD isn't an option here because we want to compile out the old way internally. Would #ifdef SOME_PREPROCESSOR
be acceptable? @alyssawilk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to compile out the old way overnight? Can we do the usual flip-wait-deprecate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flip-wait-deprecate takes 6 months, and preferably we don't want to wait for that long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah so for functional changes you need to wait 6 months for operators to update and test, but for "I want to guard this in case it causes crashes but really things look fine" changes you can remove at your own pace. Does that help any?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Dan mentions, we'd like to be able to change the compile-time dependencies (in particular so we no longer depend on a custom constructor in Envoy platform's QuicheMemSliceImpl).
I'm happy to leave the default behavior as-is and protect this new behavior behind a compile-time guard so everyone who doesn't care about swapping out this dependency is unaffected, but guarding it behind a runtime flag means we can't actually swap out the underlying implementation until the old way is removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how we get a fallback and keep the code in, so I was trying to offer a less painful timeline to get both in sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I don't think a runtime fallback is feasible for what I want to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@RyanTheOptimist FYI. I think we want to use a runtime guard with an ifdef. Something like...
#ifdef USE_QUICHE_MEM_SLICE_RELEASOR_API_EXCLUSIVELY
if (!RUNTIME_GUARD(use_quiche_mem_slice_releasor)) {
quic_slices.emplace_back(quiche::QuicheMemSlice::InPlace(), data, slice.len_);
} else {
#endif
auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>();
// ...
#ifdef USE_QUICHE_MEM_SLICE_RELEASOR_API_EXCLUSIVELY
}
#endif
(Rough pseudocode. I'll send a commit along these lines later today.)
We want to be able to compile out the existing API. Once we've put some mileage on this implementation, we can set this to be default true for other Envoy users. Signed-off-by: Steve Wang <[email protected]>
We'll manage this with a short-term patch for our internal import. Signed-off-by: Steve Wang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please mention the runtime guard in the PR description.
Did we get to the bottom of the test failures that were seen earlier?
single_slice_buffer->move(data, slice.len_); | ||
quic_slices.emplace_back( | ||
reinterpret_cast<char*>(single_slice_buffer->frontSlice().mem_), slice.len_, | ||
[single_slice_buffer = std::move(single_slice_buffer)](const char*) {}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can we add a comment here to explain what's going on. At first glance, it looks like we move the data into a new unique_ptr, and then pass a pointer to the data as the first argument to this function and then the local variable goes out of scope and deletes the data. But of course, thats' not what happens because in the lambda capture, we std::move() that unique ptr into a captured variable which lives until that invokable goes out of scope when the slice is destructed. Do I have that right?
In other words, the only purpose of the callback is to capture the unique_ptr, not actually to do anything when executed (The action happens when the callback is destroyed)? We could also explicitly clear single_slice_buffer
in the body of the callback if we wanted to be a bit more explicit, though I'm not sure how useful that is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think clearing that explicitly is inherently more readable and makes it easier to reason about the lifetime (since then it's tied to the invocation of the releasor, as opposed to the destructor). That's not a bad idea.
/assign @RyanTheOptimist |
PR description updated. I never got to the bottom of the test failure, as I could never reproduce it, and the stack trace didn't point to buffer management being the culprit. I see a few other failures:
|
This aligns lifetime better with the intent of the releasor API. Signed-off-by: Steve Wang <[email protected]>
These are duplicated from other tests. (If it's preferable, I can instead parameterize the entire test suite, but that seemed more intrusive.) Signed-off-by: Steve Wang <[email protected]>
Looks like real test failures: https://github.com/envoyproxy/envoy/actions/runs/7143515779/job/19455192451?pr=31167 |
But only under gcc? Bizarre. |
Also windows. :/ Wonder if there are some compiler shenanigans going on here |
This may explain why we only see failing tests on client streams, and not on the server streams. I haven't been able to reproduce the failures locally but am still looking into them. Signed-off-by: Steve Wang <[email protected]>
I looked at compiler support for init-captures in lambdas, but that came about in C++14 and should have been supported by both gcc and msvc for 5+ years (and should certainly be in the versions that Envoy uses). Anyways, I'm pushing a new commit that should cause envoy_quic_server_stream_test to start failing in the same way, if this is reproducible. |
Head branch was pushed to by a user without write access
Signed-off-by: Steve Wang <[email protected]>
auto single_slice_buffer = std::make_unique<Buffer::OwnedImpl>(); | ||
single_slice_buffer->move(data, slice.len_); | ||
quic_slices.emplace_back( | ||
reinterpret_cast<char*>(single_slice_buffer->frontSlice().mem_), slice.len_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bazel -c opt --config=gcc seems to complain about nullptr here. I guess this may be due to an issue with function parameter evaluation order?
It's likely safer to just use slice.mem_ here. (That may or may not be the segfault / crash I'm seeing.)
If the lambda is created before we access frontSlice().mem_, then this may result in a nullptr dereference. This seems to be implementation-defined. Signed-off-by: Steve Wang <[email protected]>
/wait |
Failed TSAN test: UpstreamProtocols/DownstreamProtocolIntegrationTest.ManyLargeRequestHeadersAccepted/IPv4_Http2Downstream_Http3UpstreamHttpParserNghttp2NoDeferredProcessingLegacy I don't expect this to be related, since the new logic should be runtime-guard-disabled in most tests. The other previously-failing presubmits are now passing (gcc, windows, coverage), so I'm inclined to guess this is a flaky test. |
The existing model of forwarding to the quiche::MemSliceImpl constructor is an abstraction violation and makes it hard to swap out the underlying MemSlice platform implementation. As such, we've recently added a new constructor to QuicheMemSlice (and the Impl API) that allows for an arbitrary custom releasor.
This PR uses the new constructor, guarded behind the runtime flag
envoy.reloadable_features.quiche_use_mem_slice_releasor_api
(disabled by default).Since we're storing a unique_ptr in the capture list (allowable since C++17), we can't store this in a std::function which requires copyability. As such, we use absl::AnyInvocable.
(I've marked this as medium risk since it interacts with Envoy's memory management model for QUIC streams and so deserves some scrutiny.)
Commit Message: Use QuicheMemSlice constructor with releasor.
Additional Description:
Risk Level: Medium
Testing: This is a refactor, so it should be covered by existing tests (test/common/buffer:buffer_test, test/common/quic:envoy_quic_{client,server}_stream_test)
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a