Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

udp: support forwarding packets between workers #13086

Merged
merged 45 commits into from
Sep 23, 2020

Conversation

ggreenway
Copy link
Contributor

@ggreenway ggreenway commented Sep 14, 2020

Signed-off-by: Greg Greenway [email protected]

Commit Message: Also, a few fixes in active_quic_listener that were discovered while fixing
tests to work with this change:

  • Buffered CHLOs wouldn't be processed unless another readable event occurred on the socket. Fixed it so that some CHLOs are processed every event loop until the queue is empty.
  • Disabling a quic listener wouldn't immediately stop accepting new connections; it only stopped accepting connections that were buffered. The implementation would allow up to 16 connections to be accepted after disabling the listener. This made it tricky to make tests work consistently. Fixed by having also checking the runtime disable in onData().

Additional Description:
Risk Level: Medium
Testing: Fixed all existing tests; still need to add some new
Docs Changes:
Release Notes: This is mostly invisible to the user; will be used to build user-facing features later
[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]

Also, a few fixes in active_quic_listener that were discovered while fixing
tests to work with this change.

Signed-off-by: Greg Greenway <[email protected]>
@ggreenway
Copy link
Contributor Author

I took inspiration on this from the ConnectionBalancer code, since it's doing a similar operation.

@ggreenway
Copy link
Contributor Author

@lambdai @mattklein123 would the crash fixed in #12748 logically apply to this change as well? Eg, should I use a shared_ptr in the same way #12748 did?

Copy link
Contributor

@danzh2010 danzh2010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice packet re-route interface! I have some thought regarding to the QUIC fix. Overall looks good!

Signed-off-by: Greg Greenway <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. Flushing out a few high level comments before diving into the detailed review.

/wait

include/envoy/network/listener.h Outdated Show resolved Hide resolved
source/common/network/udp_listener_impl.cc Outdated Show resolved Hide resolved
source/common/runtime/runtime_features.cc Show resolved Hide resolved
include/envoy/network/listener.h Outdated Show resolved Hide resolved
include/envoy/server/worker.h Outdated Show resolved Hide resolved
the dispatcher for processing buffered chlos

Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM at a high level. Flushing out a few more comments. Thank you!

/wait

include/envoy/network/connection_handler.h Outdated Show resolved Hide resolved
source/common/runtime/runtime_features.cc Show resolved Hide resolved
@ggreenway
Copy link
Contributor Author

ci failure due to disk full

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines, to retry CircleCI checks, use /retest-circle.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13086 (comment) was created by @ggreenway.

see: more, trace.

@ggreenway
Copy link
Contributor Author

coverage error wasn't caused by this PR:

Code coverage 96.5 is good and higher than limit of 96.5
Checking per-extension coverage
Per-extension coverage failed:
Code coverage for source/server/admin is lower than limit of 95.5 (95.3)

Copy link
Contributor

@danzh2010 danzh2010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, alone aside the renaming request. Also can you speak of #13086 (comment)?

Network::UdpPacketWriterPtr udp_packet_writer =
std::make_unique<Quic::UdpGsoBatchWriter>(io_handle, scope);
return udp_packet_writer;
#if UDP_GSO_BATCH_WRITER_PLATFORM_SUPPORT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now GSO support is determined at runtime. But GSO code can also compile at the linux kernels that doesn't support it. I think what we want here is PLATFORM_SUPPORT_POSIX or something like that.

@ggreenway
Copy link
Contributor Author

Also can you speak of #13086 (comment)?

Sorry, I missed that earlier. Replied.

@ggreenway
Copy link
Contributor Author

@mattklein123 can you give this a final pass/approval?

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is really neat. Just a few small things from another pass through. Thank you!

/wait

@@ -147,8 +154,8 @@ class ActiveUdpListenerFactory {
* @return the ActiveUdpListener created.
*/
virtual ConnectionHandler::ActiveListenerPtr
createActiveUdpListener(ConnectionHandler& parent, Event::Dispatcher& disptacher,
Network::ListenerConfig& config) PURE;
createActiveUdpListener(uint32_t worker_id, ConnectionHandler& parent,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: worker_index, here and anywhere else we still have id. Also needs doc comment update.

Network::UdpListenerCallbacks* udp_listener =
dynamic_cast<ActiveUdpListenerBase*>(details.listener_.get());
ASSERT(udp_listener != nullptr);
details.typed_listener_ = *static_cast<Network::UdpListenerCallbacks*>(udp_listener);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this static cast needed? It's the type of the variable above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That must have been added during some previous variation. Removed.

Comment on lines 62 to 63
Network::UdpListenerCallbacks* udp_listener =
dynamic_cast<ActiveUdpListenerBase*>(details.listener_.get());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the dynamic_cast needed here? This seems bad? Can you add a TODO of some type to figure out how to clean this up?

IIRC there is some mess here that we have dealt with in the past around extensions actually implementing active listener base, etc.? Is that right? It would be good to have a comment if that is the case. I know I have commented on similar things in the past. I think possible if we make QUIC/QUICHE not extension based this gets better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point.

I fixed this by promoting part of ActiveUdpListenerBase to /include, so that the factory can return that type. No more casts needed. Take a look at the commit; I had to switch inheritance of ActiveListener to virtual. If we don't like that, the same can be accomplished by making ActiveListenerImplBase into a mixin template, or templatizing whether it inherits from ActiveListener or ActiveUdpListener. Either is fine with me.

Comment on lines +344 to +346
absl::variant<absl::monostate, std::reference_wrapper<ActiveTcpListener>,
std::reference_wrapper<Network::UdpListenerCallbacks>>
typed_listener_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for my understanding, why do we need the variant? Is this just because it truly is a variant and it avoids multiple variables that are XOR? Maybe just add a small comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a safe union

Signed-off-by: Greg Greenway <[email protected]>
Signed-off-by: Greg Greenway <[email protected]>
needing unsafe dynamic_cast

Signed-off-by: Greg Greenway <[email protected]>
danzh2010
danzh2010 previously approved these changes Sep 22, 2020
mattklein123
mattklein123 previously approved these changes Sep 22, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Epic, thanks for the cleanups!

@ggreenway
Copy link
Contributor Author

coverage error wasn't caused by this PR:

Code coverage 96.5 is good and higher than limit of 96.5
Checking per-extension coverage
Per-extension coverage failed:
Code coverage for source/server/admin is lower than limit of 95.5 (95.3)

Same on latest run

@ggreenway
Copy link
Contributor Author

coverage error wasn't caused by this PR:

Code coverage 96.5 is good and higher than limit of 96.5
Checking per-extension coverage
Per-extension coverage failed:
Code coverage for source/server/admin is lower than limit of 95.5 (95.3)

Same on latest run

Actually, it looks like the stub I added is counting against coverage:

    Network::UdpListenerWorkerRouterOptRef udpListenerWorkerRouter() override {
      NOT_REACHED_GCOVR_EXCL_LINE;
    }

Signed-off-by: Greg Greenway <[email protected]>
@ggreenway
Copy link
Contributor Author

asan failure says (killed) on.a compilation step, so I think it's some resource exhaustion, not a real failure caused by this PR. @mattklein123 can you approve one more time?

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines, to retry CircleCI checks, use /retest-circle.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #13086 (comment) was created by @ggreenway.

see: more, trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants