Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support asynchronous host picking #36819

Open
alyssawilk opened this issue Oct 24, 2024 · 12 comments
Open

Support asynchronous host picking #36819

alyssawilk opened this issue Oct 24, 2024 · 12 comments
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue

Comments

@alyssawilk
Copy link
Contributor

Envoy currently has a "problem" in that we're hacking around the problem of doing asynchronous host selection is by doing it in filters, and filters are a sub-optimal place to do this.

An example is the DFP filter. Today you pause processing in the DFP, do an async DNS lookup, add a host, then force-select that host in the DFP load balancer. This works less well for retries - imagine between the initial request going out there's a network change and we want to reresolve DNS. We can pause upcoming requests and block them at the DFP but then any retries (which are done by the router and the DFP filter has no input on) can't get the benefit of that filter.

We have other upcoming cases where we planned to use DEP to do host resolution/creation/selection in a filter that have the same problem - when we retry requests we can't invoke the filter chain because headers have already passed. I see three options here - we could add some hook on retries to let filters have input (which seems pretty fundamentally hacky), we could make chooseHost optionally asynchronous, so if we need to do an RPC to let a sidecar choose, or DNS lookup or some such, we can consistently do it on retries, or we can move host selection to the upstream request and let upstream filters do the resolution. I think that option (2) is architecturally optimal as I think host selection/creation really should be an artifact of LB, so I propose a multi-phase approach where

  1. we split out host selection from connection pool selection and
  2. make host selection an optionally asynchronous option so we can have LBs that do external communication before creating/selecting a host
@alyssawilk
Copy link
Contributor Author

@ggreenway
Copy link
Contributor

I agree with option 2 being ideal, but I'm guessing it will be a pretty big change to implement.

@mattklein123
Copy link
Member

I agree that is ideal and I agree it will be a lot of work, but sgtm!

@wbpcode
Copy link
Member

wbpcode commented Oct 25, 2024

I doubt if this a appropriate way to continue. Note, the core of load balancing is not to choose a host (host is an abtraction or wrapper), but to choose a connection to an address.

And the process of connection creation is async already.

Condsidering the fact the host in the Envoy now cannot be treat as a single address (multiple address support, logic DNS host, etc), so, I think it would be better to do this when we try to create a connection. This process is async anyway, we can try to enhance this to force a new DNS resolving and update address list of host (like the logic DNS host) before actually start a TCP connection.

By this way, we needn't to change the current flow, and the most important benefit is, the most wide users, who use the Envoy with EDS or STATIC clusters, won't be effected by this new feature.

@wbpcode
Copy link
Member

wbpcode commented Oct 25, 2024

I become more and more sensitive on new feature to core code. I will always think whether this could benefit the most users or we should consider it in other way which may be is more complex to the specific feature's users but won't effect the most geneal feature's users.

@alyssawilk
Copy link
Contributor Author

So for DFP we could probably find a hacky way to handle asynchronous DNS lookup in the connect, but I think in general for pluggable LB there would be value in being able to consult a sidecar for destination host. I don't think it's a google-specific feature I think async host selection would be a valuable addition to the LB extension point.
If we opted for what I think is the hackier one off solution where would you suggest we stick the DNS lookup in the mix? The async connect is handled pretty deep in the connection pool which doesn't know anything about lb-specific plugins.

@wbpcode
Copy link
Member

wbpcode commented Oct 28, 2024

So for DFP we could probably find a hacky way to handle asynchronous DNS lookup in the connect, but I think in general for pluggable LB there would be value in being able to consult a sidecar for destination host. I don't think it's a google-specific feature I think async host selection would be a valuable addition to the LB extension point.

I 100% agree that is a great feature. We also have similar requirement (consult external metrics before select a host), recently, but I still not sure if it derserve that (I am not completely object this feature, but worry it may effect our core).

If we opted for what I think is the hackier one off solution where would you suggest we stick the DNS lookup in the mix? The async connect is handled pretty deep in the connection pool which doesn't know anything about lb-specific plugins.

I am not sure, but is it possible to refresh the DNS when call the newStream() of generic pool? The upstream generic pool self is also an extension point.

@alyssawilk
Copy link
Contributor Author

I 100% agree that is a great feature. We also have similar requirement (consult external metrics before select a host), recently, but I still not sure if it derserve that (I am not completely object this feature, but worry it may effect our core).

Ahh, so I definitely want to make it as minimally invasive as I can, and that's a reasonable thing to be concerned about. Here's some more details in the hopes I can allay that concern

So my plan here was to

  1. split the single call into 2 calls as I did in the initial PR. This should be a no-op PR modulo standard "did I do the refactor right" risk.
  2. change chooseHost to returning either a host, or a detachable async handle, akin to what we do for most of our async calls. All currently LBs would return a host. Any current callers of chooseHost getting an async handle would immediately cancel and treat it as a host selection error since it's not yet supported. This is pretty much no risk as it's a return change where no one is using the new option
  3. change the 2 chooseHost call in router.cc to handle async returns. Write an async round robin load balancer for tests, and change to be the default load balancer for HTTP so almost* all HTTP e2e tests exercise this code path. async LB for TCP would be fairly straight-forward but as there's no demand yet and I think it'd be fairly simple as a follow-up, it would be implemented when needed. This is pretty much no risk as nothing but test code is using the new path.

The only work for folks with existing LB plugins would be to change the function signature of their chooseHost function from "return host" to "return host or handle" but would always return host, and incur no additional risk. The only risk I think is taken on by folks who opt into using the new async functionality which would presumably be considered alpha until stress tested in prod.

The PR I have posted is (1) and I have a hacky version of (3) which skips (2). If it'd help I could prototype (2) working just for most of the protocol tests so you could see what the rest looks like, but if possible I'd like to make sure we're aligned before doing that work. The corner cases of integration tests currently not working are 1) TCP proxy (async not implemented, so needs default RR) 2) simtime tests (they don't call alarms until time is manually advanced and async rr currently uses an alarm) and 3) flow control tests which assert backup based on timing (because when we do async host selection we stop-all-iteration which changes flow control subtly, so test expectations aren't met)

@wbpcode
Copy link
Member

wbpcode commented Oct 29, 2024

change chooseHost to returning either a host, or a detachable async handle, akin to what we do for most of our async calls. All currently LBs would return a host. Any current callers of chooseHost getting an async handle would immediately cancel and treat it as a host selection error since it's not yet supported. This is pretty much no risk as it's a return change where no one is using the new option

I think my main concern still is that I am not sure if this deserve this complexity. Esp considering that is complex enough. And I will prefer the unified way to do that. If we prepare go to async lb world, then I prefer to refactor all current implementations to use the async API to avoid to maintain two different mechanisms.

Btw, could you give this suggesion I am not sure, but is it possible to refresh the DNS when call the newStream() of generic pool? The upstream generic pool self is also an extension point. a think?
If we treat the host as a multiple addresses wrapper, I think maybe here could be a good position to refresh/select address.

@alyssawilk
Copy link
Contributor Author

I think the general deciding principle for Envoy features is if it has value we go for it. I'm think with Matt and Greg's thumbs up we've passed the check of "is this google specific (no)" so I'm going to focus on implementation.

Where I understand that for general principles a single API seems simpler, I think in this case it ends up adding unnecessary risk and complexity. I'm not saying I'm unwilling to go this route - it I think it'd be a fairly straight-forward addition to my PR chain, but I want to discuss pros and cons because I think it's net negative.

For stream selection we chose to either return an async handle (and stash it to cancel) or synchronously call the callback. This has caused probably a dozen crashing bugs over time, because 1) when the handle is returned you have to reference the handle before even knowing if you can store the handle in this->cancelable_ because "this" may already be deleted. Basically the synchronous callback made it really hard to reason about lifetime and caused a bunch of issues where we referenced the class in unwinding.

If we used the async API when chooseHost synchronously connected the you could end up with a situation where you had
[existing complicated downstream processing][router processing headers][connection manager starting host selection][load balancer doing host selection][router async callback continuing][upstream filter chain kicking in][upstream local reply causing stream teardown]
If you imagine there being a prefetched connection, the early reply and stream deletion causing the connection to go away, you end up with a deep stack, and much more code which has to handle both the downstream stream state and upstream stream and connection state going away safely.

conversely in the "return host or return cancellable" scenario, if you return a host the selection logic is completely unwound, so we have a comparable stack to what we had before with [start router decode headers] calling [finish router decode headers] instead of it being in one function. In the async case, we're out of much of the stack of the choose host logic (because we'd be at the tail end of "wait for sidecar or other thread to make final decision") and the few LBs which use it will be written knowing lifetime constraints.

Again I can be talked into going the other way, especially given the number of e2e test we have to cover things and the fact it'd be runtime guarded this is totally the type of thing we could start false by default and test in prod, so let me know your thoughts!

@wbpcode
Copy link
Member

wbpcode commented Oct 31, 2024

For stream selection we chose to either return an async handle (and stash it to cancel) or synchronously call the callback. This has caused probably a dozen crashing bugs over time, because 1) when the handle is returned you have to reference the handle before even knowing if you can store the handle in this->cancelable_ because "this" may already be deleted. Basically the synchronous callback made it really hard to reason about lifetime and caused a bunch of issues where we referenced the class in unwinding.

If we used the async API when chooseHost synchronously connected the you could end up with a situation where you had
[existing complicated downstream processing][router processing headers][connection manager starting host selection][load balancer doing host selection][router async callback continuing][upstream filter chain kicking in][upstream local reply causing stream teardown]
If you imagine there being a prefetched connection, the early reply and stream deletion causing the connection to go away, you end up with a deep stack, and much more code which has to handle both the downstream stream state and upstream stream and connection state going away safely.

This sounds very reasonable. The complex callbacks have bring enough troubles to us. (I thought of our http call which may call onFailure directly and bring much more complexity to handle the problem it brings.)

Okay, I will restart the review of #36537 next week.

Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Nov 30, 2024
@wbpcode wbpcode added no stalebot Disables stalebot from closing an issue and removed stale stalebot believes this issue/PR has not been touched recently labels Dec 2, 2024
alyssawilk added a commit that referenced this issue Dec 10, 2024
#36537)

This is the first step in a planned refactor to allow asynchronous host
selection

Risk Level: high (core code refactor)
Testing: existing tests pass
Docs Changes: n/a
Release Notes: n/a (until async selection is supported)

#36819

---------

Signed-off-by: Alyssa Wilk <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

No branches or pull requests

4 participants