Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The requests may fail when on demand CDS returns clusters #20873

Open
lambdai opened this issue Apr 18, 2022 · 2 comments
Open

The requests may fail when on demand CDS returns clusters #20873

lambdai opened this issue Apr 18, 2022 · 2 comments

Comments

@lambdai
Copy link
Contributor

lambdai commented Apr 18, 2022

Description:

Upon OnDemand cluster returns an available cluster, the requests waiting for that cluster may fail due to hosts are not added to the cluster.

Detected by Test case TcpProxyOdcdsIntegrationTest, SingleTcpClient
https://github.com/envoyproxy/envoy/blob/main/test/integration/tcp_proxy_odcds_integration_test.cc#L130

Background
Currently a cluster is fully functional after cluster is warmed up and host members is propagated to worker thread.

The former enables obtain a ThreadLocalCluster by the name of the cluster.
The latter supports LB when a upstream connection is needed by a router.

Prior to on-demand CDS, the two phases are distinguished by the error details but not many users need to understand the concrete reason.

However, in on-demand CDS, the expectation is a little different. The downstream filter is expected to be waiting until the cluster is fully ready.

Root Cause

From main thread perspective, the first host member update and the resumption of the router filter are concurrent at work threads.
Chances are the router filter are resumed before the first member is delivered,
thus the first bunch of requests using on-demand CDS are failing because of "no healthy upstream host".

Proposal
I am considering adding another API to cluster manager, namely

NewHostCallback ThreadLocalCluster::waitForNewHost()

This new function can be deemed as an extended
ClusterDiscoveryCallbackHandlePtr requestOnDemandClusterDiscovery()
that addressed the issue.

The current requestOnDemandClusterDiscovery calls this new waitForNewHost()
and hide the details of first host update.

This API could be adopted even if the cluster is not on-demand. There are known cases that all the hosts are removed during the cluster update and retry policy is not helping

Alternatives
Consider the above unlucky sequences as a known failure and improve the each retry policy (of each protocol) to handle it.
Currently TcpProxy and HCM fail fast on this condition.

@alyssawilk
Copy link
Contributor

cc @adisuissa @htuch

@alyssawilk alyssawilk removed the triage Issue requires triage label Apr 25, 2022
@htuch
Copy link
Member

htuch commented Apr 26, 2022

@krnowak

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants