The requests may fail when on demand CDS returns clusters #20873

lambdai · 2022-04-18T23:54:53Z

Description:

When on-demand CDS returns an available cluster, the requests waiting for that cluster may fail because hosts have not yet been added to the cluster.

Detected by Test case TcpProxyOdcdsIntegrationTest, SingleTcpClient
https://github.com/envoyproxy/envoy/blob/main/test/integration/tcp_proxy_odcds_integration_test.cc#L130

Background
Currently a cluster is fully functional only after two things have happened: the cluster has warmed up, and its host members have been propagated to the worker threads.

The former makes it possible to obtain a ThreadLocalCluster by the cluster's name.
The latter supports load balancing when a router needs an upstream connection.

Prior to on-demand CDS, the two phases were distinguished only by the error details, and few users needed to understand the concrete reason.

However, with on-demand CDS, the expectation is a little different: the downstream filter is expected to wait until the cluster is fully ready.

Root Cause

From the main thread's perspective, the first host member update and the resumption of the router filter run concurrently on the worker threads.
Chances are that the router filter is resumed before the first member is delivered,
so the first batch of requests using on-demand CDS fails with "no healthy upstream host".
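The ordering problem can be illustrated with a small single-threaded model. This is only a sketch under simplifying assumptions: `Worker`, `WorkerCluster`, `pickHost`, and `simulate` are hypothetical names, not Envoy classes, and the host address is a placeholder. It treats the host update and the filter resumption as two callbacks that can land on a worker's queue in either order.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a worker's view of a cluster; not Envoy's ThreadLocalCluster.
struct WorkerCluster {
  std::vector<std::string> hosts;
};

// Hypothetical stand-in for a worker thread's event queue.
struct Worker {
  std::deque<std::function<void()>> queue;
  WorkerCluster cluster;

  void post(std::function<void()> cb) { queue.push_back(std::move(cb)); }

  // Runs the posted callbacks in FIFO order.
  void run() {
    while (!queue.empty()) {
      auto cb = std::move(queue.front());
      queue.pop_front();
      cb();
    }
  }
};

// What a resumed filter does in this model: pick a host or fail fast.
std::string pickHost(const WorkerCluster& cluster) {
  return cluster.hosts.empty() ? "no healthy upstream host" : cluster.hosts.front();
}

// Runs one ordering and returns the result the resumed request observes.
std::string simulate(bool host_update_first) {
  Worker worker;
  std::string result;
  // "10.0.0.1:80" is a placeholder address used only for illustration.
  auto host_update = [&worker] { worker.cluster.hosts.push_back("10.0.0.1:80"); };
  auto resume_filter = [&worker, &result] { result = pickHost(worker.cluster); };
  if (host_update_first) {
    worker.post(host_update);
    worker.post(resume_filter);
  } else {
    worker.post(resume_filter);
    worker.post(host_update);
  }
  worker.run();
  return result;
}
```

In this model `simulate(true)` yields the host, while `simulate(false)` yields the fail-fast error, which corresponds to the unlucky ordering described above.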

Proposal
I am considering adding another API to the cluster manager, namely

NewHostCallback ThreadLocalCluster::waitForNewHost()

This new function can be seen as an extension of
ClusterDiscoveryCallbackHandlePtr requestOnDemandClusterDiscovery()
that addresses the issue.

The current requestOnDemandClusterDiscovery would call this new waitForNewHost()
and hide the details of the first host update.
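One possible shape for such a wait, as a rough sketch only: `ModelCluster`, `addHost`, and `resumeBeforeHostUpdate` are hypothetical names, and the callback-taking signature is an assumption that differs from the signature proposed above. A real API would presumably also need a cancellation handle, similar to ClusterDiscoveryCallbackHandlePtr, which this model omits. The idea is to run the callback immediately if a host already exists, and otherwise park it until the first host is delivered.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a per-worker cluster; not Envoy's ThreadLocalCluster.
class ModelCluster {
public:
  using NewHostCallback = std::function<void(const std::string& host)>;

  // Invokes cb right away if a host exists, otherwise defers it until a host arrives.
  void waitForNewHost(NewHostCallback cb) {
    if (!hosts_.empty()) {
      cb(hosts_.front());
      return;
    }
    pending_.push_back(std::move(cb));
  }

  // Models delivery of a host member update on this worker.
  void addHost(const std::string& host) {
    hosts_.push_back(host);
    // Swap out the pending list first so callbacks may safely call waitForNewHost() again.
    std::vector<NewHostCallback> ready;
    ready.swap(pending_);
    for (auto& cb : ready) {
      cb(host);
    }
  }

private:
  std::vector<std::string> hosts_;
  std::vector<NewHostCallback> pending_;
};

// Resumes a request before the first host update and returns the host it ends up using.
std::string resumeBeforeHostUpdate() {
  ModelCluster cluster;
  std::string chosen = "pending";
  cluster.waitForNewHost([&chosen](const std::string& host) { chosen = host; });
  // The request is parked here instead of failing; the placeholder host arrives next.
  cluster.addHost("10.0.0.1:80");
  return chosen;
}
```

With this shape, the unlucky ordering no longer fails: the request resumed before the host update is held until `addHost` runs, then proceeds with that host.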

This API could be adopted even if the cluster is not on-demand. There are known cases where all the hosts are removed during a cluster update and the retry policy does not help.

Alternatives
Treat the unlucky sequence above as a known failure and improve each protocol's retry policy to handle it.
Currently TcpProxy and HCM fail fast on this condition.


alyssawilk · 2022-04-25T12:57:06Z

cc @adisuissa @htuch

htuch · 2022-04-26T04:12:39Z

@krnowak

lambdai added the bug and triage labels Apr 18, 2022

lambdai mentioned this issue Apr 22, 2022

Use on-demand cluster discovery in on-demand extension #20065

Merged

alyssawilk removed the triage label Apr 25, 2022

htuch added the area/xds, area/cluster_manager, and help wanted labels Apr 26, 2022

lambdai mentioned this issue Apr 27, 2022

On-demand DNS resolution #20562

Open
