Description:
When on-demand CDS returns an available cluster, the requests waiting for that cluster may still fail because hosts have not yet been added to the cluster.
Detected by test case TcpProxyOdcdsIntegrationTest, SingleTcpClient:
https://github.com/envoyproxy/envoy/blob/main/test/integration/tcp_proxy_odcds_integration_test.cc#L130
Background
Currently a cluster is fully functional only after two phases complete: the cluster is warmed up, and its host members are propagated to the worker threads.
The former makes it possible to obtain a ThreadLocalCluster by the cluster name.
The latter enables load balancing when a router needs an upstream connection.
Prior to on-demand CDS, the two phases were distinguishable only through error details, and few users needed to understand the concrete reason.
However, with on-demand CDS the expectation is a little different: the downstream filter is expected to wait until the cluster is fully ready.
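For concreteness, here is a minimal sketch of the two checks against the upstream interfaces (the pickHost helper and the lb_context parameter are illustrative; exact signatures may differ across Envoy versions):

```cpp
#include "envoy/upstream/cluster_manager.h"

// Hypothetical helper showing the two readiness phases. Phase 1 can
// succeed while phase 2 has not yet happened on this worker thread.
Envoy::Upstream::HostConstSharedPtr
pickHost(Envoy::Upstream::ClusterManager& cm, absl::string_view cluster_name,
         Envoy::Upstream::LoadBalancerContext* lb_context) {
  // Phase 1: warm-up. Only a warmed cluster is visible by name.
  Envoy::Upstream::ThreadLocalCluster* tlc =
      cm.getThreadLocalCluster(cluster_name);
  if (tlc == nullptr) {
    return nullptr; // Cluster unknown or still warming.
  }
  // Phase 2: host membership. If the first host update has not been
  // propagated to this worker yet, the load balancer has no host to pick.
  return tlc->loadBalancer().chooseHost(lb_context);
}
```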
Root Cause
From the main thread's perspective, the first host member update and the resumption of the router filter are concurrent events on the worker threads.
Chances are the router filter is resumed before the first member update is delivered,
so the first batch of requests using on-demand CDS fails with "no healthy upstream host".
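The race can be pictured with the dispatcher post pattern the main thread uses to reach workers; the two handler functions below are hypothetical stand-ins for the real update paths:

```cpp
#include "envoy/event/dispatcher.h"

// Hypothetical stand-ins for the real worker-side update paths.
void resumeWaitingRouterFilter();
void applyFirstHostUpdate();

// Event A: resume the router filter that is waiting on discovery.
void onClusterWarmed(Envoy::Event::Dispatcher& worker_dispatcher) {
  worker_dispatcher.post([] { resumeWaitingRouterFilter(); });
}

// Event B: deliver the first host members to this worker.
void onFirstEndpointUpdate(Envoy::Event::Dispatcher& worker_dispatcher) {
  worker_dispatcher.post([] { applyFirstHostUpdate(); });
}

// Nothing orders A after B on a given worker: if A runs first, the resumed
// router chooses from an empty host set and the request fails with
// "no healthy upstream host".
```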
Proposal
I am considering adding another API to the cluster manager, namely

NewHostCallback ThreadLocalCluster::waitForNewHost()

This new function can be deemed an extension of

ClusterDiscoveryCallbackHandlePtr requestOnDemandClusterDiscovery()

that addresses the issue. The current requestOnDemandClusterDiscovery would call the new waitForNewHost()
and hide the details of the first host update from its callers.
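To make the intended layering concrete, a hedged sketch follows, assuming waitForNewHost() registers a one-shot callback fired on the first host update delivered to the worker (the exact signature above is still open). ClusterDiscoveryCallback and ClusterDiscoveryStatus are the existing on-demand CDS types; requestDiscoveryImpl() is invented for illustration:

```cpp
#include <string>

#include "envoy/upstream/cluster_manager.h"

// Sketch only: for an available cluster, discovery resumption is chained
// behind the proposed waitForNewHost() instead of firing immediately.
Envoy::Upstream::ClusterDiscoveryCallbackHandlePtr
requestOnDemandClusterDiscoverySketch(
    Envoy::Upstream::ClusterManager& cm, absl::string_view name,
    Envoy::Upstream::ClusterDiscoveryCallback cb) {
  std::string cluster(name);
  // requestDiscoveryImpl() is a hypothetical stand-in for the existing
  // discovery plumbing that runs the callback once the cluster is warmed.
  return requestDiscoveryImpl(
      cluster, [&cm, cluster, cb](Envoy::Upstream::ClusterDiscoveryStatus s) {
        if (s != Envoy::Upstream::ClusterDiscoveryStatus::Available) {
          cb(s); // Missing or timed-out discovery resumes immediately.
          return;
        }
        // New step: hide the first host update from the caller; resume
        // only once this worker has at least one host to choose from.
        cm.getThreadLocalCluster(cluster)->waitForNewHost(
            [cb, s]() { cb(s); });
      });
}
```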
This API could be adopted even when the cluster is not on-demand. There are known cases where all hosts are removed during a cluster update and the retry policy does not help.
Alternatives
Treat the above unlucky sequence as a known failure and improve the retry policy of each protocol to handle it.
Currently both TcpProxy and HCM fail fast on this condition.