Add retries for kubeadm join / UpdateStatus #2092
Comments
During the update status phase, we do the following 3 API calls:
The first question here is: should we consider a timeout for the whole phase or per API call? Having too big timeouts on a per-operation basis might frustrate end users, while having too short a timeout will cause failures of the kind seen by the Cluster API folks.
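For illustration only, a minimal sketch of the two options using context deadlines; the durations and the doCall helper are hypothetical, not kubeadm's actual code:

```go
package main

import (
	"context"
	"time"
)

// doCall stands in for one of the API calls in the phase; hypothetical helper.
func doCall(ctx context.Context) error { return nil }

func updateStatusPerPhase() error {
	// Option A: one deadline covering the whole phase (all of the calls).
	ctx, cancel := context.WithTimeout(context.Background(), 40*time.Second)
	defer cancel()
	for i := 0; i < 3; i++ {
		if err := doCall(ctx); err != nil {
			return err
		}
	}
	return nil
}

func updateStatusPerCall() error {
	// Option B: a fresh, shorter deadline per API call.
	for i := 0; i < 3; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
		err := doCall(ctx)
		cancel()
		if err != nil {
			return err
		}
	}
	return nil
}
```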
The ticket in CAPI proposed that CAPI should provide some metrics in terms of retries, yet this level of granularity will be hard to scope for them.
My vote goes for per-API call.
There is no sane answer for this. A common exponential backoff capping around 40 seconds makes sense to me for general API calls. BTW, at this point we seem to be applying a number of different backoffs, timeouts, and retry mechanics in different places, which is increasing the tech debt in kubeadm.
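A rough sketch of such a backoff using the k8s.io/apimachinery wait package; the concrete numbers are assumptions chosen to land near the 40-second cap mentioned above:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Illustrative backoff: sleeps of roughly 0.6s, 1.2s, 2.4s, ... between
// attempts, accumulating close to 40s in the worst case.
var apiCallBackoff = wait.Backoff{
	Duration: 600 * time.Millisecond,
	Factor:   2.0,
	Jitter:   0.1,
	Steps:    7, // up to 7 attempts, ~38s of accumulated sleep
}

// retryAPICall retries doCall until it succeeds or the backoff is exhausted.
func retryAPICall(doCall func() error) error {
	return wait.ExponentialBackoff(apiCallBackoff, func() (bool, error) {
		if err := doCall(); err != nil {
			fmt.Printf("API call failed, retrying: %v\n", err)
			return false, nil // treat every error as transient in this sketch
		}
		return true, nil
	})
}
```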
I have no strong opinions about applying retries per call or per phase; considering the need for backporting, I'm +1 for the simplest solution during this iteration. FYI, the current approach in clusterctl is to have retry loops for small groups of API calls (not for a single call), and everything is standardized around three backoff configurations:
There are also special timeouts that apply to critical steps of the process (similar to waiting for the API server or waiting for TLS bootstrap in kubeadm).
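To make the group-retry pattern concrete, here is a hypothetical sketch of a retry loop around a small group of API calls in the clusterctl style, using client-go's polling helper; the interval, timeout, and ConfigMap names are illustrative, not clusterctl's actual code:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// updateKubeadmConfig retries a read-modify-write group as a single unit:
// if any call in the group fails, the whole group is attempted again.
func updateKubeadmConfig(cs kubernetes.Interface) error {
	return wait.PollImmediate(2*time.Second, 30*time.Second, func() (bool, error) {
		cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(
			context.TODO(), "kubeadm-config", metav1.GetOptions{})
		if err != nil {
			return false, nil // transient failure; retry the whole group
		}
		// ... mutate cm.Data here ...
		if _, err := cs.CoreV1().ConfigMaps("kube-system").Update(
			context.TODO(), cm, metav1.UpdateOptions{}); err != nil {
			return false, nil
		}
		return true, nil
	})
}
```

Retrying the group rather than each call keeps the read and write consistent: a stale ConfigMap read is never paired with a later write from a different attempt.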
I'd like to help with this.
@xlgao-zju hi, code freeze for 1.19 is June 25th. |
Will send the PR before June 15th, so that you reviewers will have enough time to review it.
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version: v1.17.*
What happened?
While executing Cluster API tests, kubeadm join failures were observed in some cases when updating the kubeadm-config ConfigMap.
xref kubernetes-sigs/cluster-api#2769
What you expected to happen?
To make update status more resilient by adding a retry loop to this operation
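A minimal sketch of what such a retry loop could look like with client-go's retry helper, treating any error as retriable; the helper wiring and backoff values are assumptions, not the actual fix:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateStatusWithRetry wraps the kubeadm-config update in a retry loop so
// that a temporary load-balancer blackout does not fail the whole join.
func updateStatusWithRetry(cs kubernetes.Interface) error {
	backoff := wait.Backoff{
		Duration: 1 * time.Second,
		Factor:   2.0,
		Jitter:   0.1,
		Steps:    5,
	}
	retriable := func(err error) bool { return true } // retry on any error
	return retry.OnError(backoff, retriable, func() error {
		cm, err := cs.CoreV1().ConfigMaps("kube-system").Get(
			context.TODO(), "kubeadm-config", metav1.GetOptions{})
		if err != nil {
			return err
		}
		// ... update the ClusterStatus entry in cm.Data here ...
		_, err = cs.CoreV1().ConfigMaps("kube-system").Update(
			context.TODO(), cm, metav1.UpdateOptions{})
		return err
	})
}
```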
How to reproduce it (as minimally and precisely as possible)?
This error happens only sometimes, most probably when there is a temporary blackout of the load balancer that sits in front of the API servers (HAProxy reloading its configuration).
The error might also happen when the new API server enters the load-balancing pool but the underlying etcd member is not yet available, due to a slow network or slow I/O delaying etcd coming online, or, in some cases, a change of the etcd leader.
Anything else we need to know?
Important: if possible, the change should be kept as small as possible and backported.