-
Notifications
You must be signed in to change notification settings - Fork 983
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Double the node creation retry #1286
Conversation
✔️ Deploy Preview for karpenter-docs-prod ready! 🔨 Explore the source changes: 4c124a3 🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/620180071ab3fa00077b2925 😎 Browse the preview: https://deploy-preview-1286--karpenter-docs-prod.netlify.app |
@@ -59,7 +59,7 @@ func (p *InstanceProvider) Create(ctx context.Context, constraints *v1alpha1.Con | |||
if err := retry.Do( | |||
func() (err error) { instances, err = p.getInstances(ctx, ids); return err }, | |||
retry.Delay(1*time.Second), | |||
retry.Attempts(3), | |||
retry.Attempts(6), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this grow exponentially? It would also be nice to reduce the delay to something like p50 propagation time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm -- maybe we don't want to increase because the first call is 0 seconds of delay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand this correctly, we need to increase the overall time before raising an error so that AWS can have enough time to assign the private DNS name. So if we also shorten the delay, we do not increase the overall time window but just increase the sampling number, that doesn't necessary help.
As for growing exponentially can you elaborate a bit more? I think it will just try 3 more times with constant runtime in each iteration. And for most of the cases we shouldn't even see retry. So the worst case scenario is that we will see a 6 second delay. But this should be extremely rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have good data on tail latencies for this propagation, but it's not uncommon for tail latencies to be quite a bit longer than p50 latency. e.g., I could imagine that p50 = 100ms, but p99 = 30 seconds. Since tails latencies tend to be logarithmic, an exponential is a decent model to minimize the number of retries in all cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at avast/retry and it has exponential backoff built in by default.
A quick test shows that is works, and using a delay of 1 second with 6 attempts yields the desired 30 second delay:
try count 0 current delay 0s cumulative delay 600ns
try count 1 current delay 1.048829547s cumulative delay 1.048830347s
try count 2 current delay 2.08424261s cumulative delay 3.133073157s
try count 3 current delay 4.06698648s cumulative delay 7.200060237s
try count 4 current delay 8.04234009s cumulative delay 15.242400427s
try count 5 current delay 16.090066559s cumulative delay 31.332467186s
It may be worth investigating if any other code using retry doesn't expect exponential backoff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the default behavior is indeed exponential backoff. I did a quick scan and this is the only place that we use avast/retry-go.
1. Issue, if available:
When EC2 takes a longer time to assign the private DNS name of the newly created instance, Karpenter may failed all attends to create the node object due to missing private DNS name.
#1282
2. Description of changes:
Double the get instance retry to 6 times, one second apart.
This will double the time that a private DNS name can be assigned before Karpenter raises an error.
3. How was this change tested?
Unit tests, manually tested.
4. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.