
Double the node creation retry #1286

Merged

merged 1 commit on Feb 8, 2022
Conversation

felix-zhe-huang
Contributor

@felix-zhe-huang felix-zhe-huang commented Feb 7, 2022

1. Issue, if available:
When EC2 takes longer to assign the private DNS name of a newly created instance, Karpenter may fail all attempts to create the node object due to the missing private DNS name.
#1282

2. Description of changes:
Double the getInstances retries to 6 attempts, one second apart.
This doubles the window in which the private DNS name can be assigned before Karpenter raises an error.

3. How was this change tested?
Unit tests, manually tested.

4. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify

netlify bot commented Feb 7, 2022

✔️ Deploy Preview for karpenter-docs-prod ready!

🔨 Explore the source changes: 4c124a3

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/620180071ab3fa00077b2925

😎 Browse the preview: https://deploy-preview-1286--karpenter-docs-prod.netlify.app

@@ -59,7 +59,7 @@ func (p *InstanceProvider) Create(ctx context.Context, constraints *v1alpha1.Con
 	if err := retry.Do(
 		func() (err error) { instances, err = p.getInstances(ctx, ids); return err },
 		retry.Delay(1*time.Second),
-		retry.Attempts(3),
+		retry.Attempts(6),
Contributor

Does this grow exponentially? It would also be nice to reduce the delay to something like p50 propagation time.

Contributor

Hmm -- maybe we don't want to increase because the first call is 0 seconds of delay.

Contributor Author

So if I understand this correctly, we need to increase the overall time before raising an error so that AWS has enough time to assign the private DNS name. If we also shorten the delay, we don't increase the overall time window but only increase the sampling rate, which doesn't necessarily help.

As for growing exponentially, can you elaborate a bit more? I think it will just try 3 more times with a constant delay in each iteration, and in most cases we shouldn't even see a retry. So the worst-case scenario is a 6-second delay, but that should be extremely rare.
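For illustration, here is a minimal self-contained sketch of that constant-delay reading, assuming avast/retry-go's FixedDelay option (note this is not what the diff above configures; the diff relies on the library's default delay type, discussed below):

package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/avast/retry-go"
)

func main() {
	start := time.Now()
	// Always fail so every attempt is consumed; with a fixed 1-second delay,
	// 6 attempts wait roughly 5 seconds in total before giving up.
	_ = retry.Do(
		func() error { return errors.New("private DNS name not yet assigned") },
		retry.Delay(1*time.Second),
		retry.Attempts(6),
		retry.DelayType(retry.FixedDelay), // assumption: constant delay, not the library default
	)
	fmt.Println("gave up after", time.Since(start))
}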

Contributor

I don't have good data on tail latencies for this propagation, but it's not uncommon for tail latencies to be quite a bit longer than the p50 latency, e.g. p50 might be 100ms while p99 is 30 seconds. Since tail latencies tend to be logarithmic, exponential backoff is a decent model for minimizing the number of retries in all cases.
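As a rough sketch of the arithmetic behind that model: with a 1-second base that doubles on each retry, 6 attempts already cover roughly a 30-second tail (ignoring any jitter):

package main

import (
	"fmt"
	"time"
)

func main() {
	base := 1 * time.Second
	var cumulative time.Duration
	// 6 attempts means 5 waits between them: 1s, 2s, 4s, 8s, 16s.
	for i := 0; i < 5; i++ {
		wait := base << i
		cumulative += wait
		fmt.Printf("wait %d: %v (cumulative %v)\n", i+1, wait, cumulative)
	}
	// cumulative ends at 31s, on the order of a ~30-second p99.
}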

Contributor

@tzneal tzneal Feb 8, 2022


I looked at avast/retry and it has exponential backoff built in by default.

A quick test shows that it works, and using a delay of 1 second with 6 attempts yields the desired ~30-second cumulative delay:

try count 0 current delay 0s cumulative delay 600ns
try count 1 current delay 1.048829547s cumulative delay 1.048830347s
try count 2 current delay 2.08424261s cumulative delay 3.133073157s
try count 3 current delay 4.06698648s cumulative delay 7.200060237s
try count 4 current delay 8.04234009s cumulative delay 15.242400427s
try count 5 current delay 16.090066559s cumulative delay 31.332467186s

It may be worth investigating if any other code using retry doesn't expect exponential backoff.
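The test itself isn't included in the thread, but a sketch along these lines, which times the gap between attempts and always returns an error so every retry fires, would produce output like the above (the error message and counters here are illustrative):

package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/avast/retry-go"
)

func main() {
	start := time.Now()
	last := start
	count := 0
	// Always fail so all 6 attempts run with the library's default backoff.
	_ = retry.Do(
		func() error {
			now := time.Now()
			fmt.Printf("try count %d current delay %v cumulative delay %v\n",
				count, now.Sub(last), now.Sub(start))
			last = now
			count++
			return errors.New("still waiting") // illustrative failure
		},
		retry.Delay(1*time.Second),
		retry.Attempts(6),
	)
}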

Contributor Author

Yes, the default behavior is indeed exponential backoff. I did a quick scan, and this is the only place where we use avast/retry-go.


Successfully merging this pull request may close these issues.

Karpenter fails to create node object if an instance does not have PrivateDnsName still an issue