
internal/contour: HoldoffMaxDelay timer should increase if there are pending updates #2275

Closed
youngnick opened this issue Feb 24, 2020 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@youngnick (Member) commented Feb 24, 2020

Currently, in handler.go, the holdoff timer has a maximum delay, after which it fires a forced update.

As part of an investigation into a large amount of memory being used by Envoy at startup, we came up with the theory that, if it takes longer than HoldoffMaxDelay to process all the objects in a cluster, we could end up firing multiple times, maybe a lot of times.

This would create a lot of separate Envoy configs very quickly, which would all also drain quickly (since there would be very few connections actually using each one).

A way to mitigate this may be to dynamically manage holdoffMaxDelay, increasing it for the next cycle if updates are still pending when a max-delay update fires, and resetting it to its original value once the pending updates drain.

This would produce a backoff-style behaviour in Envoy updates when there are a lot of updates pending (such as at startup), which should reduce the number of configurations generated.
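A minimal sketch of what that could look like, assuming a hypothetical `holdoffBackoff` helper; the names, the doubling factor, and the ceiling are illustrative, not Contour's actual API:

```go
package handler

import "time"

// holdoffBackoff is a hypothetical helper that grows the maximum holdoff
// delay while updates keep arriving and restores it once they drain.
type holdoffBackoff struct {
	initial time.Duration // the normal HoldoffMaxDelay
	limit   time.Duration // ceiling so the delay cannot grow without bound
	current time.Duration
}

// next returns the max delay to use for the upcoming holdoff window and
// doubles it for the following window, up to limit.
func (b *holdoffBackoff) next() time.Duration {
	if b.current == 0 {
		b.current = b.initial
	}
	d := b.current
	b.current *= 2
	if b.current > b.limit {
		b.current = b.limit
	}
	return d
}

// reset restores the original max delay once pending updates have drained.
func (b *holdoffBackoff) reset() {
	b.current = b.initial
}
```

With, say, an initial value of 500ms and a ceiling of 8s, successive busy windows would wait 500ms, 1s, 2s, 4s, 8s, 8s, …, then drop back to 500ms via reset once the pending updates drain.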

@youngnick youngnick added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 24, 2020
@davecheney (Contributor)

I'd like to take a stab at this. The goal of holdoffMaxDelay was to prevent the holdoffDelay timer from being slowloris'd forward indefinitely. At the moment holdoffMaxDelay will fire twice per second, even in the face of continual updates. It's reasonable to say that if we hit holdoffMaxDelay there have been continual updates for that period of time, so we can assume they will continue to arrive indefinitely. Doubling holdoffMaxDelay whenever we hit it should be safe, because even if holdoffMaxDelay has been extended many times, the holdoff timer will still fire after holdoffDelay, a much shorter duration.
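For context, a rough sketch of the behaviour being described, with hypothetical names and constants (the 500ms value is chosen to match the "twice per second" figure above); the real loop in handler.go differs, so treat this as an illustration of the mechanism rather than the actual code:

```go
package handler

import "time"

const (
	holdoffDelay    = 100 * time.Millisecond // quiet period after each update
	holdoffMaxDelay = 500 * time.Millisecond // cap on how long a rebuild can be deferred
)

// run sketches the current behaviour: every update pushes the short
// holdoffDelay timer out again (which is how a steady stream of updates can
// slowloris it forward), so once holdoffMaxDelay has elapsed since the last
// rebuild, the update is forced immediately rather than delayed again.
func run(ops <-chan struct{}, rebuild func()) {
	var pending <-chan time.Time
	lastRebuild := time.Now()
	for {
		select {
		case _, ok := <-ops:
			if !ok {
				return
			}
			if time.Since(lastRebuild) > holdoffMaxDelay {
				// explicit forced-update code path
				rebuild()
				lastRebuild = time.Now()
				pending = nil
				continue
			}
			pending = time.After(holdoffDelay) // pushed out by every update
		case <-pending:
			rebuild()
			lastRebuild = time.Now()
			pending = nil
		}
	}
}
```

Under a continual stream of updates, the pending timer never fires, so the forced path runs roughly once per holdoffMaxDelay, i.e. about twice per second with these values.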

@davecheney davecheney changed the title from "internal\contour: HoldoffMaxDelay timer should increase if there are pending updates" to "internal/contour: HoldoffMaxDelay timer should increase if there are pending updates" Feb 24, 2020
@davecheney davecheney self-assigned this Feb 24, 2020
@davecheney davecheney added this to the 1.3.0 milestone Feb 24, 2020
davecheney added a commit to davecheney/contour that referenced this issue Feb 24, 2020
Updates projectcontour#2275

The holdoff logic is subtle and under-tested. To address both of these
issues I'm going to make the fix for projectcontour#2275 in several stages so we have
a chance to git bisect.

This first change removes the explicit force code path when
holdoffmaxdelay is exceeded. The logic remains the same: if
holdoffmaxdelay is exceeded, a delayed update still occurs. However,
rather than there being a specific code path for the forced update, we
now set the timer delay to zero, which makes the <-pending channel ready
shortly afterwards.

Strictly speaking a stream of <-op updates can trump a <-pending, which
can add further delay to rebuilding the DAG, but in practice this is
unlikely to occur indefinitely:

1. When more than one channel is selectable, the runtime chooses between
them pseudo-randomly.
2. If <-op updates occur continually, it is in the users' interest that
the DAG rebuild is delayed; this is the point of #2275.

A future PR will refactor the delay calculation so that we can test it
in isolation.

Signed-off-by: Dave Cheney <[email protected]>
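To make the change concrete, here is the earlier sketch reworked along the lines this commit describes; the names remain hypothetical and this is not the actual handler.go code. The explicit forced branch is gone: when holdoffMaxDelay has been exceeded, the pending timer is simply re-armed with a zero delay.

```go
package handler

import "time"

const (
	holdoffDelay    = 100 * time.Millisecond
	holdoffMaxDelay = 500 * time.Millisecond
)

// run reworks the earlier sketch: there is no separate forced-update branch.
// If holdoffMaxDelay has been exceeded, the pending timer is re-armed with a
// zero delay, so the <-pending case becomes ready almost immediately. A burst
// of <-ops can still win the pseudo-random select over <-pending, which is
// the caveat the commit message calls out.
func run(ops <-chan struct{}, rebuild func()) {
	var pending <-chan time.Time
	lastRebuild := time.Now()
	for {
		select {
		case _, ok := <-ops:
			if !ok {
				return
			}
			delay := holdoffDelay
			if time.Since(lastRebuild) > holdoffMaxDelay {
				delay = 0 // replaces the explicit forced-update code path
			}
			pending = time.After(delay)
		case <-pending:
			rebuild()
			lastRebuild = time.Now()
			pending = nil
		}
	}
}
```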
davecheney added a commit that referenced this issue Feb 25, 2020, with the same commit message as above.
davecheney added a commit to davecheney/contour that referenced this issue Feb 25, 2020
Updates projectcontour#2275

Continue to chip away at the holdoff logic's hooks into the
EventHandler's state. This change renames e.last to a local
lastDAGUpdate variable and cleans up some incorrect comments.

Signed-off-by: Dave Cheney <[email protected]>
davecheney added a commit that referenced this issue Feb 25, 2020, with the same commit message as above.
@davecheney davecheney modified the milestones: 1.3.0, Backlog Feb 27, 2020
@davecheney (Contributor)

Hello,

I've noodled with this for a few days and landed some cleanup PRs for 1.3; however, I have decided not to proceed any further with this work, for several reasons.

  1. After investigating the interaction between Envoy and Contour during startup, the following scenarios are likely.
    a. Contour starts before Envoy; in this case Contour will have processed the updates from the API server before Envoy connects, so Envoy will see one update.
    b. Envoy starts before Contour; this is actually (a) in a different guise. Because Contour does not open its xDS listening socket until its informers have synced, Envoy will not be able to connect during Contour's startup phase (unless it is extremely lucky) and will enter a backoff retry loop. When it does connect, the observed behaviour looks like (a).
  2. I wanted to extract the holdoff logic into its own type which would take a set of notify messages as input and return a delay value. This would free the holdoff logic's tests from having to assert that something did not run for at least a duration. Unfortunately I haven't found a nice API for this logic yet; a rough sketch of one possible shape follows this list.
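For anyone picking this up later, one possible (and admittedly rough) shape for such a type, purely as an illustration of the idea; the `holdoff` name and its methods are hypothetical, not a proposed Contour API:

```go
package handler

import "time"

// holdoff is a sketch of an isolated delay calculator: it is told when
// notify events arrive and when DAG rebuilds complete, and it answers how
// long the caller should wait before the next rebuild. Because it never
// sleeps or starts timers itself, it can be unit tested with fixed
// timestamps instead of asserting that something did not run for a duration.
type holdoff struct {
	delay, maxDelay time.Duration
	firstPending    time.Time // zero when nothing is pending
}

// notify records an incoming update at time now and returns how long the
// caller should wait before rebuilding the DAG.
func (h *holdoff) notify(now time.Time) time.Duration {
	if h.firstPending.IsZero() {
		h.firstPending = now
	}
	// Never wait past firstPending + maxDelay, however many updates arrive.
	if remaining := h.firstPending.Add(h.maxDelay).Sub(now); remaining < h.delay {
		if remaining < 0 {
			return 0
		}
		return remaining
	}
	return h.delay
}

// rebuilt records that the pending updates were flushed into a rebuild.
func (h *holdoff) rebuilt() {
	h.firstPending = time.Time{}
}
```

A test could then feed notify synthetic timestamps and assert on the returned durations directly, with no timers or sleeps involved.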

On balance I think this is something that is worthwhile doing, but I can't justify it now with the current issue load.

@davecheney davecheney removed their assignment Mar 10, 2020
@skriss skriss removed this from the Backlog milestone Jul 25, 2022

github-actions bot commented Mar 6, 2024

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2024
github-actions bot commented Apr 6, 2024 with the same stale-triage notice as above.

@github-actions github-actions bot closed this as not planned Apr 6, 2024