
internal/contour: HoldoffMaxDelay timer should increase if there are pending updates #2275

Closed
youngnick opened this issue Feb 24, 2020 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@youngnick (Member) commented Feb 24, 2020

Currently, in handler.go, the holdoff timer has a maximum delay, after which it fires a forced update.

As part of an investigation into a large amount of memory being used by Envoy at startup, we came up with the theory that, if it takes longer than HoldoffMaxDelay to process all the objects in a cluster, we could end up firing multiple times, maybe a lot of times.

This would create a lot of separate Envoy configs very quickly, which would all also drain quickly (since there would be very few connections actually using each one).

A way to mitigate this may be to dynamically manage holdoffMaxDelay, increasing it for the next cycle if updates are still pending when a max-delay update fires, and resetting it to its original value once the pending updates drain.

This would produce a backoff-style behaviour in Envoy updates when there are a lot of updates pending (such as at startup), which should reduce the number of configurations generated.
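A minimal sketch of what that could look like, assuming a hypothetical `holdoffBackoff` helper; the names, the doubling factor, and the ceiling are illustrative, not Contour's actual API:

```go
package handler

import "time"

// holdoffBackoff is a hypothetical helper that grows the maximum holdoff
// delay while updates keep arriving and restores it once they drain.
type holdoffBackoff struct {
	initial time.Duration // the normal HoldoffMaxDelay
	limit   time.Duration // ceiling so the delay cannot grow without bound
	current time.Duration
}

// next returns the max delay to use for the upcoming holdoff window and
// doubles it for the following window, up to limit.
func (b *holdoffBackoff) next() time.Duration {
	if b.current == 0 {
		b.current = b.initial
	}
	d := b.current
	b.current *= 2
	if b.current > b.limit {
		b.current = b.limit
	}
	return d
}

// reset restores the original max delay once pending updates have drained.
func (b *holdoffBackoff) reset() {
	b.current = b.initial
}
```

With, say, an initial value of 500ms and a ceiling of 8s, successive busy windows would wait 500ms, 1s, 2s, 4s, 8s, 8s, …, then drop back to 500ms via reset once the pending updates drain.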

@youngnick youngnick added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 24, 2020
@davecheney (Contributor)

I'd like to take a stab at this. The goal of holdoffMaxDelay was to prevent the holdoffDelay timer from being slowloris'd forward indefinitely. At the moment holdoffMaxDelay will fire twice per second, even in the face of continual updates. It's reasonable to say that if we hit holdoffMaxDelay there have been continual updates for that period of time, so we can assume they will continue to arrive indefinitely. Doubling holdoffMaxDelay whenever we hit it should be safe, because even if holdoffMaxDelay has been extended many times, the holdoff timer will still fire after holdoffDelay, a much shorter duration.
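For context, a rough sketch of the behaviour being described, with hypothetical names and constants (the 500ms value is chosen to match the "twice per second" figure above); the real loop in handler.go differs, so treat this as an illustration of the mechanism rather than the actual code:

```go
package handler

import "time"

const (
	holdoffDelay    = 100 * time.Millisecond // quiet period after each update
	holdoffMaxDelay = 500 * time.Millisecond // cap on how long a rebuild can be deferred
)

// run sketches the current behaviour: every update pushes the short
// holdoffDelay timer out again (which is how a steady stream of updates can
// slowloris it forward), so once holdoffMaxDelay has elapsed since the last
// rebuild, the update is forced immediately rather than delayed again.
func run(ops <-chan struct{}, rebuild func()) {
	var pending <-chan time.Time
	lastRebuild := time.Now()
	for {
		select {
		case _, ok := <-ops:
			if !ok {
				return
			}
			if time.Since(lastRebuild) > holdoffMaxDelay {
				// explicit forced-update code path
				rebuild()
				lastRebuild = time.Now()
				pending = nil
				continue
			}
			pending = time.After(holdoffDelay) // pushed out by every update
		case <-pending:
			rebuild()
			lastRebuild = time.Now()
			pending = nil
		}
	}
}
```

Under a continual stream of updates, the pending timer never fires, so the forced path runs roughly once per holdoffMaxDelay, i.e. about twice per second with these values.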

@davecheney davecheney changed the title from "internal\contour: HoldoffMaxDelay timer should increase if there are pending updates" to "internal/contour: HoldoffMaxDelay timer should increase if there are pending updates" Feb 24, 2020
@davecheney davecheney self-assigned this Feb 24, 2020
@davecheney davecheney added this to the 1.3.0 milestone Feb 24, 2020
davecheney added a commit to davecheney/contour that referenced this issue Feb 24, 2020
Updates projectcontour#2275

The holdoff logic is subtle and under-tested. To address both of these
issues I'm going to make the fix for projectcontour#2275 in several stages so we have
a chance to git bisect.

This first change removes the explicit force code path when
holdoffmaxdelay is exceeded. The logic remains the same: if
holdoffmaxdelay is exceeded, a delayed update still occurs. However,
rather than there being a specific code path for the forced update, we
now set the timer delay to zero, which makes the <-pending channel ready
shortly afterwards.

Strictly speaking a stream of <-op updates can trump a <-pending, which
can add further delay to rebuilding the DAG, but in practice this is
unlikely to occur indefinitely:

1. When more than one channel is selectable, the runtime chooses between
them pseudo-randomly.
2. If <-op updates occur continually, it is in the users' interest that
the DAG rebuild is delayed; this is the point of #2275.

A future PR will refactor the delay calculation so that we can test it
in isolation.

Signed-off-by: Dave Cheney <[email protected]>
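To make the change concrete, here is the earlier sketch reworked along the lines this commit describes; the names remain hypothetical and this is not the actual handler.go code. The explicit forced branch is gone: when holdoffMaxDelay has been exceeded, the pending timer is simply re-armed with a zero delay.

```go
package handler

import "time"

const (
	holdoffDelay    = 100 * time.Millisecond
	holdoffMaxDelay = 500 * time.Millisecond
)

// run reworks the earlier sketch: there is no separate forced-update branch.
// If holdoffMaxDelay has been exceeded, the pending timer is re-armed with a
// zero delay, so the <-pending case becomes ready almost immediately. A burst
// of <-ops can still win the pseudo-random select over <-pending, which is
// the caveat the commit message calls out.
func run(ops <-chan struct{}, rebuild func()) {
	var pending <-chan time.Time
	lastRebuild := time.Now()
	for {
		select {
		case _, ok := <-ops:
			if !ok {
				return
			}
			delay := holdoffDelay
			if time.Since(lastRebuild) > holdoffMaxDelay {
				delay = 0 // replaces the explicit forced-update code path
			}
			pending = time.After(delay)
		case <-pending:
			rebuild()
			lastRebuild = time.Now()
			pending = nil
		}
	}
}
```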
davecheney added a commit that referenced this issue Feb 25, 2020, with the same commit message as above.
davecheney added a commit to davecheney/contour that referenced this issue Feb 25, 2020
Updates projectcontour#2275

Continue to chip away at the holdoff logic's hooks into the
EventHandler's state. This change renames e.last to a local
lastDAGUpdate variable and cleans up some incorrect comments.

Signed-off-by: Dave Cheney <[email protected]>
davecheney added a commit that referenced this issue Feb 25, 2020, with the same commit message as above.
@davecheney davecheney modified the milestones: 1.3.0, Backlog Feb 27, 2020
@davecheney (Contributor)

Hello,

I've noodled with this for a few days and landed some cleanup PRs for 1.3; however, I have decided not to proceed any further with this work, for several reasons.

  1. After investigating the interaction between Envoy and Contour during startup, the following scenarios are likely.
    a. Contour starts before Envoy; in this case Contour will have processed the updates from the API server before Envoy connects, so Envoy will see one update.
    b. Envoy starts before Contour; this is actually (a) in a different guise. Because Contour does not open its xDS listening socket until its informers have synced, Envoy will not be able to connect during Contour's startup phase (unless it is extremely lucky) and will enter a backoff retry loop. When it does connect, the observed behaviour looks like (a).
  2. I wanted to extract the holdoff logic into its own type which would take a set of notify messages as input and return a delay value. This would free the holdoff logic's tests from having to assert that something did not run for at least a duration. Unfortunately I haven't found a nice API for this logic yet; a rough sketch of one possible shape follows this list.
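For anyone picking this up later, one possible (and admittedly rough) shape for such a type, purely as an illustration of the idea; the `holdoff` name and its methods are hypothetical, not a proposed Contour API:

```go
package handler

import "time"

// holdoff is a sketch of an isolated delay calculator: it is told when
// notify events arrive and when DAG rebuilds complete, and it answers how
// long the caller should wait before the next rebuild. Because it never
// sleeps or starts timers itself, it can be unit tested with fixed
// timestamps instead of asserting that something did not run for a duration.
type holdoff struct {
	delay, maxDelay time.Duration
	firstPending    time.Time // zero when nothing is pending
}

// notify records an incoming update at time now and returns how long the
// caller should wait before rebuilding the DAG.
func (h *holdoff) notify(now time.Time) time.Duration {
	if h.firstPending.IsZero() {
		h.firstPending = now
	}
	// Never wait past firstPending + maxDelay, however many updates arrive.
	if remaining := h.firstPending.Add(h.maxDelay).Sub(now); remaining < h.delay {
		if remaining < 0 {
			return 0
		}
		return remaining
	}
	return h.delay
}

// rebuilt records that the pending updates were flushed into a rebuild.
func (h *holdoff) rebuilt() {
	h.firstPending = time.Time{}
}
```

A test could then feed notify synthetic timestamps and assert on the returned durations directly, with no timers or sleeps involved.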

On balance I think this is something that is worthwhile doing, but I can't justify it now with the current issue load.

@davecheney davecheney removed their assignment Mar 10, 2020
@skriss skriss removed this from the Backlog milestone Jul 25, 2022

github-actions bot commented Mar 6, 2024

The Contour project currently lacks enough contributors to adequately respond to all Issues.

This bot triages Issues according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, the Issue is closed

You can:

  • Mark this Issue as fresh by commenting
  • Close this Issue
  • Offer to help out with triage

Please send feedback to the #contour channel in the Kubernetes Slack

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 6, 2024
github-actions bot commented Apr 6, 2024 with the same stale-triage notice as above.

@github-actions github-actions bot closed this as not planned Apr 6, 2024