pkg/kates: fix bug where Accumulator was not coalescing changes mid-watch #4488

Merged: haq204 merged 4 commits into master from hqudsi/kates on Sep 15, 2022


@haq204 haq204 commented Sep 6, 2022

Description

The Accumulator struct attempts to coalesce changes into a single snapshot update as a way to do graceful load shedding.
However, while this was the behavior on bootstrap, it didn't always happen mid-watch - each received event turned into its own snapshot update, so the requirement was not really satisfied.

We add a new option to batch changes for a specified window interval before sending a snapshot update.
The batching behavior is as follows:

  • The Accumulator collects raw changes until the window period elapses and then sends a change, even if new updates are still coming in.
    This prevents a scenario where a change is never sent because the cluster is extremely volatile.
    While there may be a more dynamic way to decide how long to wait before sending this change, this approach is simpler and more predictable.

  • If an isolated update comes in (e.g. the last change was submitted an hour ago but the window period is set to 10s), it does not necessarily wait for the window period before being sent; it can be sent immediately.

  • The default interval is set to 1s to be in line with current change velocity.

  • A snapshot update won't be sent until all resources are fully bootstrapped, regardless of what interval is set.
    This is to ensure that the other requirements for the Accumulator are still satisfied.

AMBASSADOR_RECONFIG_MAX_DELAY controls the interval to wait before sending snapshot updates when listening for K8s resources, especially when many resources are updated in quick succession.

Related Issues

Issue is not in this repo.

Testing

Add new test cases.

For performance testing, I ran a test that concurrently applies namespaces, with each namespace applying 35 deployments, hosts, and mappings each. Each deployment has 2 replicas. We track the number of snapshot versions pushed. Previously, applying 1 namespace created ~118 snapshot versions; with a 10s interval, that number dropped to 16. For concurrent namespaces, it would previously start OOMing at 6 concurrent namespaces; with a 10s interval it handles more than 15. Peak memory usage was 600MB, so with a higher memory limit it can likely support many more.

Checklist

  • I made sure to update CHANGELOG.md.

    Remember, the CHANGELOG needs to mention:

    • Any new features
    • Any changes to our included version of Envoy
    • Any non-backward-compatible changes
    • Any deprecations
  • This is unlikely to impact how Ambassador performs at scale - load testing shows xxxx.

    Remember, things that might have an impact at scale include:

    • Any significant changes in memory use that might require adjusting the memory limits
    • Any significant changes in CPU use that might require adjusting the CPU limits
    • Anything that might change how many replicas users should use
    • Changes that impact data-plane latency/scalability
  • My change is adequately tested.

    Remember when considering testing:

    • Your change needs to be specifically covered by tests.
      • Tests need to cover all the states where your change is relevant: for example, if you add a behavior that can be enabled or disabled, you'll need tests that cover the enabled case and tests that cover the disabled case. It's not sufficient just to test with the behavior enabled.
    • You also need to make sure that the entire area being changed has adequate test coverage.
      • If existing tests don't actually cover the entire area being changed, add tests.
      • This applies even for aspects of the area that you're not changing – check the test coverage, and improve it if needed!
    • We should lean on the bulk of code being covered by unit tests, but...
    • ... an end-to-end test should cover the integration points
  • I updated DEVELOPING.md with any special dev tricks I had to use to work on this code efficiently - N/A.

  • The changes in this PR have been reviewed for security concerns and adherence to security best practices.

@haq204 haq204 changed the base branch from master to release/v2.4 September 6, 2022 22:05
@haq204 haq204 force-pushed the hqudsi/kates branch 2 times, most recently from 6087e1f to 051eb11 Compare September 7, 2022 18:23
@haq204 haq204 changed the base branch from release/v2.4 to master September 8, 2022 17:34
@haq204 haq204 force-pushed the hqudsi/kates branch 2 times, most recently from 5271ca9 to 8df2095 Compare September 8, 2022 19:29
@haq204 haq204 changed the title from "pkg/kates: fix bug where Accumulator was not coalescing changes mid-watch" to "WIP pkg/kates: fix bug where Accumulator was not coalescing changes mid-watch" Sep 8, 2022
@haq204 haq204 marked this pull request as ready for review September 12, 2022 21:09
@haq204 haq204 requested review from LanceEa and LukeShu September 12, 2022 21:09
@haq204 haq204 changed the title from "WIP pkg/kates: fix bug where Accumulator was not coalescing changes mid-watch" to "pkg/kates: fix bug where Accumulator was not coalescing changes mid-watch" Sep 12, 2022
Contributor

@LanceEa LanceEa left a comment

Quite a few comments, questions and suggestions. I will ping you in the morning to go over a few items to make sure I'm understanding them all correctly.

BTW...awesome to see the improved testing results :)

cmd/entrypoint/entrypoint.go (outdated, resolved)
docs/releaseNotes.yml (outdated, resolved)
cmd/entrypoint/watcher.go (outdated, resolved)
pkg/kates/accumulator.go (outdated, resolved)
pkg/kates/accumulator.go (outdated, resolved)
pkg/kates/accumulator_test.go (resolved)
pkg/kates/accumulator_test.go (resolved)
@LanceEa LanceEa requested a review from ddymko September 13, 2022 03:34
@LanceEa
Contributor

LanceEa commented Sep 13, 2022

@ddymko - I think it is important for you to have your eyes on this just so you can familiarize yourself with it a little bit.

@haq204 haq204 force-pushed the hqudsi/kates branch 2 times, most recently from a8d6656 to 64ca96c Compare September 13, 2022 14:41
TestBootStrapNoNotifyBeforeSync creates ConfigMaps during its tests which don't get cleaned up afterwards,
potentially infecting other tests. Add t.Cleanup to the test to clean up those ConfigMaps after the test is finished.

Signed-off-by: Hamzah Qudsi <[email protected]>
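
The cleanup pattern this commit describes can be sketched as follows. The sketch does not touch a real cluster or the actual test: `fakeStore`, `fakeT`, and `createConfigMap` are hypothetical names for this example. The only real API relied on is the `Cleanup(func())` method that `*testing.T` provides, which runs registered functions in last-added, first-called order when the test finishes.

```go
package main

import "fmt"

// cleanupRegistrar is the subset of *testing.T used in this sketch;
// *testing.T provides Cleanup(func()).
type cleanupRegistrar interface {
	Cleanup(func())
}

// fakeStore is a hypothetical in-memory stand-in for the cluster.
type fakeStore struct{ configMaps map[string]bool }

// createConfigMap creates a ConfigMap and immediately registers its
// deletion, so the resource cannot leak into later tests even if the
// test body fails partway through.
func createConfigMap(t cleanupRegistrar, s *fakeStore, name string) {
	s.configMaps[name] = true
	t.Cleanup(func() { delete(s.configMaps, name) })
}

// fakeT is a hypothetical stand-in for *testing.T so the sketch runs
// outside "go test"; it runs cleanups in last-added, first-called order.
type fakeT struct{ cleanups []func() }

func (f *fakeT) Cleanup(fn func()) { f.cleanups = append(f.cleanups, fn) }

func (f *fakeT) finish() {
	for i := len(f.cleanups) - 1; i >= 0; i-- {
		f.cleanups[i]()
	}
}

func main() {
	store := &fakeStore{configMaps: map[string]bool{}}
	t := &fakeT{}
	createConfigMap(t, store, "test-cm-1")
	createConfigMap(t, store, "test-cm-2")
	fmt.Println(len(store.configMaps)) // ConfigMaps exist while the test body runs
	t.finish()
	fmt.Println(len(store.configMaps)) // nothing is left behind for later tests
}
```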
@haq204 haq204 marked this pull request as draft September 13, 2022 19:19
@haq204
Author

haq204 commented Sep 13, 2022

Going to make some changes after discussions with @LanceEa and @ddymko on expected batching behavior

@haq204 haq204 marked this pull request as ready for review September 14, 2022 18:52
docs/releaseNotes.yml (outdated, resolved)
pkg/kates/accumulator.go (outdated, resolved)
Contributor

@LanceEa LanceEa left a comment

overall looks good based on our discussion. Just a quick question and I agree with David's feedback.

cmd/entrypoint/watcher.go (resolved)
@haq204 haq204 force-pushed the hqudsi/kates branch 3 times, most recently from 805ca7d to 9ded3cb Compare September 14, 2022 21:38
LanceEa
LanceEa previously approved these changes Sep 15, 2022
Contributor

@LanceEa LanceEa left a comment

lgtm

@ddymko
Member

ddymko commented Sep 15, 2022

@haq204 I think you have to rerun make generate because we cut the 3.2.0-rc.1 yesterday

Otherwise it looks good!

Hamzah Qudsi added 3 commits September 15, 2022 10:40
The Accumulator struct attempts to coalesce changes into a single snapshot update as a way to do graceful load shedding.
However, while this was the behavior on bootstrap, it didn't always happen mid-watch - each received event turned into its own snapshot update, so the requirement was not really satisfied.

We add a new option to batch changes for a specified window interval before sending a snapshot update.
The batching behavior is as follows:
 - The Accumulator collects raw changes until the window period elapses and then sends a change, even if new updates are still coming in.
   This prevents a scenario where a change is never sent because the cluster is extremely volatile.
   While there may be a more dynamic way to decide how long to wait before sending this change, this approach is simpler and more predictable.

 - If an isolated update comes in (e.g. the last change was submitted an hour ago but the window period is set to 10s), it does not necessarily wait for the window period before being sent; it can be sent immediately.

 - The default interval is set to 1s to be in line with current change velocity.

 - A snapshot update won't be sent until all resources are fully bootstrapped, regardless of what interval is set.
   This is to ensure that the other requirements for the Accumulator are still satisfied.

For testing, we add new test cases.

Signed-off-by: Hamzah Qudsi <[email protected]>
AMBASSADOR_RECONFIG_MAX_DELAY controls the interval to wait before sending snapshot updates when listening for K8s resources, especially when many resources are updated in quick succession.

Signed-off-by: Hamzah Qudsi <[email protected]>
Signed-off-by: Hamzah Qudsi <[email protected]>
Member

@ddymko ddymko left a comment

looks good

@haq204 haq204 merged commit 1241eb5 into master Sep 15, 2022
@haq204 haq204 deleted the hqudsi/kates branch September 15, 2022 16:25
@haq204 haq204 mentioned this pull request Sep 15, 2022
5 tasks
haq204 pushed a commit that referenced this pull request Sep 15, 2022
@Alice-Lilith Alice-Lilith mentioned this pull request Sep 22, 2022
5 tasks
3 participants