Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix ever-accumulating memory in logger #5284

Merged
merged 1 commit into from
Nov 18, 2024

Conversation

nojnhuh
Copy link
Contributor

@nojnhuh nojnhuh commented Nov 18, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:

This PR fixes an issue where the long-lived loggers created in the SetupWithManager methods for each controller were accumulating a constantly-growing set of key/value pairs that could never be garbage collected.

Here is a Prometheus graph showing the fix enabled halfway through:
memleak

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5245

Special notes for your reviewer:

Reproducing this is actually fairly straightforward. I was testing this by creating about 10 spec.paused: true Clusters from the machinepool flavor in Tilt, then running this command in 3 or 4 terminal windows to spam updates to the CAPZ resources to generate logs:

while true; do kubectl annotate azurecluster,amp,azuremachine --all --overwrite jon=$(head -c 32 /dev/urandom | base64); done

Pretty much immediately in Prometheus you'll see memory start to climb.

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Fixed an issue where CAPZ was accumulating memory over time that could never be garbage collected.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 18, 2024
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 18, 2024
@nojnhuh
Copy link
Contributor Author

nojnhuh commented Nov 18, 2024

/cherry-pick release-1.16
/cherry-pick release-1.17
/assign @willie-yao

@k8s-infra-cherrypick-robot

@nojnhuh: once the present PR merges, I will cherry-pick it on top of release-1.16, release-1.17 in new PRs and assign them to you.

In response to this:

/cherry-pick release-1.16
/cherry-pick release-1.17
/assign @willie-yao

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@nojnhuh
Copy link
Contributor Author

nojnhuh commented Nov 18, 2024

Thanks to @andreipantelimon and @dkoshkin for all the detail provided in #5245 and #2639, that was immensely helpful!

Copy link

codecov bot commented Nov 18, 2024

Codecov Report

Attention: Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 52.53%. Comparing base (7e9ebb9) to head (4ed21b4).
Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
util/tele/span_logger.go 77.77% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5284      +/-   ##
==========================================
- Coverage   53.00%   52.53%   -0.47%     
==========================================
  Files         272      272              
  Lines       29429    29433       +4     
==========================================
- Hits        15598    15463     -135     
- Misses      13027    13167     +140     
+ Partials      804      803       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Comment on lines -91 to -92
func (s *spanLogSink) WithValues(keysAndValues ...interface{}) logr.LogSink {
s.vals = append(s.vals, keysAndValues...)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with this is that s is being mutated through the pointer receiver, so this slice will grow forever and can't be garbage collected. The method receiver here is changed to a value, which fixes the memory issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +92 to +96
// always create a new slice to avoid multiple loggers writing to the same backing array
vals := make([]interface{}, len(s.vals)+len(keysAndValues))
copy(vals, s.vals)
copy(vals[len(s.vals):], keysAndValues)
s.vals = vals
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond just the memory issue, I noticed a separate issue where a plain append could be clobbering the values in other sibling loggers when the backing array happens to have extra capacity. Replacing all of this with s.vals = append(s.vals, keysAndValues...) and running the new unit test should show that issue.

@dtzar dtzar added this to the v1.18 milestone Nov 18, 2024
Copy link
Contributor

@willie-yao willie-yao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: b9d9a1484b33849f7c9ffce92f878b4365821cf4

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: willie-yao

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 18, 2024
@k8s-ci-robot k8s-ci-robot merged commit b972143 into kubernetes-sigs:main Nov 18, 2024
21 checks passed
@k8s-infra-cherrypick-robot

@nojnhuh: new pull request created: #5285

In response to this:

/cherry-pick release-1.16
/cherry-pick release-1.17
/assign @willie-yao

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-infra-cherrypick-robot

@nojnhuh: new pull request created: #5286

In response to this:

/cherry-pick release-1.16
/cherry-pick release-1.17
/assign @willie-yao

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andreipantelimon
Copy link

Hey, thank you for the fix. It will greatly improve our system!
@nojnhuh can you cut a patch release with this fix?

@willie-yao
Copy link
Contributor

Hey, thank you for the fix. It will greatly improve our system! @nojnhuh can you cut a patch release with this fix?

We are planning to release a new minor version along with patch releases this week!

@k8s-infra-cherrypick-robot

@willie-yao: new pull request could not be created: failed to create pull request against kubernetes-sigs/cluster-api-provider-azure#release-1.16 from head k8s-infra-cherrypick-robot:cherry-pick-5284-to-release-1.16: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"No commits between kubernetes-sigs:release-1.16 and k8s-infra-cherrypick-robot:cherry-pick-5284-to-release-1.16"}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}

In response to this:

/cherry-pick release-1.17
/cherry-pick release-1.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@willie-yao
Copy link
Contributor

/cherry-pick release-1.17

@k8s-infra-cherrypick-robot

@willie-yao: new pull request could not be created: failed to create pull request against kubernetes-sigs/cluster-api-provider-azure#release-1.17 from head k8s-infra-cherrypick-robot:cherry-pick-5284-to-release-1.17: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"No commits between kubernetes-sigs:release-1.17 and k8s-infra-cherrypick-robot:cherry-pick-5284-to-release-1.17"}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request","status":"422"}

In response to this:

/cherry-pick release-1.17

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@willie-yao
Copy link
Contributor

ah it has already been cherry-picked, my b!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

CAPZ Controller Memory Leak
6 participants