VAULT-28192 fix Agent and Proxy consuming large amounts of CPU for auto-auth self-healing #27518

Merged
VioletHynes merged 5 commits into main from violethynes/VAULT-28192 on Jun 19, 2024

Conversation

VioletHynes
Contributor

@VioletHynes VioletHynes commented Jun 17, 2024

Description

Fixes an issue introduced in 1.17 where CPU usage in Agent and Proxy is extremely high because the code repeatedly takes the same path through a select statement, producing an infinite loop.

Will be backported to 1.17.

Fixes #27505

TODO only if you're a HashiCorp employee

  • Labels: If this PR is the CE portion of an ENT change, and that ENT change is
    getting backported to N-2, use the new style backport/ent/x.x.x+ent labels
    instead of the old style backport/x.x.x labels.
  • Labels: If this PR is a CE only change, it can only be backported to N, so use
    the normal backport/x.x.x label (there should be only 1).
  • ENT Breakage: If this PR either 1) removes a public function OR 2) changes the signature
    of a public function, even if that change is in a CE file, double check that
    applying the patch for this PR to the ENT repo and running tests doesn't
    break any tests. Sometimes ENT only tests rely on public functions in CE
    files.
  • Jira: If this change has an associated Jira, it's referenced either
    in the PR description, commit message, or branch name.
  • RFC: If this change has an associated RFC, please link it in the description.
  • ENT PR: If this change has an associated ENT PR, please link it in the
    description. Also, make sure the changelog is in this PR, not in your ENT PR.

@github-actions github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Jun 17, 2024

github-actions bot commented Jun 17, 2024

CI Results:
All Go tests succeeded! ✅

@VioletHynes VioletHynes added this to the 1.17.1 milestone Jun 17, 2024
@VioletHynes VioletHynes marked this pull request as ready for review June 17, 2024 16:05

github-actions bot commented Jun 17, 2024

Build Results:
All builds succeeded! ✅

@jasonodonnell jasonodonnell self-requested a review June 17, 2024 17:59
Contributor

@jasonodonnell jasonodonnell left a comment

LGTM

invalidTokenCh <- err
}
default:
case err := <-ts.runner.ServerErrCh:
Contributor

I wonder if this is a possible scenario -

  • IncomingToken receives a new token at the same time the template sends an error back to ServerErrCh.
  • We reauthenticate first because the select honors ServerErrCh first. Now IncomingCh has two values in the channel.
  • We try using the first token in IncomingCh, but there is an error. Another error is sent to ServerErrCh.
  • Again ServerErrCh is honored first and we reauthenticate.

We are now stuck in a loop where we always honor the token one behind the valid token.

Contributor Author

Hmm, that's a good point! I do think it's likely in this scenario that both tokens will be valid, but it's still not a great state to be in. I'll rework this to drain the incoming channel in the same place we drain the invalid token channel. I think that should prevent any looping.
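Draining a channel without blocking can be sketched as follows. This is an illustration of the general technique only, not the change made in this PR: `drainTokens` and the `incoming` channel are hypothetical names, and the sketch assumes a buffered channel of string tokens.

```go
package main

import "fmt"

// drainTokens removes every value currently buffered in ch without blocking
// and reports how many values were discarded.
func drainTokens(ch chan string) int {
	drained := 0
	for {
		select {
		case <-ch:
			drained++
		default:
			// The channel is empty, so stop instead of waiting for more.
			return drained
		}
	}
}

func main() {
	incoming := make(chan string, 4)
	incoming <- "stale-token-1"
	incoming <- "stale-token-2"

	fmt.Println("drained:", drainTokens(incoming))   // prints "drained: 2"
	fmt.Println("left in channel:", len(incoming)) // prints "left in channel: 0"
}
```

Here the `default` branch returns rather than looping, so the helper terminates as soon as the channel is empty.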

Contributor Author

I definitely understand why we had it the way we did before, but I do think this might be the best fix. The only situation in which it would struggle is if the two channels are filled at exactly the same time.

Contributor

I like this!! Thanks for adding this, Violet!

@VioletHynes VioletHynes merged commit 3959722 into main Jun 19, 2024
82 of 83 checks passed
@VioletHynes VioletHynes deleted the violethynes/VAULT-28192 branch June 19, 2024 14:23
Development

Successfully merging this pull request may close these issues.

please undo 1.17: severe CPU usage
3 participants