-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fault: only check for overflow when performing faults #12843
Conversation
This shuffles the check for overflow around so that we only increment the overflow value when attempting to apply a fault instead of for each request. This ensures that if applying a low % fault the overflow stat will be in line with that, instead of counting each request once the overflow threshold has been hit. Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
// single request might increment the counter more than once if it tries to apply multiple faults, | ||
// though it is also possible for it to fail the first check then succeed on the second (should | ||
// another thread decrement the fault). | ||
// TODO(snowp): Is this behavior ideal? Should we track whether we've rejected faults for this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the main open question in my mind, not sure how important we feel this is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that this is a big issue for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it's probably fine either way. I would go with whatever is easiest and most clear to implement. Can you remove the TODO and make it clear what we have chosen? Also not sure if it's worth it to add docs on the stat that makes this mroe clear, up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the docs imply the behavior I changed it to - it counts the number of faults were skipped due to overflow - so I'll leave it at that. Updated the comment a bit and removed the TODO
@Augustyniak can you take a first pass? |
Sure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 nit comments
// single request might increment the counter more than once if it tries to apply multiple faults, | ||
// though it is also possible for it to fail the first check then succeed on the second (should | ||
// another thread decrement the fault). | ||
// TODO(snowp): Is this behavior ideal? Should we track whether we've rejected faults for this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think that this is a big issue for us.
@@ -391,15 +401,29 @@ FaultFilterStats FaultFilterConfig::generateStats(const std::string& prefix, Sta | |||
POOL_GAUGE_PREFIX(scope, final_prefix))}; | |||
} | |||
|
|||
void FaultFilter::maybeIncActiveFaults() { | |||
bool FaultFilter::maybeDoFault() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function doesn't really do fault - it just decides whether a fault can be injected or not. What do you think about renaming it to something like canApplyFault()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I tried to capture the fact that we also increment the gauge here, but I agree that it's a bit misleading. I'll go with your suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually will do tryIncActiveFaults
which I think is a bit clearer
|
||
EXPECT_CALL(runtime_.snapshot_, getInteger("fault.http.max_active_faults", 0)) | ||
EXPECT_CALL(runtime_.snapshot_, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in line #528 - can we verify that active_faults counter is equal to 0 when there is a faults_overflow.
Signed-off-by: Snow Pettersen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with small comment, thank you!
/wait
// single request might increment the counter more than once if it tries to apply multiple faults, | ||
// though it is also possible for it to fail the first check then succeed on the second (should | ||
// another thread decrement the fault). | ||
// TODO(snowp): Is this behavior ideal? Should we track whether we've rejected faults for this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO it's probably fine either way. I would go with whatever is easiest and most clear to implement. Can you remove the TODO and make it clear what we have chosen? Also not sure if it's worth it to add docs on the stat that makes this mroe clear, up to you.
Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
This shuffles the check for overflow around so that we only increment the overflow value when attempting to apply a fault instead of for each request. This ensures that if applying a low % fault the overflow stat will be in line with that, instead of counting each request once the overflow threshold has been hit. Signed-off-by: Snow Pettersen <[email protected]> Signed-off-by: Clara Andrew-Wani <[email protected]>
This shuffles the check for overflow around so that we only increment
the overflow value when attempting to apply a fault instead of for
each request. This ensures that if applying a low % fault the overflow
stat will be in line with that, instead of counting each request once
the overflow threshold has been hit.
Signed-off-by: Snow Pettersen [email protected]
Risk Level: Medium
Testing: New UT, existing tests
Docs Changes: n/a
Release Notes: n/a
Fixes #12816