Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fault: only check for overflow when performing faults #12843

Merged
merged 6 commits into from
Aug 29, 2020

Conversation

snowp
Copy link
Contributor

@snowp snowp commented Aug 27, 2020

This shuffles the check for overflow around so that we only increment
the overflow value when attempting to apply a fault instead of for
each request. This ensures that if applying a low % fault the overflow
stat will be in line with that, instead of counting each request once
the overflow threshold has been hit.

Signed-off-by: Snow Pettersen [email protected]

Risk Level: Medium
Testing: New UT, existing tests
Docs Changes: n/a
Release Notes: n/a
Fixes #12816

snowp added 3 commits August 27, 2020 15:16
This shuffles the check for overflow around so that we only increment
the overflow value when attempting to apply a fault instead of for
each request. This ensures that if applying a low % fault the overflow
stat will be in line with that, instead of counting each request once
the overflow threshold has been hit.

Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
// single request might increment the counter more than once if it tries to apply multiple faults,
// though it is also possible for it to fail the first check then succeed on the second (should
// another thread decrement the fault).
// TODO(snowp): Is this behavior ideal? Should we track whether we've rejected faults for this
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the main open question in my mind, not sure how important we feel this is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that this is a big issue for us.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's probably fine either way. I would go with whatever is easiest and most clear to implement. Can you remove the TODO and make it clear what we have chosen? Also not sure if it's worth it to add docs on the stat that makes this mroe clear, up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the docs imply the behavior I changed it to - it counts the number of faults were skipped due to overflow - so I'll leave it at that. Updated the comment a bit and removed the TODO

@mattklein123
Copy link
Member

@Augustyniak can you take a first pass?

@Augustyniak
Copy link
Contributor

@Augustyniak can you take a first pass?

Sure.

Copy link
Contributor

@Augustyniak Augustyniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 nit comments

// single request might increment the counter more than once if it tries to apply multiple faults,
// though it is also possible for it to fail the first check then succeed on the second (should
// another thread decrement the fault).
// TODO(snowp): Is this behavior ideal? Should we track whether we've rejected faults for this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that this is a big issue for us.

@@ -391,15 +401,29 @@ FaultFilterStats FaultFilterConfig::generateStats(const std::string& prefix, Sta
POOL_GAUGE_PREFIX(scope, final_prefix))};
}

void FaultFilter::maybeIncActiveFaults() {
bool FaultFilter::maybeDoFault() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function doesn't really do fault - it just decides whether a fault can be injected or not. What do you think about renaming it to something like canApplyFault()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I tried to capture the fact that we also increment the gauge here, but I agree that it's a bit misleading. I'll go with your suggestion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually will do tryIncActiveFaults which I think is a bit clearer


EXPECT_CALL(runtime_.snapshot_, getInteger("fault.http.max_active_faults", 0))
EXPECT_CALL(runtime_.snapshot_,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in line #528 - can we verify that active_faults counter is equal to 0 when there is a faults_overflow.

Signed-off-by: Snow Pettersen <[email protected]>
Augustyniak
Augustyniak previously approved these changes Aug 28, 2020
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with small comment, thank you!

/wait

// single request might increment the counter more than once if it tries to apply multiple faults,
// though it is also possible for it to fail the first check then succeed on the second (should
// another thread decrement the fault).
// TODO(snowp): Is this behavior ideal? Should we track whether we've rejected faults for this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO it's probably fine either way. I would go with whatever is easiest and most clear to implement. Can you remove the TODO and make it clear what we have chosen? Also not sure if it's worth it to add docs on the stat that makes this mroe clear, up to you.

Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Snow Pettersen <[email protected]>
@snowp snowp merged commit 2f68705 into envoyproxy:master Aug 29, 2020
clarakosi pushed a commit to clarakosi/envoy that referenced this pull request Sep 3, 2020
This shuffles the check for overflow around so that we only increment
the overflow value when attempting to apply a fault instead of for
each request. This ensures that if applying a low % fault the overflow
stat will be in line with that, instead of counting each request once
the overflow threshold has been hit.

Signed-off-by: Snow Pettersen <[email protected]>
Signed-off-by: Clara Andrew-Wani <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FaultFilter] Inaccurate faults overflow metrics
3 participants