docs: clarify behavior of hedge_on_per_try_timeout #12983
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to
@snowp I had a hard time understanding the original phrasing so I took a stab at rewriting it, but I'm not really sure I'm describing how it actually works.
c27e3bc to 1582210
This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
bump @mpuncel
Thanks for improving the docs!
I think you'll also want to update the V3 docs, V2 is on its way out. The V4alpha docs should then be automatically updated once you run the proto_format.sh script.
// response headers would otherwise be retried according the specified
// :ref:`RetryPolicy <envoy_api_msg_route.RetryPolicy>`.
// Indicates that a hedged request should be sent when the per-try timeout is hit.
// This will only occur if the :ref:`RetryPolicy <envoy_api_msg_route.RetryPolicy>` also indicates that
I don't think this part is right, retry policy doesn't matter when per try timeout is hit, only whether hedge_on_per_try_timeout is set
reference comment:
RetryStatus RetryStateImpl::shouldHedgeRetryPerTryTimeout(DoRetryCallback callback) {
  // A hedged retry on per try timeout is always retried if there are retries
  // left. NOTE: this is a bit different than non-hedged per try timeouts which
  // are only retried if the applicable retry policy specifies either
  // RETRY_ON_5XX or RETRY_ON_GATEWAY_ERROR. This is because these types of
  // retries are associated with a stream reset which is analogous to a gateway
  // error. When hedging on per try timeout is enabled, however, there is no
  // stream reset.
  return shouldRetry(true, callback);
}
(the comment wording there is also confusing!)
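To make the distinction in that source comment concrete, here is a minimal sketch of the rule it describes. It is not Envoy's implementation: the function name `retriesPerTryTimeout` and the bit values are hypothetical, chosen only to illustrate that a hedged per-try timeout depends on remaining retries alone, while a non-hedged one also needs a matching retry policy.

```cpp
#include <cstdint>

// Hypothetical bit values for illustration; not Envoy's actual constants.
constexpr uint32_t kRetryOn5xx = 0x1;
constexpr uint32_t kRetryOnGatewayError = 0x2;
constexpr uint32_t kRetryOnConnectFailure = 0x4;

// Toy model of the rule stated in the source comment:
// - with hedging, a per-try timeout is retried whenever retries remain;
// - without hedging, it is retried only if the policy includes 5xx or
//   gateway-error, because the timeout is treated like a stream reset.
bool retriesPerTryTimeout(bool hedging_enabled, uint32_t retry_on,
                          uint32_t retries_remaining) {
  if (retries_remaining == 0) {
    return false;  // no retry budget left, regardless of policy
  }
  if (hedging_enabled) {
    return true;  // the policy bits are not consulted in this model
  }
  return (retry_on & (kRetryOn5xx | kRetryOnGatewayError)) != 0;
}
```

Under this model, a policy of only `kRetryOnConnectFailure` would hedge on per-try timeout but would not trigger a non-hedged retry for the same timeout.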
Interesting. I was actually working off my experience using this feature: I only started seeing (hedged) retries once I set x-envoy-retry-on.
Looking at this code, it seems that retries_remaining_ would only get initialized to a non-zero value if retry_on_ is set:
envoy/source/common/router/retry_state_impl.cc
Lines 121 to 127 in 2709b6b
    if (retry_on_ != 0 && request_headers.EnvoyMaxRetries()) {
      uint64_t temp;
      if (absl::SimpleAtoi(request_headers.getEnvoyMaxRetriesValue(), &temp)) {
        // The max retries header takes precedence if set.
        retries_remaining_ = temp;
      }
    }
Thus shouldRetry would end up returning RetryStatus::NoRetryLimitExceeded for the hedged requests.
I can imagine this wasn't intended, since it surprised me too, but on the other hand it's congruent with how non-hedged retrying behaves.
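The effect described above can be reproduced in a small toy model. This is only a sketch of the one branch quoted from retry_state_impl.cc: the `ToyRetryState` type is hypothetical, it ignores every other way the real class might set its retry budget (for example route-level configuration), and it assumes the budget starts at zero.

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for the retry state; not Envoy's class.
enum class RetryStatus { Yes, NoRetryLimitExceeded };

struct ToyRetryState {
  uint32_t retry_on;           // bitmask from x-envoy-retry-on; 0 means unset
  uint32_t retries_remaining;  // assumed to start at 0 in this sketch

  // Mirrors only the quoted branch: the max-retries header is honored
  // solely when some retry-on condition has been configured.
  ToyRetryState(uint32_t retry_on_mask, bool has_max_retries_header,
                uint32_t max_retries_header)
      : retry_on(retry_on_mask), retries_remaining(0) {
    if (retry_on != 0 && has_max_retries_header) {
      retries_remaining = max_retries_header;
    }
  }

  // A hedged per-try timeout is retried whenever budget remains.
  RetryStatus shouldHedgeRetryPerTryTimeout() {
    if (retries_remaining == 0) {
      return RetryStatus::NoRetryLimitExceeded;
    }
    --retries_remaining;
    return RetryStatus::Yes;
  }
};
```

In this model, constructing the state with a max-retries header of 3 but no retry-on mask leaves the budget at zero, so the first hedge attempt is refused, which matches the behavior reported in the thread.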
// This will only occur if the :ref:`RetryPolicy <envoy_api_msg_route.RetryPolicy>` also indicates that
// timed out requests should be retried (e.g. retry_on set to 'gateway-error' etc). (Other retry policies
// would also apply, but would only have effect if the response came back before the request was hedged against;
// otherwise such responses would simply be discarded as a retry is already in flight.)
I think this might be clearer stated as
"Any response received after the timeout and subsequent hedge attempt will never be retried, no matter the RetryPolicy"
Assuming timeout 150ms, per-try timeout of 50ms, 3 retries and retry-on: 5xx policy, and hedging enabled:
0ms: Request 1 sent.
50ms: Request 1 times out, (hedged) request 2 sent.
75ms: Request 2 (hedged) returns 500.
150ms: Request 1 times out.
Would there be a 3rd request?
(for simplicity assuming no exponential backoff in those timings)
there would be a 3rd request. Request 2 is considered a new attempt, so it will be retried if it times out or returns a 500. If request 1 comes back with a 500 after request 2 has already been sent, that will be dropped and not retried
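The scenario can be walked through with a toy event model. This is a sketch of the behavior as described in this thread, not of Envoy's router: `ToyHedgingRouter` and its methods are hypothetical, a retry-on 5xx policy is assumed, and backoff and the global timeout are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one in-flight attempt.
struct Attempt {
  bool hedged_against = false;  // true once its per-try timeout has fired
};

struct ToyHedgingRouter {
  uint32_t retries_remaining;
  std::vector<Attempt> attempts;

  // Sending the initial request does not consume a retry.
  explicit ToyHedgingRouter(uint32_t max_retries)
      : retries_remaining(max_retries) {
    attempts.push_back(Attempt{});
  }

  // Per-try timeout on attempt i: the attempt stays in flight, and a
  // hedged request is sent if retries remain.
  bool onPerTryTimeout(std::size_t i) {
    attempts[i].hedged_against = true;
    return retryIfAllowed();
  }

  // Attempt i returned a 5xx. If it was already hedged against, the
  // response is dropped; otherwise it counts as a retriable failure.
  bool on5xx(std::size_t i) {
    if (attempts[i].hedged_against) {
      return false;
    }
    return retryIfAllowed();
  }

  bool retryIfAllowed() {
    if (retries_remaining == 0) {
      return false;
    }
    --retries_remaining;
    attempts.push_back(Attempt{});
    return true;
  }
};
```

Replaying the timeline with 3 retries: the per-try timeout on request 1 sends request 2, the 500 from request 2 sends request 3, and a late 500 from request 1 is dropped without sending anything further.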
ahh, I read (Any response received after the timeout) AND (subsequent hedge attempt) -> will never be retried
🤦
// * After per-try timeout, an error response would be discard, as a retry in the form of a hedged request is already in progress.
//
// Note: For this to have effect, the :ref:`RetryPolicy <envoy_api_msg_route.RetryPolicy>` must be one that retries on timeout
// (e.g. `gateway-error`).
i don't think this is true, you don't need to have gateway-error to have it be retried. A per try timeout is always retried when hedging is enabled
I tested it and found it to be true. (In fact, I wasted quite some time trying to understand why hedging didn't work for me before I tried adding a simple x-envoy-retry-on to my calls.)
I actually wrote this above in reply to a similar question you raised:
Looking at this code, it seems that retries_remaining_ would only get initialized to a non-zero value if retry_on_ is set:
envoy/source/common/router/retry_state_impl.cc
Lines 121 to 127 in 2709b6b
    if (retry_on_ != 0 && request_headers.EnvoyMaxRetries()) {
      uint64_t temp;
      if (absl::SimpleAtoi(request_headers.getEnvoyMaxRetriesValue(), &temp)) {
        // The max retries header takes precedence if set.
        retries_remaining_ = temp;
      }
    }
Thus shouldRetry would end up returning RetryStatus::NoRetryLimitExceeded for the hedged requests.
I can imagine this wasn't intended, since it surprised me too, but on the other hand it's congruent with how non-hedged retrying behaves.
interesting! I do think that isn't intentional. In that case I'm not really sure how to word the comment, maybe you could say "you must have a RetryPolicy that retries at least one error code and specify the max number of retries". From that code snippet, though, it looks like you don't have to have gateway-error specifically
Signed-off-by: Ilya Konstantinov <[email protected]>
@ikonst I think this change looks good now. Could you a) merge master b) fix the formatting issue and c) apply the same change to the V3 docs (then run
Also friendly reminder that force pushing breaks the reviewing flow for many, so avoid it if you can. Thanks!
I'm not a fan of rebases either since they break the already-reviewed / new-commits separation. (Perhaps I did this one because I forgot some sign-offs mid-way?)
/retest
Retrying Azure Pipelines:
@snowp ^
LGTM, thanks!
@envoyproxy/api-shepherds for API review and V2 sign off
/lgtm v2-freeze
* master: (70 commits)
  upstream: avoid reset after end_stream in TCP HTTP upstream (envoyproxy#14106)
  bazelci: add fuzz coverage (envoyproxy#14179)
  dependencies: allowlist CVE-2020-8277 to prevent false positives. (envoyproxy#14228)
  cleanup: replace ad-hoc [0, 1] value types with UnitFloat (envoyproxy#14081)
  Update docs for skywalking tracer (envoyproxy#14210)
  Fix some errors in the switch statement when decode dubbo response (envoyproxy#14207)
  Windows: enable tests and envoy-static.exe pdb file (envoyproxy#13688)
  http: add Kill Request HTTP filter (envoyproxy#14170)
  dependencies: fix release_dates error behavior. (envoyproxy#14216)
  thrift filter: support skip decoding data after metadata in the thrift message (envoyproxy#13592)
  update cares (envoyproxy#14213)
  docs: clarify behavior of hedge_on_per_try_timeout (envoyproxy#12983)
  repokitteh: add support for randomized auto-assign. (envoyproxy#14185)
  [grpc] validate grpc config for illegal characters (envoyproxy#14129)
  server: Return nullopt when process_context is nullptr (envoyproxy#14181)
  [Windows] Fix thrift proxy tests (envoyproxy#13220)
  kafka: add missing unit tests (envoyproxy#14195)
  doc: mention gperftools explicitly in PPROF.md (envoyproxy#14199)
  Removed `--use-fake-symbol-table` option. (envoyproxy#14178)
  filter contract: clarification around local replies (envoyproxy#14193)
  ...
Signed-off-by: Michael Puncel <[email protected]>