-
Notifications
You must be signed in to change notification settings - Fork 107
fix: truncate RPC timeouts to time remaining in totalTimeout #1191
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1191 +/- ##
=============================================
+ Coverage 58.91% 79.06% +20.14%
- Complexity 115 1197 +1082
=============================================
Files 20 205 +185
Lines 589 5268 +4679
Branches 60 435 +375
=============================================
+ Hits 347 4165 +3818
- Misses 213 930 +717
- Partials 29 173 +144
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good, but I'm concerned by the possibility of breaking LRO portion.
gax/src/main/java/com/google/api/gax/retrying/ExponentialRetryAlgorithm.java
Outdated
Show resolved
Hide resolved
Duration.ofNanos(clock.nanoTime()) | ||
.minus(Duration.ofNanos(prevSettings.getFirstAttemptStartTimeNanos())); | ||
Duration timeLeft = globalSettings.getTotalTimeout().minus(timeElapsed).minus(randomDelay); | ||
newRpcTimeout = Math.min(newRpcTimeout, timeLeft.toMillis()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if newRpcTimeout <= 0
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If newRpcTimeout <= 0
, the subsequent check by shouldRetry
would prevent another attempt from being made, because they do a similar check to see if the current time + the proposed random delay would exceed the totalTimeout
:
long totalTimeSpentNanos =
clock.nanoTime()
- nextAttemptSettings.getFirstAttemptStartTimeNanos()
+ nextAttemptSettings.getRandomizedRetryDelay().toNanos();
// If totalTimeout limit is defined, check that it hasn't been crossed
if (totalTimeout > 0 && totalTimeSpentNanos > totalTimeout) {
return false;
}
We could add some logic to handle if (timeLeft.isNegative() || timeLeft.isZero())
, but I'm not sure what we'd set it to...maybe allow newRpcTimeout
to remain unchanged by timeLeft
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After our offline conversation, I've decided to just document the likely behavior. LMK if you feel strongly otherwise.
.setInitialRpcTimeout(Duration.ofMillis(100)) | ||
.setMaxRpcTimeout(Duration.ofSeconds(1)) | ||
.setRpcTimeoutMultiplier(2) | ||
.setTotalTimeout(Duration.ofSeconds(5L)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the meaning of totalTimeout in case of LRO? Note that exponential retry algorithm is shared between retries and LRO polling logic. Whatever we do for retries, we must ensure that it does not break LRO. Ideally we want that new logic be completelly disabled fro LRO. I.e. the need to modify LRO tests indicates that there is a breaking behavioral change for LRO, which we should avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the same thought and I was confused by what this test was doing.
The test setup actually sets by default a totalTimeout
to be 5 ms
, but then in this test increases the initialRpcTimeout
and maxRpcTimeout
to values much greater than the existing totalTimeout
of 5ms
. Per comments in a generated client, the initialRpcTimeout
and maxRpcTimeout
should be ignored and set to 0 for LROs? Not really sure what the test is doing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW we had no GAPIC config for LRO polling initialRpcTimeout
or maxRpcTimeout
, so they would never be generated and only ever set by users...and they'd have the same behavior as non-LRO RPCs where a poll could run over the totalTimeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this test was testing for the existing incorrect behavior, where the RPC timeout didn't care if it exceeded the totalTimeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The totalTimeout
in the context of LRO is the "total polling timeout" (gapic config). So, the duration a "synchronous" LRO call should poll before giving up.
This test was set up weirdly, as I described. PTAL.
Would like others to approve this as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
gax/src/test/java/com/google/api/gax/retrying/ExponentialRetryAlgorithmTest.java
Outdated
Show resolved
Hide resolved
For the record: this change will cause this Java Showcase test in |
The RetrySettings documentation indicated that an RPC timeout would not be allowed to enable an RPC to exceed the
totalTimeout
while in flight. Meaning, if a calculated RPC timeout is greater than the time remaining in thetotalTimeout
(from the start of the first attempt), the RPC timeout should be equal to that time remaining. This prevents an RPC from being sent with a timeout that would allow it to execute beyond the time allotted by thetotalTimeout
.This was not being done in gax-java, even though the documentation stated it and several other language implementations of GAX are implemented to match this same documentation.
This PR corrects that.