Don't return update handles until desired stage reached #2066
Conversation
@WorkflowInterface
public interface SimpleWorkflowWithUpdate {

This was unused
UpdateWorkflowExecutionResponse result;
do {
  result = genericClient.update(updateRequest, pollTimeoutDeadline);
} while (result.getStage().getNumber() < input.getWaitPolicy().getLifecycleStage().getNumber()

Do we set a default for input.getWaitPolicy()?
Per @drewhoskins-temporal's latest requirements, we want wait-for-stage to be a required field for start. Also, we should call it "wait-for-stage" IMO to match Python and future SDKs (or if we don't like that term, we should call it something else and be consistent across SDKs with what it is called).
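The quoted do/while is the heart of that behavior: keep re-issuing the update RPC until the reported lifecycle stage is at least the requested wait-for-stage. A minimal JDK-only sketch of the comparison (the `Stage` names and `updateOnce` are hypothetical stand-ins for the proto lifecycle-stage enum and `genericClient.update`):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class WaitForStage {
    // Assumed ordering, mirroring UpdateWorkflowExecutionLifecycleStage in the proto.
    enum Stage { ADMITTED, ACCEPTED, COMPLETED }

    private final AtomicInteger calls = new AtomicInteger();

    // Hypothetical stand-in for genericClient.update(...): the server reports
    // a progressively later stage on each long-poll attempt.
    private Stage updateOnce() {
        return Stage.values()[Math.min(calls.getAndIncrement(), Stage.values().length - 1)];
    }

    // Re-issue the update RPC until the reported stage is at least the
    // requested wait-for-stage (same idea as comparing the stage numbers).
    Stage waitForStage(Stage waitForStage) {
        Stage reported;
        do {
            reported = updateOnce();
        } while (reported.compareTo(waitForStage) < 0);
        return reported;
    }

    public static void main(String[] args) {
        System.out.println(new WaitForStage().waitForStage(Stage.ACCEPTED)); // ACCEPTED
    }
}
```

Making wait-for-stage required removes the question of what a sensible default would be, which is likely why the other SDKs went that way.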
@@ -334,8 +334,17 @@ public <R> StartUpdateOutput<R> startUpdate(StartUpdateInput<R> input) {
  .setRequest(request)
  .build();
Deadline pollTimeoutDeadline = Deadline.after(POLL_UPDATE_TIMEOUT_S, TimeUnit.SECONDS);
Shouldn't the deadline be in the loop?
Arguably it doesn't need to be set at all
I unset this in the most recent commit - but I'm not sure having a super long timeout by default is what we want to do? OTOH, I don't have a firm reason why not, I suppose.
I have no strong opinion, so long as it's always longer than the server's by enough to let the server return an empty response on its timeout.
We should treat start update as a long poll, hence the long timeout
Moved
workflowClientInvoker.pollWorkflowUpdate(
    new WorkflowClientCallsInterceptor.PollWorkflowUpdateInput<>(
        execution, updateName, id, resultClass, resultType, timeout, unit));
WorkflowClientCallsInterceptor.PollWorkflowUpdateOutput<T> pollCall;
Sorry, I have read this a few times and I'm not sure what this logic is trying to accomplish?
Before the handle is returned from start when the user asks for COMPLETED, there might already be a result from polling; if there is, we want to use that, otherwise try the poll - but then we need to wipe that result in case getResult gets called again.
Hm, this seems racy - if you have two concurrent calls, isn't it possible for pollCall to be null if two threads interleave in the right way?
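One way to make the consume-the-cached-poll step race-free (a sketch with hypothetical names, not the PR's code): take ownership of the cached future with `AtomicReference.getAndSet(null)`, so at most one concurrent getResult caller can observe it:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class UpdateHandleSketch {
    // Cached poll started before the handle was returned (may be null).
    private final AtomicReference<CompletableFuture<String>> cachedPoll = new AtomicReference<>();

    void cachePoll(CompletableFuture<String> poll) {
        cachedPoll.set(poll);
    }

    CompletableFuture<String> getResult() {
        // getAndSet(null) atomically takes ownership of the cached poll:
        // at most one caller ever observes a non-null value, so two
        // interleaved calls can never both consume (or both miss) it.
        CompletableFuture<String> poll = cachedPoll.getAndSet(null);
        return poll != null ? poll : startNewPoll();
    }

    // Hypothetical stand-in for issuing a fresh poll RPC.
    private CompletableFuture<String> startNewPoll() {
        return CompletableFuture.completedFuture("fresh poll");
    }

    public static void main(String[] args) {
        UpdateHandleSketch h = new UpdateHandleSketch();
        h.cachePoll(CompletableFuture.completedFuture("cached"));
        System.out.println(h.getResult().join()); // cached
        System.out.println(h.getResult().join()); // fresh poll
    }
}
```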
Ah, yes it is - I didn't think about this being called concurrently; too used to Rust.
Simplified this (works better as a cache now too)
@@ -23,7 +23,19 @@
import io.temporal.api.enums.v1.UpdateWorkflowExecutionLifecycleStage;

public enum UpdateWaitPolicy {

Hmm, in Python it looks like we changed the name to WorkflowUpdateStage. I think we should do the same here, because this enum will also be used when describing an update's stage.
👍 And the docs here about what each enum means only apply to starting an update and maybe should move to there (but maybe not).
K, I'm down to change the name
Docs wise, weirdly, even the proto APIs mention nothing about what the stages mean beyond as input to requests. That would be good to fix.
UpdateWorkflowExecutionResponse result;
do {
  result = genericClient.update(updateRequest, pollTimeoutDeadline);
} while (result.getStage().getNumber() < input.getWaitPolicy().getLifecycleStage().getNumber()

I think the latest requirement for start was: if the wait stage is COMPLETED, then after ACCEPTED you switch to polling for the response before returning from the start call. Can you confirm, at least from the user's perspective, that this occurs?
👍 Sorry I missed that
// by the user.
UpdateWorkflowExecutionResponse result;
do {
  result = genericClient.update(updateRequest, pollTimeoutDeadline);
Can you make sure down below in pollWorkflowUpdateHelper that you remove the logic that retries on gRPC deadline-exceeded errors? That should no longer occur; we should just be bubbling all errors out.
Wait, isn't the Python one doing that when it passes retry=True to the service client? Or, if that doesn't retry timeouts, then where is that happening? Because https://github.com/temporalio/sdk-python/blob/1a2acd59634a3b1d694937b8a8433c0014247370/temporalio/client.py#L4303 says it will, but there's no explicit handling of timeouts here: https://github.com/temporalio/sdk-python/blob/1a2acd59634a3b1d694937b8a8433c0014247370/temporalio/client.py#L4359
It's easy to change Java to not do this and just default to the max timeout for getResult calls, but I'm not sure that's the right thing to do.
(I committed it so we can see what I mean - it works fine, but seems like it's maybe not right? At a minimum, what the Python doc says vs. what Python does is either inconsistent, or the loop is not needed, or it's not the same as what I've just done here.)
Ok, I need to update that Python doc to remove that last sentence (I fixed the logic but forgot about the docs). We are no longer using timeouts/exceptions to drive the loop.
Just need to remove the idea that deadline exceeded means something special in the start/poll loop. Let all RPC exceptions bubble out as they always would, and change the code to only care about the successful result instead of the whenComplete today that cares about either result or failure (not sure what the combinator is for success-only).
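For the success-only combinator: `CompletableFuture.thenApply` (or `thenAccept`) fits - it runs only on normal completion, and an exceptional upstream skips the function and propagates the failure to the downstream future unchanged. A tiny JDK-only illustration:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class SuccessOnly {
    // thenApply runs only when the upstream future completed normally;
    // on exceptional completion the mapping is skipped and the exception
    // propagates to the downstream future untouched.
    static CompletableFuture<String> mapOutcome(CompletableFuture<String> upstream) {
        return upstream.thenApply(r -> "mapped:" + r);
    }

    public static void main(String[] args) {
        System.out.println(mapOutcome(CompletableFuture.completedFuture("outcome")).join());
        // mapped:outcome

        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("rpc error"));
        try {
            mapOutcome(failed).join();
        } catch (CompletionException e) {
            // The original failure bubbled out; the mapping function never ran.
            System.out.println(e.getCause().getMessage()); // rpc error
        }
    }
}
```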
Yeah, that's done now. All it's doing is just interpreting the failure code into the right exception type which makes sense to me.
return pollCall
    .getResult()
    .exceptionally(
        failure -> {
          // If the poll didn't find the completion successfully, reset the previous poll call
Just because the poll call failed does not mean it shouldn't be cached, right? An update rejection would also complete the poll future exceptionally, if I recall correctly. I would probably drop caching, since we don't cache the workflow result either, and if we get user feedback, address all these functions with a consistent strategy.
Sure, I could just delete it every time still. I need something like it for the don't-return-handle-until-completed case and I figured why not cache it for the success case since that's definitely not going to change.
I think for the don't-return-handle-until-completed case you can take the result and put it in the CompletedUpdateHandleImpl
The reason I'd done it this way is to still get all the exception-conversion stuff for free (and the encapsulation of the parameters). So I'll just wipe out the cache every time.
K, I've done a much more targeted version of this.
We still do want to avoid caching most exceptions though, just the ones from the update outcome should be cached
WorkflowClientCallsInterceptor.PollWorkflowUpdateOutput<T> pollUntilComplete(
    long timeout, TimeUnit unit) {
  synchronized (this) {
Using synchronized like this can be really problematic with virtual threads, because the virtual thread will be pinned while executing pollWorkflowUpdate: https://mikemybytes.com/2024/02/28/curiosities-of-java-virtual-threads-pinning-with-synchronized/
Ah, interesting - I didn't realize that.
Yeah, it is a very unfortunate issue with the current virtual-thread limitation. My current stance is to avoid I/O in synchronized blocks and remove any I see.
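A minimal sketch of that alternative, assuming the blocking call is factored out (`doBlockingPoll` is a hypothetical stand-in for pollWorkflowUpdate): a `java.util.concurrent.locks.ReentrantLock` lets a virtual thread unmount while it waits, where a synchronized block would pin it:

```java
import java.util.concurrent.locks.ReentrantLock;

public class PollGuard {
    private final ReentrantLock lock = new ReentrantLock();
    private String cachedResult;

    // Hypothetical stand-in for the blocking pollWorkflowUpdate call.
    private String doBlockingPoll() {
        return "outcome";
    }

    String pollUntilComplete() {
        // Unlike a synchronized block, parking on a ReentrantLock lets a
        // virtual thread unmount its carrier thread while it waits.
        lock.lock();
        try {
            if (cachedResult == null) {
                cachedResult = doBlockingPoll();
            }
            return cachedResult;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println(new PollGuard().pollUntilComplete()); // outcome
    }
}
```

(Recent JDKs reduce synchronized pinning, but keeping I/O out of critical sections is still the safer habit.)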
.whenComplete(
    (r, e) -> {
      if ((e instanceof StatusRuntimeException
              && ((StatusRuntimeException) e).getStatus().getCode()
                  == Status.Code.DEADLINE_EXCEEDED)
          || pollTimeoutDeadline.isExpired()
          || deadline.isExpired()
          || (e == null && !r.hasOutcome())) {
I think this is supposed to recurse in this situation (keeps retrying until outcome is present)
Right, OK - I keep getting confused about the situations where the server could return no outcome, but it's like long polling on tasks.
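The retry-until-outcome shape being discussed can be sketched with plain CompletableFuture composition (`pollOnce` is a hypothetical stand-in for a single poll RPC; an empty Optional models a long-poll response with no outcome):

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class LongPollLoop {
    final AtomicInteger attempts = new AtomicInteger();

    // Hypothetical single poll: the server returns an empty response
    // (no outcome) until the update finally completes.
    private CompletableFuture<Optional<String>> pollOnce() {
        int n = attempts.incrementAndGet();
        return CompletableFuture.completedFuture(
            n < 3 ? Optional.empty() : Optional.of("outcome"));
    }

    // Recurse until the server reports an outcome. RPC failures are not
    // treated specially here; they simply fail the returned future.
    CompletableFuture<String> pollUntilOutcome() {
        return pollOnce().thenCompose(
            r -> r.map(CompletableFuture::completedFuture)
                  .orElseGet(this::pollUntilOutcome));
    }

    public static void main(String[] args) {
        LongPollLoop loop = new LongPollLoop();
        System.out.println(loop.pollUntilOutcome().join()); // outcome
        System.out.println(loop.attempts.get()); // 3
    }
}
```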
Don't we also need to update the test server to return the update's current lifecycle stage? Or does the test server never actually return an empty response?
I have changed it to do so - here, for example: https://github.com/temporalio/sdk-java/pull/2066/files#diff-809c076b3ee441df02cf0c4566a20f7abc69c94c15d8b689875350ae3fcdbfd9R1808
}
    || deadline.isExpired()) {
  resultCF.completeExceptionally(
      new TimeoutException(
Note this may be changing shortly with #2069
synchronized (this) {
  if (waitCompletedPollCall != null) {
    pollCall = waitCompletedPollCall;
    waitCompletedPollCall = null;
setFromWaitCompleted is never changed to true. I think the intention was to do that here?
yeah this looks like a bug
The Java StartUpdate code path is different from the other SDKs in a few subtle ways as well; I'll try to align it with the other SDKs.
What was changed
Why?
See temporalio/features#432
Checklist
Closes [Feature Request] SDK should not return an update handle if the update has not reached the desired state #2002
How was this tested:
Existing / new tests
Any docs updates needed?