Async Workflow Update Outcome Polling #4154

mmcshane · 2023-04-11T14:47:30Z

What changed?

Adds update outcome map to MutableState and functions to lookup in the same
Adds workflow.Observer framework to watch workflow Contexts across cache eviction
Adds future.Proxy for use in a concrete update Observer
Implements the PollWorkflowExecutionUpdate rpc all the way through to Update.Update

Why?
To support polling for the outcome of async updates.

How did you test it?
Unit tests here and e2e test in features repo

Potential risks
Some tricky thread safety and memory tracking pieces in here. Proxy.Rebind and the whole Observers capability in general.

Is hotfix candidate?
Nah

Results of completed updates are stored as pointers into the event history and can be retrieved by update ID.

The trickiness here is that workflow.Context instances can be paged in and out of a cache so you don't want your observer watching a Context that has been evicted. The work here allows observers to be reconnected to a Context when it is loaded back into cache. The Observer module makes not assumption about how observer registration or notification is performed, it only promises to call the Connect function on all appropriate Observer instances, concrete implementations of which will do the type-specific registration and notification.

future.Proxy implements a future.Future that wraps an underlying target Future. That target instance can be changed at runtime without disturbing any existing pollers (i.e. callers of Get) on the Proxy object via the Proxy.Rebind call.

Making use of workflow observers to watch in-flight updates and outcome storage in MutableState/History for completed updates.

alexshtin · 2023-04-18T00:25:25Z

common/future/proxy.go

+		if errors.Is(err, context.Canceled) && parentCtx.Err() == nil {
+			// local context was canceled but not the parent context -
+			// this only happens via a call to Rebind so here we loop back to
+			// call Get() again on the new target.
+			continue
+		}


I have a feeling that this can be done with channels in a more clear way. Here, call should block on select from localTarget.Get or channel to which Rebind sends new future.

I would very much like for this function to be cleaner but I wasn't able to find a way. Can you pseudo-code something up? If I'm understanding correctly, if the number of concurrent callers is >1 and we write the "next" future into a channel, one caller would read it from the channel but how would the others know? Or is the suggestion to have one channel per caller and write the future into all of the channels?

alexshtin · 2023-04-18T01:12:32Z

service/history/workflow/mutable_state_impl.go

@@ -3202,6 +3241,7 @@ func (ms *MutableStateImpl) AddWorkflowExecutionUpdateCompletedEvent(updResp *up
 		return nil, err
 	}
 	event := ms.hBuilder.AddWorkflowExecutionUpdateCompletedEvent(updResp)
+	ms.updateOutcomes[updResp.GetMeta().GetUpdateId()] = &historyspb.EventHistoryPointer{EventId: event.GetEventId()}


This line needs to be in ReplicateWorkflowExecutionUpdateCompletedEvent which also needs to be called from ReapplyEvents func. This will address comment bellow.

Not sure which you mean by "comment below" or if you're saying that this line should be in both places or just one or the other?

alexshtin · 2023-04-18T01:14:04Z

service/history/workflow/observers.go

+	// so a caller should avoid doing things like calling ObserverSet.Remove
+	// from within a call to ObserverSet.FindOrCreate.
+	ObserverSet struct {
+		mut     sync.Mutex


If mutex protects single field, we usually just embed it. Same for future.Proxy. I believe it is common pattern.

That publishes the Lock/Unlock functions on ObserverSet which IMO is a non-starter. How can the type protect itself against some random external entity calling Lock()? With the mutex being a private field one can reason about the locking locally. Also publishing those functions limits any future ability to switch to something like non-locking concurrency control.

FWIW, here's a discussion on the topic uber-go/guide#127

service/history/workflow/mutable_state_impl.go

alexshtin · 2023-04-18T22:03:32Z

service/history/workflow/cache/cache.go

+
+	NewCacheFn func(shard.Context, Observers) Cache
+
+	disableObs int


Why not struct{}?

So that DisableObservation can be const instead of just var

service/history/workflow/update/update.go

service/history/workflow/update/registry.go

alexshtin · 2023-04-19T03:27:26Z

service/history/api/updateworkflow/api.go

-	if err != nil {
-		return nil, err
+	switch req.GetRequest().GetWaitPolicy().GetLifecycleStage() {
+	case enumspb.UPDATE_WORKFLOW_EXECUTION_LIFECYCLE_STAGE_ACCEPTED:


~~Probably~~ not in this PR but we need to support case when update was accepted and completed with the same WT. In this case result should be returned right away, even if LIFECYCLE_STAGE_ACCEPTED.

Agreed - there is a missed opportunity for an optimization here but I think we agree: not a correctness issue.

alexshtin · 2023-04-19T03:29:41Z

service/history/api/pollupdate/api.go

+	return nil
+}
+
+func (uo *updateObserver) AwaitOutcome(ctx context.Context) (*updatepb.Outcome, error) {


This doesn't have to be exported.

I tend to prefer UpperCase names on unexported types as a way to differentiate between functions that are expected to be called within the pkg and functions that shouldn't be called even though they can be. But it's a minor thing, I'll change this.

Also - I could have sworn I wrote a test for this pkg. Baffled. Must have lost it in a rebase or something. I'll try to reconstruct.

alexshtin · 2023-04-19T04:15:55Z

service/history/workflowTaskHandlerCallbacks.go

@@ -383,6 +384,7 @@ func (handler *workflowTaskHandlerCallbacksImpl) handleWorkflowTaskCompleted(
 	if err != nil {
 		return nil, err
 	}
+	defer workflowContext.GetContext().UpdateRegistry().Prune()


I believe it is not correct to call Prune here. It is race condition. What if handler code completes before API caller read update result?

By the time this runs, I believe (but may be wrong!) that (1) updates completed in this WFT will be written to MS as EventPointers, and (2) updates completed in this WFT will deliver an updatepb.Completed message to registered update.Update objects in the update.Registry. If those two things are correct then a Poll call through the API either registers its interest before Prune() in which case it will block on the Future in update.Update, or after Prune() in which case it will "miss" the update.Update object but will then find through MutableState/History.

Have I missed something?

yiminc

Please update ExecutionAPICountLimitOverride to include polling API there, it is used to limit concurrent request count.

mmcshane requested a review from alexshtin April 11, 2023 14:47

mmcshane marked this pull request as ready for review April 13, 2023 16:07

mmcshane requested a review from a team as a code owner April 13, 2023 16:07

mmcshane changed the title ~~wip~~ Async Workflow Update Outcome Polling Apr 13, 2023

Matt McShane added 5 commits April 18, 2023 21:11

Adds an update result map to MutableState

6a03974

Results of completed updates are stored as pointers into the event history and can be retrieved by update ID.

Adds future.Proxy, a rebindable Future

dd1ca27

future.Proxy implements a future.Future that wraps an underlying target Future. That target instance can be changed at runtime without disturbing any existing pollers (i.e. callers of Get) on the Proxy object via the Proxy.Rebind call.

Connect update poll request through to api pkg

7588828

Implementation of workflow update polling

b34ef1c

Making use of workflow observers to watch in-flight updates and outcome storage in MutableState/History for completed updates.

alexshtin reviewed Apr 19, 2023

View reviewed changes

Var naming and comments pursuant to code review

aaa43d7

yiminc reviewed Apr 22, 2023

View reviewed changes

mmcshane closed this Apr 24, 2023

mmcshane deleted the mpm/update-outcome-storage branch May 3, 2023 16:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Async Workflow Update Outcome Polling #4154

Async Workflow Update Outcome Polling #4154

mmcshane commented Apr 11, 2023 •

edited

Loading

alexshtin Apr 18, 2023

mmcshane Apr 19, 2023

alexshtin Apr 18, 2023

mmcshane Apr 19, 2023

alexshtin Apr 18, 2023

mmcshane Apr 19, 2023

alexshtin Apr 18, 2023

mmcshane Apr 19, 2023

alexshtin Apr 19, 2023 •

edited

Loading

mmcshane Apr 19, 2023

alexshtin Apr 19, 2023

mmcshane Apr 19, 2023

mmcshane Apr 19, 2023

alexshtin Apr 19, 2023

mmcshane Apr 19, 2023

yiminc left a comment


		NewCacheFn func(shard.Context, Observers) Cache

		disableObs int

Async Workflow Update Outcome Polling #4154

Async Workflow Update Outcome Polling #4154

Conversation

mmcshane commented Apr 11, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexshtin Apr 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yiminc left a comment

Choose a reason for hiding this comment

mmcshane commented Apr 11, 2023 •

edited

Loading

alexshtin Apr 19, 2023 •

edited

Loading