Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add begin and wait API for LRO #811

Open
tadelesh opened this issue Apr 26, 2022 · 12 comments
Open

Add begin and wait API for LRO #811

tadelesh opened this issue Apr 26, 2022 · 12 comments
Assignees
Labels
design-discussion An area of design currently under discussion and open to team and community feedback.

Comments

@tadelesh
Copy link
Member

For mgmt. plane SDK, we want to add an API to create poller and poll final result simultaneously. It is an addition for current create poller API (BeginCreateOrUpdate). The possible name for the API is the same with the rest operation name (CreateOrUpdate). The reason is in mgmt. scenario, user always emphasis on the resource result regardless the operation is LRO or not. The problem is duplicity and may involve confusion. Since this should be a non-breaking feature, we may gather more user feedback first.

@tadelesh tadelesh added the design-discussion An area of design currently under discussion and open to team and community feedback. label Apr 26, 2022
@jhendrixMSFT
Copy link
Member

Another option as discussed offline is to provide a wrapper that customers use to get the same behavior.

type OperationWaiter[TResult any] struct {
    poller *runtime.Poller[TResult]
    err    error
}

func (ow OperationWaiter[TResult]) Wait(ctx context.Context, freq time.Duration) (TResult, error) {
    if ow.err != nil {
        return *new(TResult), ow.err
    }
    return ow.poller.PollUntilDone(ctx, freq)
}

func NewOperationWaiter[TResult any](poller *runtime.Poller[TResult], err error) OperationWaiter[TResult] {
    return OperationWaiter[TResult]{poller: poller, err: err}
}

This would allow callers to condense an LRO into a one-line call.

result, err := NewOperationWaiter(client.BeginSomeLRO(context.Background(), nil)).Wait(context.Background(), time.Second)

@xboxeer
Copy link

xboxeer commented May 7, 2022

Based on our user study, true feelings of the LRO and none LRO operation is that they make the code look inconsistent. I know we have customer who write infra level code that would be ok with detailed control but there is customer who write none infra level code, consistency and easy to understand is one of their priorities, for those LRO operation, can we provide a none LRO-look-like call as an alternative?

@tadelesh tadelesh self-assigned this Aug 18, 2022
@tadelesh
Copy link
Member Author

We've noticed that it usually happened that service change an operation from synchronized to asynchronized which cause Go SDK breaking changes. In order to prevent such breaking changes, I'd prefer to generate both synchronized and asynchronized function for LRO. Besides preventing breaking changes, it could also give a simple usage for customer who does not care of the LRO undelaying logic. @jhendrixMSFT @JeffreyRichter What do you think?

@jhendrixMSFT
Copy link
Member

It's an interesting idea. Do you have an idea of how many breaking changes are caused by this?

@tadelesh
Copy link
Member Author

Among current 7 Go mgmt. SDK packages which have v2+ version, 2 of them have the breaking of such change. And I have met more during my swagger review work. Here are some examples:

  1. Mediaservices_CreateOrUpdate
  2. FarmBeatsModels_Update
  3. Workspaces_Update

@tadelesh
Copy link
Member Author

Got another one from SQL. Azure/azure-rest-api-specs#19953

@JeffreyRichter
Copy link
Member

So first, let me say that services should not make this specific breaking change and so there should not be a lot of these.
However, they do happen, and I appreciate the effort to not break Go SDK customers.
I'll also add that the service team is making this break for a reason and not just to make the customer experience worse.

So, my concern is that if the SDK initiates an LRO and this succeeds but then the polling fails, we return an error to the customer and now the customer doesn't know (since we put multiple operations into a single method) what to do: Did initiating the operation fail and so it never got started? Or, did it get started and now there is a problem with knowing its state?

The error recovery for these 2 scenarios is VERY different and a customer would have to write the proper code to deal with them separately/correctly. Having a wrapper function makes this much harder.

@tadelesh
Copy link
Member Author

tadelesh commented Sep 6, 2022

Can we add a new PollingError to distinguish two steps?
Here is some prototype:
Custom error to wrap inner error:

type PollingError[T any] struct {
	inner  error
	Poller *Poller[T]
}

func (e *PollingError[T]) Error() string {
	return e.inner.Error()
}

func (e *PollingError[T]) Unwrap() error {
	return e.inner
}

func NewPollingError[T any](err error, poller *Poller[T]) error {
	return &PollingError[T]{
		inner:  err,
		Poller: poller,
	}
}

Synchronized LRO:

func (client *LinkedServerClient) Delete(ctx context.Context, resourceGroupName string, name string, linkedServerName string, options *LinkedServerClientDeleteOptions) (LinkedServerClientDeleteResponse, error) {
	lroOptions := LinkedServerClientBeginDeleteOptions(*options)
	poller, err := client.BeginDelete(ctx, resourceGroupName, name, linkedServerName, &lroOptions)
	if err != nil {
		return LinkedServerClientDeleteResponse{}, err
	}
	return poller.PollUntilDone(ctx, nil)
}

Usage:

func TestLinkedServerClient_Delete(t *testing.T) {
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		log.Fatalf("failed to obtain a credential: %v", err)
	}
	ctx := context.Background()
	client, err := NewLinkedServerClient("subid", cred, nil)
	if err != nil {
		log.Fatalf("failed to create client: %v", err)
	}
	_, err = client.Delete(ctx, "rg1", "cache1", "cache2", nil)
	if err != nil {
		var pollingErr *runtime.PollingError[LinkedServerClientDeleteResponse]
		if errors.As(err, &pollingErr) {
			poller := pollingErr.Poller
			_ = poller // retry with poller
			log.Fatalf("Inner error: %+v", errors.Unwrap(err))
		} else {
			log.Fatalf("Other error: %+v", err)
		}
	}
}

The generic type is a problem for now. But if this way could work, it should be some more elegant implementation.

@JeffreyRichter
Copy link
Member

Your sample codes shows how to detect the error but not how to properly recover from the error.
If the client gets a PollingError, how does it continue polling without initiating the operation again?
Also, if the error is a networking error, then was it due to the initiating call or a polling call?
In theory, you could wrap any underlying error with PollingError but the proper error handling becomes much more complex now.

@xboxeer
Copy link

xboxeer commented Sep 7, 2022

@JeffreyRichter I think in Chenjie's code

if errors.As(err, &pollingErr) {
			poller := pollingErr.Poller
			_ = poller 
			log.Fatalf("Inner error: %+v", errors.Unwrap(err))
		} else {
			log.Fatalf("Other error: %+v", err)
		}

If the error is a polling error, developers can get that poller from the error and try poll until done again
If the error is not a polling, developers can try to re-launch the request, or just simply throw that as an exception
I understand there will be situation that the poller failed again, however that would be some policy setting in customer's code, both our old way of LRO and the new way of LRO cannot handle the failure situation happened in error recovering code

@tadelesh
Copy link
Member Author

tadelesh commented Sep 7, 2022

Yes. Here is the comparison:
Async:

	poller, err := client.BeginDelete(ctx, "rg1", "cache1", "cache2", nil)
	if err != nil {
		// retry for start operation failure
	}
	_, err=poller.PollUntilDone(ctx, nil)
	if err != nil {
		// retry for polling failure
	}

Sync:

	_, err = client.Delete(ctx, "rg1", "cache1", "cache2", nil)
	if err != nil {
		var pollingErr *runtime.PollingError[LinkedServerClientDeleteResponse]
		if errors.As(err, &pollingErr) {
			poller := pollingErr.Poller
			// retry for polling failure
		} else {
			// retry for start operation failure
		}
	}

@tadelesh
Copy link
Member Author

tadelesh commented Sep 26, 2022

@jhendrixMSFT To answer your concern about the execution time increasement if we provide a synchronized method for LRO, I list the LRO execution time distribution as follows. Most operations can be done in 20 seconds. Regarding we also provide asynchronized begin method for customer to choose, it will not block the customer and I think the change is worthy.

Some statistics related to total time for LRO based on CLI telemetry data:
88% of the LRO calls could be completed in 32sec
92% in ~1min (64 sec)
95% in 2min+ (130 sec)
The raw data could be found in polling time consumption analysis - Azure CLI.xlsx

Time(ms) operations percentage
1 92 0.00
2 94 0.004265747
4 118 0.005354874
8 146 0.006625522
16 479 0.021737157
32 1235 0.056044654
64 2325 0.105509167
128 3911 0.177482302
256 6025 0.273416228
512 8143 0.369531675
1024 10524 0.477582138
2048 12760 0.57905246
4096 14827 0.672853512
8192 16653 0.755717916
16384 18200 0.82592122
32768 19482 0.884098748
65536 20356 0.923761118
131072 21003 0.953122164
262144 21433 0.972635687
524288 21669 0.983345435
1048576 21827 0.99051552
2097152 21930 0.99518969

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design-discussion An area of design currently under discussion and open to team and community feedback.
Projects
None yet
Development

No branches or pull requests

4 participants