ClientModel: ClientRetryPolicy should consider retry-after header #44222

Closed
annelo-msft opened this issue May 23, 2024 · 18 comments · Fixed by #45078

Comments

@annelo-msft (Member)

Azure.Core's retry policy uses the default DelayStrategy implementation to determine how long to wait before retrying, based on the response's Retry-After header. Since we don't have DelayStrategy in SCM, we did not implement this logic in ClientRetryPolicy. We should implement it, though, and we may be able to address this issue as part of adding LRO abstractions to SCM in #42114. We could have an internal type that implements exponential backoff and respects the Retry-After header by default, and make it overridable if someone specifies a fixed polling interval. Subtypes of the retry policy, or LRO implementations from TypeSpec, could then override this behavior if needed for their service-specific implementations.
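A minimal sketch of the kind of delay logic described above -- respect Retry-After when the service sends it, otherwise fall back to exponential backoff. The helper name and base delay are illustrative, not SCM API, and only the delta-seconds form of Retry-After is handled (the HTTP-date form is ignored here):

using System;
using System.ClientModel.Primitives;
using System.Globalization;

// Illustrative only: honor the service's Retry-After (delta-seconds form) when
// present, otherwise fall back to exponential backoff from an assumed base delay.
internal static class RetryDelaySketch
{
    private static readonly TimeSpan BaseDelay = TimeSpan.FromSeconds(0.8); // assumed base delay

    public static TimeSpan GetRetryDelay(PipelineResponse? response, int tryCount)
    {
        if (response is not null &&
            response.Headers.TryGetValue("Retry-After", out string? value) &&
            double.TryParse(value, NumberStyles.Float, CultureInfo.InvariantCulture, out double seconds))
        {
            return TimeSpan.FromSeconds(seconds);
        }

        // Exponential backoff: BaseDelay * 2^(tryCount - 1), e.g. 0.8s, 1.6s, 3.2s, ...
        return TimeSpan.FromMilliseconds(BaseDelay.TotalMilliseconds * Math.Pow(2, Math.Max(0, tryCount - 1)));
    }
}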

@annelo-msft annelo-msft added Client This issue points to a problem in the data-plane of the library. Azure.Core System.ClientModel Base Core library labels May 23, 2024
@annelo-msft annelo-msft added this to the 2024-06 milestone May 23, 2024
@github-actions (bot)

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@github-actions github-actions bot added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label May 23, 2024
@annelo-msft annelo-msft removed the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label May 23, 2024
@jsquire jsquire modified the milestones: 2024-06, Backlog Jun 19, 2024
@annelo-msft annelo-msft modified the milestones: Backlog, 2024-08 Jul 18, 2024
@Pilchie (Member) commented Jul 29, 2024

Sorry to bump this old issue - I've been using Azure OpenAI, and I notice that it uses SCM's ClientPipelinePolicy, but Azure OpenAI's rate limiting does return a Retry-After header. Any chance you know of an implementation of an SCM retry policy that does take this into account?

@annelo-msft (Member, Author)

Hi @Pilchie - a fix for this issue was released in SCM 1.1.0-beta.6. If you reference this version, do you still see the issue?

@Pilchie (Member) commented Sep 10, 2024

Hi @annelo-msft - thanks for looking at this, and sorry for the delay. Today I tried pinning SCM 1.1.0-beta.7 to see what happened. Unfortunately, with the default retry policy, this doesn't actually help. Looking at the code for GetNextDelay, I think the issue is that tryCount is 0, so nextDelayMilliseconds ends up being 1 day, and retryAfter.TotalMilliseconds is smaller than nextDelayMilliseconds.

@Pilchie Pilchie reopened this Sep 10, 2024
@Pilchie (Member) commented Sep 10, 2024

Never mind - I am getting an x-ms-retry-after of 1 day, so that seems OK.

I think there might still be an issue with message.RetryCount being zero leading to a negative time to wait, which means there will be an immediate retry the first time, but that's not as big an issue to me.

@annelo-msft (Member, Author)

Hi @Pilchie -- x-ms-retry-after is an Azure-specific header, so it is not supported by SCM by default; SCM provides general HTTP support but not Azure-specific features. For Azure.AI.OpenAI, @trrwilson was adding capabilities to that client to support the Azure-specific features -- I don't know the current status of that work, i.e. whether that PR has been merged and which client version includes it. @trrwilson, could you share that info?

I think there might still be an issue with message.RetryCount being zero leading to a negative time to wait, which means there will be an immediate retry the first time, but that's not as big an issue to me.

This is interesting, and if there's a bug in SCM, I'd love to know about it so we can get it fixed! Would you be willing to share a quick repro case for this?

@Pilchie (Member) commented Sep 11, 2024

Interesting - I'm not sure how the delay got set to 1 day. Maybe both headers were sent. As for the repro, I was using the code below, with an Azure OpenAI endpoint set to the minimum rate limit:

using Azure.AI.OpenAI;
using OpenAI.Embeddings;
using System.ClientModel;
using System.ClientModel.Primitives;

const string endpointEnvironmentVariableName = "AZURE_OPENAI_ENDPOINT";
const string apiKeyEnvironmentVariableName = "AZURE_OPENAI_APIKEY";

var endpoint = Environment.GetEnvironmentVariable(endpointEnvironmentVariableName) ?? throw new Exception($"Set '{endpointEnvironmentVariableName}'.");
var apiKey = Environment.GetEnvironmentVariable(apiKeyEnvironmentVariableName) ?? throw new Exception($"Set '{apiKeyEnvironmentVariableName}'.");

// Use the SCM ClientRetryPolicy explicitly so its Retry-After handling is in play.
var client = new AzureOpenAIClient(
    new Uri(endpoint),
    new ApiKeyCredential(apiKey),
    new AzureOpenAIClientOptions
    {
        RetryPolicy = new ClientRetryPolicy(),
    });

var embeddingClient = client.GetEmbeddingClient("text-embedding-3-large");

const int count = 100;
var inputs = new List<string>(count);
for (int i = 0; i < count; i++)
{
    inputs.Add($"This is a string to embed. It is number {i}.");
}
var result = embeddingClient.GenerateEmbeddings(inputs);

Console.WriteLine(result.GetRawResponse().Status);

Note that I was unable to get a breakpoint to hit inside the retry method, or to step into it, so it's possible something else was happening - I could only inspect the state once it was in the 1-day Wait call.

@Pilchie (Member) commented Sep 11, 2024

Confirmed that Retry-After was the header that was sent (in addition to x-ratelimit-reset-tokens and x-ratelimit-remaining-requests).

@annelo-msft (Member, Author)

Hello! Thank you for sharing the details of your use case. I was able to reproduce what you are seeing using the sample you shared: Azure OpenAI is returning a Retry-After header with a value of 86400. This will indeed cause the default ClientRetryPolicy to wait 86400 seconds before retrying the request, and I believe that is the correct behavior according to https://www.rfc-editor.org/rfc/rfc7231#section-7.1.3. If you believe the behavior is incorrect, would you be able to share the specific behavior you expect to see in this case? And/or, if this behavior is problematic for your application, I'm happy to share approaches to customizing the System.ClientModel retry policy to better suit your needs -- let me know if that would be helpful. Thanks!
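For anyone wanting to confirm what the service sends in this scenario, one approach is to turn retries off so the 429 surfaces immediately and then read the header off the raw response. This is a rough sketch only: it assumes ClientRetryPolicy has a max-retry-count constructor and that the throttled call throws ClientResultException -- check both against your package versions. (86400 seconds is 24 hours, which is why the default policy waits a full day.)

// Disable retries so the throttled response surfaces, then inspect Retry-After.
var noRetryClient = new AzureOpenAIClient(
    new Uri(endpoint),
    new ApiKeyCredential(apiKey),
    new AzureOpenAIClientOptions { RetryPolicy = new ClientRetryPolicy(maxRetries: 0) });

try
{
    noRetryClient.GetEmbeddingClient("text-embedding-3-large").GenerateEmbeddings(inputs);
}
catch (ClientResultException ex) when (ex.Status == 429)
{
    if (ex.GetRawResponse()?.Headers.TryGetValue("Retry-After", out string? retryAfter) == true)
    {
        Console.WriteLine($"Retry-After: {retryAfter}"); // e.g. "86400" (seconds)
    }
}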

@annelo-msft (Member, Author) commented Oct 11, 2024

And actually, since this appears to be the correct behavior that this issue was tracking, I am going to close this issue. @Pilchie, feel free to open another issue in the repo if you believe there is a bug in ClientRetryPolicy, or to request details on how to customize the retry policy in your application. Thanks!

@Pilchie (Member) commented Oct 12, 2024

I think it's technically correct, but practically infeasible. Having an application unable to make a request for 24 hours as soon as it edges over the quota isn't workable.

I get that the issue is really with Azure OpenAI returning an impractical value in Retry-After, but I think if you ship a version of the Azure OpenAI SDK that honors this value, you will effectively break many applications that are working today.

@annelo-msft (Member, Author)

@Pilchie, I see your point about hanging an application that gets a Retry-After value of 24 hours from a service. I agree that clients should somehow be robust to services returning very long Retry-After values. What would you expect the client behavior to be in this case?

The model in these clients is typically a simple request/response paradigm -- i.e. a service method does not return until the service has returned a response that the client can deserialize into a valid return type. It feels like the only option we have to notify the caller that they must either suspend the client thread for 24 hours or try a different approach is to throw an exception. If we threw, we could notify callers that some threshold had been exceeded, or would be exceeded if they decided to wait. This would be a breaking change if we introduced the threshold without an explicit opt-in, so I believe we would want to let the caller specify the threshold somehow as a way of opting in to that behavior. Would that be consistent with your expectations for client behavior? Or is there a different behavior you would expect to see in such a case?

@Pilchie (Member) commented Oct 14, 2024

I agree that this doesn't put you in a good position - it would be nicer if you could rely on services to provide useful values for Retry-After. I also don't think that throwing is a great developer experience. Perhaps an alternative would be to allow the user to optionally specify a maximum time to wait before retrying, or a maximum Retry-After value to consider valid?

@annelo-msft (Member, Author) commented Oct 14, 2024

Yes, I was thinking about this as well. We have MaxDelay in Azure.Core, but that implementation will use the delay from a service's Retry-After header even if it is larger than the user-specified MaxDelay, which makes me think it would be better to provide a separate option for specifying a maximum Retry-After value, or some timeout threshold for retry delays.

The question still stands, though -- if this threshold is exceeded, how does the client communicate that to the caller? The contract for the method you're calling in your sample, GenerateEmbeddings, is that it returns a valid ClientResult<OpenAIEmbedding> when it completes. If the service hasn't sent the response content to deserialize by the time the threshold is exceeded, I think the only thing the client can do is throw an exception.
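As an illustration of the kind of cap being discussed, a subclass of ClientRetryPolicy could clamp whatever delay the base policy computes (including a delay derived from Retry-After) to a caller-chosen maximum. This is a sketch only: it relies on the protected virtual GetNextDelay hook mentioned earlier in the thread, so verify the exact signature against your System.ClientModel version, and the type name is made up.

using System;
using System.ClientModel.Primitives;

// Sketch: cap all retry delays, including ones derived from Retry-After.
public class CappedDelayRetryPolicy : ClientRetryPolicy
{
    private readonly TimeSpan _maxDelay;

    public CappedDelayRetryPolicy(TimeSpan maxDelay) => _maxDelay = maxDelay;

    protected override TimeSpan GetNextDelay(PipelineMessage message, int tryCount)
    {
        TimeSpan delay = base.GetNextDelay(message, tryCount);
        return delay > _maxDelay ? _maxDelay : delay;
    }
}

// Usage: options.RetryPolicy = new CappedDelayRetryPolicy(TimeSpan.FromMinutes(1));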

@Pilchie (Member) commented Oct 14, 2024

Well, my experience with Azure OpenAI is that even if they return a Retry-After of 1 day, the call works if you retry after 1 minute. So I would expect that if a hypothetical new maximum were exceeded, you would retry anyway, even though the full day hasn't elapsed, and thus the app would continue to work because the next retry would be successful.

@annelo-msft (Member, Author) commented Oct 15, 2024

even if they return a Retry-After of 1 day, the call works if you retry after 1 minute

Ah, I see. I did not realize that you are looking to retry the request before the service's recommended wait time has completed.

I don't think SCM types should make this easy for users. Services typically set the Retry-After value for a reason, and retrying before that time is expected to fail. Moreover, retrying before that time could be harmful to the service and is not recommended.

It sounds like there are a couple things happening here, which would need to be addressed independently:

  1. If you believe the Azure OpenAI service is returning an incorrect Retry-After header value, please file an issue with the service team so they can address it. Let me know if you need help tracking down the right way to file such an issue.
  2. If you would like to cancel the service call before the service returns a success response, I recommend passing a CancellationToken to the service method (GenerateEmbeddings) and cancelling the operation as needed to prevent your app from hanging; a sketch of this follows below.
     - If you want to encode a different retry behavior in your client based on undocumented service behaviors, SCM-based clients will not prevent this -- you can always write a custom retry policy and replace the client's default by setting an instance of your policy on client options. Again, this is not recommended; it's just a note on what's possible.
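A minimal sketch of the CancellationToken approach from item 2, reusing embeddingClient and inputs from the earlier repro; it assumes GenerateEmbeddings exposes an overload accepting a CancellationToken, as suggested above:

// Bound how long the call (including any retry waits) is allowed to take.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));

try
{
    var result = embeddingClient.GenerateEmbeddings(inputs, cancellationToken: cts.Token);
    Console.WriteLine(result.GetRawResponse().Status);
}
catch (OperationCanceledException)
{
    Console.WriteLine("Cancelled after waiting 2 minutes for the service.");
}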

Thanks!

@Pilchie (Member) commented Oct 15, 2024

I understand why you don't want to make it easy, but practically, what do you actually expect users to do in the case of buggy services like this, which clearly respond well before the Retry-After they send? I expect that as soon as this ships and is taken up by customers, many of them will want the same. It's good that it's possible via cancellation token, and we'll put that in place, but I expect you'll want to document it as well.

Do you have any pointers on how/where to report an issue with the service for 1?

@annelo-msft (Member, Author)

It's good that it's possible via cancellation token

Great!

Do you have any pointers on how/where to report an issue with the service for 1?

I sent you some offline. Thanks for your engagement on this, and thanks for helping us make our Azure SDK clients great!

@github-actions github-actions bot locked and limited conversation to collaborators Jan 14, 2025