
Expose current retry count in the context of the function #2595

Closed
jeffhollan opened this issue Sep 30, 2020 · 11 comments

jeffhollan commented Sep 30, 2020

With the new retry feature rolling out, one desired scenario is the ability to dead-letter or otherwise capture something that is on its final retry. For example, if I have an Event Hub-triggered function and define a retry policy of 5, and the 5th retry fails, I want to catch that failure and store it in a dead-letter queue or similar so I can inspect it later. However, today the current retry count isn't surfaced in the context, and using something like a local variable to increment may be tricky when multiple executions could be executing or retrying on the same host.

The proposal is that retry information is passed in on the ExecutionContext:

try
{
    // process the trigger payload
}
catch (Exception ex)
{
    if (context.retries.count >= context.retries.max)
    {
        // final attempt: dead-letter the payload for later inspection
    }
    throw;
}

// @pragnagopa @fabiocav @mathewc

@ghost ghost assigned fabiocav Sep 30, 2020
jeffhollan (Author)

And this would need to be piped into the Java, Python, C#, PowerShell, and JavaScript context objects; I'm not sure whether additional work is needed in the language libraries to support this.

@jeffhollan jeffhollan added this to the Triaged milestone Sep 30, 2020
@pragnagopa pragnagopa assigned pragnagopa and unassigned fabiocav Sep 30, 2020
pragnagopa (Member) commented Sep 30, 2020

I agree that the retry count is useful to include in the execution context. Once this change is pulled into the Functions host, the RPC layer needs to be updated to include this information, and the language workers need to be updated to consume the RPC change.

jeffhollan (Author)

FWIW, I think just the retry count would be good, and retry count plus the defined maximum slightly better; but in the best case we pass in the entire RetryContext, including info on the last exception that was hit.

mathewc (Member) commented Oct 1, 2020

Yes, we might surface RetryContext on ExecutionContext. We'd just have to add MaxRetries to RetryContext - it's not there now. Not sure yet if we should surface that raw type, or have a new type with the members needed. E.g. RetryContext currently has an IFunctionInstance member on it that we might not want to expose here. Need to think about it more.

Another important thing: we need to be sure users understand that there may actually be 2 levels of retries going on. In addition to the retries we're talking about (e.g. retry a single queue message dispatch N times), the queue trigger binding itself has a MaxDequeueCount, which is the second level of retries. This is only an issue for triggers with retries already built in (Queue/Service Bus). So applying this to your example above, you'd need to check both RetryCount and DequeueCount :)
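To illustrate the two-level check (a sketch only; names like retry_count and dequeue_count are placeholders for whatever the final API exposes, not the actual properties), a dead-letter decision for a queue trigger would need to consider both levels before concluding an attempt is truly the last one:

```python
def is_final_attempt(retry_count, max_retry_count,
                     dequeue_count, max_dequeue_count):
    """Return True only when both the host-level retries and the
    trigger's own dequeue-based retries are exhausted."""
    host_retries_exhausted = retry_count >= max_retry_count
    dequeues_exhausted = dequeue_count >= max_dequeue_count
    return host_retries_exhausted and dequeues_exhausted

# Host retries exhausted, but the queue will redeliver: not final.
print(is_final_attempt(5, 5, 2, 5))  # False
# Both levels exhausted: safe to dead-letter manually.
print(is_final_attempt(5, 5, 5, 5))  # True
```

Checking only the host-level retry count here would dead-letter the message several deliveries too early.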

jeffhollan (Author)

Moving comment from @casper-79 here: Azure/azure-functions-java-library#132 (comment)

I have tested the retry functionality today and seen the exponential retry strategy in action. I am also seeing some very strange behaviour, however. As I understand the documentation, the retry strategy is implemented on the function instance itself, rather than storing the delivery state on the queue, and I believe I am seeing side effects of this approach. My experiments center around submitting poison messages that will always fail onto a queue consumed by a Java Azure Function. The function uses a retry strategy defined in host.json, as seen below:

"retry": {
    "strategy": "exponentialBackoff",
    "maxRetryCount": 6,
    "minimumInterval": "00:00:10",
    "maximumInterval": "00:05:00"
}
(1) Processing of poison messages does not always show up in Application Insights or the "monitor" section of Azure Functions. When I use the Azure portal to peek at test messages I can tell DeliveryCount has gone up by 1, but more often than not there is no trace of the failed execution that increased the counter.

(2) Azure Functions instances are short-lived, thereby limiting the useful range of the retry configuration parameters. Can you provide guidance on what will work in practice? I am guessing you will run into problems if you set maximumInterval to 24 hours and maxRetryCount to 30 in my host.json?

(3) What is the recommended approach for dead-lettering? The only solution I can think of is to set maxDeliveryCount=1 on the queue, but this will only work if all retry attempts of your strategy can be performed within the typical lifetime of an instance. Otherwise, I guess the message will be retried forever.

jeffhollan (Author)

(we may need to break out a few of these other issues but replying in bulk here):

One thing worth noting is that the host retry policy layers on top of the trigger source - it doesn't know anything about the trigger source directly. For Event Hubs this is easy, because Event Hubs itself has no retry policy, so the host policy is the only one. Queues, however, are a bit more interesting. Let's imagine you have a Service Bus queue with a Service Bus retry policy of 5 delivery attempts, and you then define a function app retry policy of 5 attempts on top of it. What will happen is:

  1. Pulls the queue message from Service Bus (attempt 1 from Service Bus)
  2. Fails 5 times on the host (Service Bus knows nothing about these 5 attempts; only the Functions host does)
  3. After the 5th time, it abandons the message, so Service Bus now knows it eventually failed, and requeues it
  4. Pulls the queue message from Service Bus (attempt 2 from Service Bus)
  5. And so on, until eventually the message has been attempted 25 times, at which point Service Bus will dead-letter it.
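The multiplication in the steps above can be sketched as a deterministic toy model (not the actual host or Service Bus, just the counting): each Service Bus delivery triggers a full round of host-level retries before the message is abandoned and redelivered.

```python
def total_attempts(host_attempts_per_delivery, max_delivery_count):
    """Count function executions for a message that always fails:
    each Service Bus delivery is attempted host_attempts_per_delivery
    times by the host before the message is abandoned and requeued."""
    attempts = 0
    for _delivery in range(max_delivery_count):       # Service Bus deliveries
        for _retry in range(host_attempts_per_delivery):  # host-level attempts
            attempts += 1                             # function executes, fails
    # after max_delivery_count failed deliveries, Service Bus dead-letters it
    return attempts

print(total_attempts(5, 5))  # 25
```

This is why the two policies multiply rather than add: tuning either one changes the total number of executions before the message finally lands in the dead-letter queue.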

So when you mention in (1) that you are looking at the DeliveryCount: the delivery count on the queue message will be the same between host retries, because it's the same queue message being retried.

You are right on (2) that if you have long delays on something like a timer trigger, it's possible you'll get scaled down. There is some logic in the scale controller that looks at queue length and execution logs; we are investigating it to create clear guidance, but in general I'd say on the Consumption plan you likely don't want a delay longer than about 5 minutes (10 may be safe too; longer than 10 and you are likely at risk of scale-down). I don't believe the number of retry attempts will matter, but I'll try to test this out as well.

For the third point, that's where this issue sits. Given the above, I'd say you likely don't want to set maxDeliveryCount to 1, just in case something happens. We are planning to pass in the retry context, so you could have logic like the original comment in this issue, which could help as well.

casper-79

Thanks @jeffhollan

Your description matches my own initial expectations, but even though I looked for signs of steps 4/5 I have yet to observe this behaviour. It seems failed messages are just stuck on the queue after the first failure in step 3.

When is step 4 supposed to take place? Up until now, retries at the queue level have happened immediately, so this is what I have been looking for.

Best regards,

Casper

alrod (Member) commented Feb 3, 2021

merged #2658

ghost commented Apr 27, 2021

Has this been implemented in C# yet? If so, is there a link to documentation or some other resource for it? I see Node.js, Java, and PowerShell so far. We are trying to implement some final-retry logic but have been unsuccessful through what we've found here and the links within. We currently have a private static retryCount = 0 that we increment on every retry and reset to 0 on the final retry or on success, which I believe makes it somewhat stateful in a REST environment, since it holds the value between retries and calls.
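The hazard with a shared static counter like this can be shown with a toy interleaving (hypothetical names; this just simulates the retries of two messages overlapping on the same host, which is exactly the case the original proposal calls out):

```python
shared_retry_count = 0  # simulates a static field shared by all invocations

def invoke(message_id, log):
    """One (failing) execution that bumps the shared counter."""
    global shared_retry_count
    shared_retry_count += 1
    log.append((message_id, shared_retry_count))

log = []
# Retries of message A and message B interleave on the same host:
for attempt in range(3):
    invoke("A", log)
    invoke("B", log)

# Message A observes counts 1, 3, 5 instead of 1, 2, 3 --
# the shared counter cannot tell the two retry sequences apart.
print([count for msg, count in log if msg == "A"])  # [1, 3, 5]
```

A per-invocation retry count supplied by the host (the subject of this issue) avoids this, because each execution receives its own value instead of reading mutable process-wide state.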

fabiocav (Member)

@LockpickingDev this should be available for C# in-proc. Is that what you're currently using?

ghost commented Apr 29, 2021

All we're doing is taking in a JSON request from an external source and storing it. There's really not much to it. After reading, it sounds like that's not in-proc, though.
