Handling "poison" messages in event hubs #18850
Thank you for your feedback. Tagging and routing to the team best able to assist.
Hi @remoba. Thank you for reaching out and we regret that you're experiencing difficulties. With respect to dealing with exceptions within your handlers, our guidance is documented here. The most relevant part is:
The behavior that you're describing is expected. When your exception is thrown, it goes unhandled and, in this case, crashes the task processing that partition. When the processor restarts the partition processing task, it reads the latest checkpoint and begins from the next event after it. If you want to treat a message as poison, you'll need to be sure to create a checkpoint based on it so that it is skipped. I would strongly advise not letting that exception bubble, as the behavior is undefined and will vary by your host process, runtime version, and other environmental factors. Handle it in whatever way makes sense for your application, and let the processor continue on. There is a base class, EventProcessor<TPartition>, in the Azure.Messaging.EventHubs.Primitives namespace intended for building your own processor. This blog article walks through creating a custom processor that is similar to the EventProcessorClient.
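For a rough sense of what building on that base class looks like, here is a minimal sketch of a batch-oriented subclass. It assumes the EventProcessor<EventProcessorPartition> shape from the Azure.Messaging.EventHubs.Primitives namespace; the exact abstract members and signatures vary across SDK versions, the storage-backed overrides are stubbed rather than implemented, and ApplyBusinessLogicAsync is a placeholder.

public class BatchProcessor : EventProcessor<EventProcessorPartition>
{
    public BatchProcessor(int eventBatchMaximumCount, string consumerGroup, string connectionString, string eventHubName)
        : base(eventBatchMaximumCount, consumerGroup, connectionString, eventHubName)
    {
    }

    protected override async Task OnProcessingEventBatchAsync(IEnumerable<EventData> events, EventProcessorPartition partition, CancellationToken cancellationToken)
    {
        // The full batch is visible here, so whether to checkpoint can be
        // decided per batch rather than per event.
        foreach (var eventData in events)
        {
            await ApplyBusinessLogicAsync(eventData);
        }
    }

    protected override Task OnProcessingErrorAsync(Exception exception, EventProcessorPartition partition, string operationDescription, CancellationToken cancellationToken)
    {
        // Surfaces errors internal to the processor's own operations.
        return Task.CompletedTask;
    }

    // Ownership and checkpoint persistence must be implemented against
    // durable storage for a real processor; stubbed here for brevity.
    protected override Task<IEnumerable<EventProcessorPartitionOwnership>> ListOwnershipAsync(CancellationToken cancellationToken) =>
        throw new NotImplementedException();

    protected override Task<IEnumerable<EventProcessorPartitionOwnership>> ClaimOwnershipAsync(IEnumerable<EventProcessorPartitionOwnership> desiredOwnership, CancellationToken cancellationToken) =>
        throw new NotImplementedException();

    protected override Task<IEnumerable<EventProcessorCheckpoint>> ListCheckpointsAsync(CancellationToken cancellationToken) =>
        throw new NotImplementedException();
}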
Hey @jsquire, thanks for the quick response. If I am reading this right, then I am basically never supposed to throw an exception. If my goal here is to not let any message go unprocessed, would you say I should just keep retrying that single message's processing forever?
Correct; none of the code used in your event handlers should throw unless you're confident in your host environment's behavior and are comfortable trusting it.
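For instance, a handler that retries forever without letting anything escape might look like this minimal sketch (the five-second backoff is an arbitrary choice, and ApplyBusinessLogicAsync is the same placeholder used in the example below):

private static async Task ProcessEventHandler(ProcessEventArgs eventArgs)
{
    // If there was no event, there's nothing to do; only applicable
    // if a max wait time was specified.
    if (!eventArgs.HasEvent)
    {
        return;
    }

    // Until this handler returns, no other event from this partition
    // will be dispatched, so retrying in place blocks the partition.
    while (true)
    {
        try
        {
            await ApplyBusinessLogicAsync(eventArgs.Data);
            await eventArgs.UpdateCheckpointAsync();
            return;
        }
        catch (Exception)
        {
            // Log as appropriate, then back off before retrying; nothing
            // is allowed to bubble out of the handler.
            await Task.Delay(TimeSpan.FromSeconds(5));
        }
    }
}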
That kind of depends on your definition of "any message" and the sensitivity to ordering in your application. My assumption was that the poison message was something that you could not process and that you wanted to do some form of dead-lettering before moving on to the next message. In my head, the barebones form looks something like:

private static async Task ProcessEventHandler(ProcessEventArgs eventArgs)
{
    try
    {
        // If there was no event, there's nothing to do; only
        // applicable if a max wait time was specified.
        if (!eventArgs.HasEvent)
        {
            return;
        }

        // Run your business logic, including any retries. If you
        // identify the message as poison, throw a specific exception to
        // denote it.
        await ApplyBusinessLogicAsync(eventArgs.Data);
    }
    catch (PoisonMessageException)
    {
        // Push the event to some form of storage to dead-letter it.
        await SendToDeadLetterAsync(eventArgs.Data);

        // Checkpoint this event so that we don't see it again.
        await eventArgs.UpdateCheckpointAsync();
    }
    catch (Exception ex)
    {
        // Decide here if the unexpected exception was safe to continue
        // from, or take some action to fail fast. In this case, I'll tear
        // down the process and assume an orchestrator is monitoring to
        // restart.
        if (!TryHandleProcessingException(eventArgs, ex))
        {
            Environment.Exit(-1);
        }
    }
}

If you're looking to stop the world on a poison message, it gets a bit more involved, since you can't directly call StopProcessingAsync from within an event handler:

public class AppHost
{
    private volatile bool allowProcessingEvents = true;

    // Assumed to be an initialized EventProcessorClient; its creation is
    // not relevant for this sample.
    private EventProcessorClient processor;

    public async Task Run()
    {
        // This token is used by the host to control when to stop processing.
        using var hostCancellationSource = new CancellationTokenSource();

        async Task processEventHandler(ProcessEventArgs eventArgs)
        {
            try
            {
                if ((eventArgs.CancellationToken.IsCancellationRequested)
                    || (!allowProcessingEvents)
                    || (!eventArgs.HasEvent))
                {
                    return;
                }

                // The cancellation token from the args is deliberately not
                // honored when applying logic here. Because timing is
                // non-deterministic, you may see events dispatched while
                // cancellation is taking place as the processor stops.
                await ApplyBusinessLogicAsync(eventArgs.Data);
            }
            catch (Exception ex)
            {
                // You may or may not want to checkpoint, depending on whether
                // you want to skip this event in the future.
                allowProcessingEvents = false;

                // Cancelling the host source will trigger a request to stop
                // all processing. Additional events may be dispatched during
                // this time.
                hostCancellationSource.Cancel();

                // I'm assuming that you'd want to do some form of
                // exception handling to log.
                HandleProcessingException(eventArgs, ex);
            }
        }

        try
        {
            // The error handler is not relevant for this sample; for
            // illustration, using a fake stub.
            processor.ProcessEventAsync += processEventHandler;
            processor.ProcessErrorAsync += someFakeErrorHandler;

            try
            {
                await processor.StartProcessingAsync(hostCancellationSource.Token);
                await Task.Delay(Timeout.Infinite, hostCancellationSource.Token);
            }
            catch (TaskCanceledException)
            {
                // This is expected when the token is cancelled.
            }
            finally
            {
                await processor.StopProcessingAsync();
            }
        }
        catch
        {
            // If this block is invoked, then something external to the
            // processor was the source of the exception.
        }
        finally
        {
            processor.ProcessEventAsync -= processEventHandler;
            processor.ProcessErrorAsync -= someFakeErrorHandler;
        }
    }
}
Not in the same manner. The legacy SDK passed exceptions from developer-provided code to the error handler of the EventProcessorHost. When we were in the design phase for the new library, we found that it was not common for developers to implement the optional error handler and perform a state consistency check as part of that handler. That led to exceptions going unobserved and to unexpected behavior during processing that was potentially causing data corruption while the processor continued on. Because the processor lacks enough knowledge to understand whether it is safe to continue in the face of an error in developer-provided code, we decided it was better to fail fast and allow the application host environment to control exception behavior. For most host environments, the task responsible for processing the partition is faulted, and partition processing crashes and is restarted. To be transparent, this was a difficult decision with a fair amount of contention. My personal preference would have been to provide more determinism and guarantee the "just crash the partition task" behavior, but there were other viewpoints advocating to allow container-based hosts to crash the process and let the orchestrator maintain control.
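As a point of reference, a minimal sketch of the new library's error handler is below; it observes only exceptions from the processor's own infrastructure operations (load balancing, storage, connections), never exceptions thrown by your event handler. The logging is a placeholder.

private static Task ProcessErrorHandler(ProcessErrorEventArgs args)
{
    // Exceptions thrown by ProcessEventAsync handlers never arrive here;
    // only failures in the processor's own operations do.
    Console.WriteLine($"Error during '{args.Operation}' on partition '{args.PartitionId}': {args.Exception.Message}");
    return Task.CompletedTask;
}

// Registered the same way as the event handler:
// processor.ProcessErrorAsync += ProcessErrorHandler;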
@jsquire Thank you very much for the detailed response. Going with the "retry-forever" approach, is there any risk of having my checkpoint blob lease expire during my retries? I would essentially not be updating the checkpoint for a very long time, right? So if I retry such a message for 5 minutes (because my dead-letter queue mechanism is experiencing an outage), can the checkpoint update fail when I eventually do make it?
The TL;DR version is that you're safe to retry forever. Until your handler returns, no other events from that partition will be dispatched for processing, and no other processor will attempt to steal that partition. An important callout is that your handler will still be called concurrently for events in other partitions, but only one event is ever being processed from any given partition at a time. One of the things that we did in the new version of the SDK was to split the concepts of checkpointing and ownership. Partition ownership is managed by the processor as part of its load balancing cycles, and ownership timestamps continue to be refreshed regularly regardless of whether you're writing checkpoints. The load balancing algorithm has also changed quite a bit to prevent the "partition bouncing" that occurred when processors stole from one another. In the new approach, there are only two scenarios where you'll see a partition stolen:
Thanks @jsquire, that clarifies things quite a bit.
@jsquire Regarding #1, so is this flow a possibility? Just to verify I understood correctly.
Hi @remoba. You're correct; that flow is entirely possible. The big call-out for me is what you've numbered 3 and 4, since the ordering could be transposed and could "rewind" the checkpoint to an earlier point in time. That would potentially lead to further duplication if additional scaling or a processor crash happens at that moment. This is an unfortunate side-effect of trading a more complicated and robust load balancing mechanism for one that is simpler, lighter weight, and more performant. This is the part where I offer a gentle reminder that it is highly recommended you ensure your processing is resilient to event duplication in whatever way is appropriate for your application scenarios. Because Event Hubs itself offers an at-least-once delivery guarantee, this is important even in cases where you're not using the processor or have extended it with a more robust load balancing mechanism.
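One simple form of that resilience is sketched below: track the highest sequence number processed for each partition and skip anything at or below it. The in-memory ConcurrentDictionary is an assumption for illustration; state shared across processor instances would need durable storage, and ApplyBusinessLogicAsync is the placeholder from the earlier examples.

// Requires System.Collections.Concurrent.
private static readonly ConcurrentDictionary<string, long> lastProcessedSequence = new ConcurrentDictionary<string, long>();

private static async Task ProcessEventHandler(ProcessEventArgs eventArgs)
{
    if (!eventArgs.HasEvent)
    {
        return;
    }

    var partitionId = eventArgs.Partition.PartitionId;
    var sequenceNumber = eventArgs.Data.SequenceNumber;

    // Skip duplicates delivered after a checkpoint rewind or a rebalance.
    if (lastProcessedSequence.TryGetValue(partitionId, out var last) && sequenceNumber <= last)
    {
        return;
    }

    await ApplyBusinessLogicAsync(eventArgs.Data);
    lastProcessedSequence[partitionId] = sequenceNumber;
    await eventArgs.UpdateCheckpointAsync();
}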
@jsquire Gotcha, to clarify: I wasn't raising this as a problem, just wanted to verify that the checkpoint will indeed successfully update for both instances in this race. Thank you very much for all of your time and answers; I will be closing this thread now.
My pleasure; please feel free to reach out in the future if you run into anything further. |
Query/Question
I am trying to figure out how the Event Hubs SDK responds to unhandled exceptions. My goal is to STOP processing any more events from a given partition if I fail to process a single event in it. I know Event Hubs is probably not really suited for this, but I cannot tolerate any data loss in this data pipeline.
I've tried running the sample receiver in a local environment, and I've created a storage account and an event hub with 2 partitions. Then I sent 3 messages to that event hub's first partition (Partition 0).
My full code is below (with the resource identifiers and secrets removed):
I am trying to simulate a case where message #2 (SequenceNumber == 1) fails during processing, and throws an unhandled exception. I also only update the checkpoint after message #3 (SequenceNumber == 2) has finished processing.
The behavior I see when I run this sample locally is:
If I remove the checkpoint update entirely, I instead see the following behavior:
My problem here is that message #3 is being received after I've already thrown an exception for message #2. I can't simply update the checkpoint without risking data loss, but if I don't, I get message #3 again and again even though I've already processed it.
If my understanding of the code is correct, this is happening because the EventProcessorClient class is receiving batches but not exposing them. Instead, the events trigger my function one by one, and unhandled exceptions are only acknowledged after the whole batch has been processed, which in my case is too late.
Is there a different class I can use that will let me process an entire batch at a time instead of a single message? That way I could decide per batch that I do not want to update the checkpoint, and I will keep receiving the same batch over and over again.
I saw there's a generic EventProcessor under the Primitives namespace, but seeing as everything is internal I can't really implement it myself.
Environment: Windows 10, .NET Core 3.1