Batch Processing - Why does processing continue when there is an error? #1784
I'm trying to add the Batch Processing capability to our project and to understand the error handling for batches. As a source we consume a DynamoDB stream. Based on this graphic in the docs, that's our "architecture":

```mermaid
graph LR
  stream --batch_of_items--> lambda --put_message--> SQS
```

Given we have a batch size of 10: the handler reports back that item 3 has failed, and the checkpoint moves to item 3. The next batch to be processed will then contain items 3-10. Item 3 will fail again while items 4-10 are processed successfully, with the consequence that records 4-10 are put on the queue a second time. If item 3 keeps failing, is the stream processing on hold, retrying the same items over and over? Is that correct? Do I need to implement idempotency together with partial processing to reduce the message duplication? Or is a DLQ a solution for this problem, so that item 3 is moved to the DLQ and the checkpoint can move forward?
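To make the setup concrete, our record handler does roughly the following (a sketch; the queue URL and payload shape are placeholders):

```ts
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import type { DynamoDBRecord } from 'aws-lambda';

const sqs = new SQSClient({});

// Forwards each DynamoDB stream record to SQS ("put_message" in the graph above)
const recordHandler = async (record: DynamoDBRecord): Promise<void> => {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.QUEUE_URL, // placeholder
      MessageBody: JSON.stringify(record.dynamodb?.NewImage),
    })
  );
};
```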
Hi @RaphaelManke thanks for creating this discussion.

The Batch Processing utility hinges on the feature that allows AWS Lambda to report partial failures to its trigger service when processing a batch of items. When you set a Lambda function to be triggered by SQS messages (feature docs), Kinesis Stream items (feature docs), or DynamoDB Stream items (feature docs), you can configure it to report partial failures in this batch. This signals to the Lambda service that one or more items that were marked as failed should be put back into the source and potentially retried later.

For example, let's take the (simplified) scenario below. For simplicity we'll assume that we have 2 sequential batches that trigger one single function. The first batch is composed of items with ids 1 to 5; these items are the batch used to invoke the Lambda handler. The handler uses the Batch Processing utility to call the record handler function once per item. For this example we'll assume that when processing items 1 & 2 the record handler completes successfully, while item 3 throws an error. When you throw an error within your record handler function, the Batch Processing utility catches it and marks that item as failed.

At the end of the batch, the utility creates an object with this shape:

```ts
{
  batchItemFailures: [
    {
      itemIdentifier: "3"
    }
  ],
};
```

This response tells the Lambda service to take the item with identifier 3 and put it back into the stream. If none of the items had failed to process, the response object would instead be:

```ts
{
  batchItemFailures: []
};
```

This tells Lambda that all items were processed successfully, and as such they can be removed from the source.
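Concretely, wiring the utility into a DynamoDB stream handler looks roughly like the sketch below (the empty record handler body is a placeholder for your own logic, e.g. the `put_message` to SQS):

```ts
import {
  BatchProcessor,
  EventType,
  processPartialResponse,
} from '@aws-lambda-powertools/batch';
import type { DynamoDBRecord, DynamoDBStreamHandler } from 'aws-lambda';

// One processor instance, reused across invocations
const processor = new BatchProcessor(EventType.DynamoDBStreams);

// Called once per stream record; throwing here marks that record as failed
const recordHandler = async (record: DynamoDBRecord): Promise<void> => {
  // your processing logic, e.g. send a message to SQS
};

export const handler: DynamoDBStreamHandler = async (event, context) =>
  // builds and returns the batchItemFailures response shown above
  processPartialResponse(event, recordHandler, processor, { context });
```

One detail worth noting: for DynamoDB streams the `itemIdentifier` reported back is the record's sequence number; the plain `"3"` above is a simplification for the example.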
What happens next to the item with identifier 3 depends entirely on how you have configured your function trigger integration. If you have enabled retries, the item will be retried (i.e. sent to the function in a subsequent batch) up to the specified number of retries. If you have set up a Dead Letter Queue, the item will be removed from the source and put into that queue once it exceeds the number of retries.

For the sake of the example, we'll assume that this is the first time the function has "seen" that item and that retries are enabled. In this case the item will be part of a subsequent batch, together with never-seen-before items. All the items from the original batch that did not fail to process do not go back to the source.

Idempotency per se is not a requirement; the number of times an item is seen by your function depends entirely on the characteristics of the source (i.e. does your source guarantee exactly-once delivery) and the retry configuration of your function trigger.

So to sum up:

- Throwing inside the record handler marks only that item as failed; the rest of the batch counts as processed and is removed from the source.
- Whether and how often a failed item is retried is decided by the trigger configuration (retries, DLQ), not by the utility itself.
- Idempotency is not strictly required, but how often an item is delivered depends on the source's delivery guarantees and your retry settings.
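If it helps to see where the retry and DLQ knobs live, here is a rough sketch of the trigger configuration with the AWS CDK (the construct names are real, but the values and the surrounding stack are assumptions, not a recommendation):

```ts
import { StartingPosition } from 'aws-cdk-lib/aws-lambda';
import { DynamoEventSource, SqsDlq } from 'aws-cdk-lib/aws-lambda-event-sources';
import { Queue } from 'aws-cdk-lib/aws-sqs';

// Inside a Stack, assuming `table` has a stream enabled and `fn` is the function
const dlq = new Queue(this, 'StreamDlq');

fn.addEventSource(
  new DynamoEventSource(table, {
    startingPosition: StartingPosition.TRIM_HORIZON,
    batchSize: 10,
    // enables the partial-failure reporting the utility relies on
    reportBatchItemFailures: true,
    // retry a failed record a bounded number of times...
    retryAttempts: 3,
    // ...then move it aside so the checkpoint can advance
    onFailure: new SqsDlq(dlq),
  })
);
```

With a setup like this, the scenario from the question (item 3 failing forever) resolves itself after `retryAttempts` retries: the record lands in the DLQ and the stream checkpoint moves past it.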
I hope this clarifies a bit how this is supposed to work; if not, please let me know and I'll try again 😃