
RFC: Function Idempotency Helper #28

Closed
igorlg opened this issue Nov 25, 2020 · 25 comments
@igorlg

igorlg commented Nov 25, 2020

Key information

Summary

Helper to facilitate writing Idempotent Lambda functions.
The developer would specify (via JMESPath) which value from the event will be used as a unique execution identifier, then this helper would search a persistence layer (e.g. DynamoDB) for that ID; if present, get the return value and skip the function execution, otherwise run the function normally and persist the return value + execution ID.

Motivation

Idempotency is a very useful design characteristic of any system. It enables the seamless separation of successful and failed executions, and is particularly useful in Lambdas used by AWS Step Functions. It is also a design principle in the AWS Well-Architected Framework - Serverless Lens.

A broader description of this idea can be found here

Proposal

Define a Python decorator @idempotent which would receive as arguments: a) the JMESPath of the event key to use as execution ID; b) (optional) storage backend configuration (e.g. DynamoDB table name, or ElasticSearch URL + index).

This decorator would wrap the function execution in the following way (pseudo-python):

import functools

import jmespath

from aws_lambda_powertools.idempotent import PersistenceLayer  # hypothetical module

def idempotent(event_key, persistence_config):
  def decorator(func):
    @functools.wraps(func)
    def wrapper(event, context):
      persistence = PersistenceLayer(persistence_config)
      # Extract the unique execution ID from the event via JMESPath
      key = jmespath.search(event_key, event)
      persistence.find(key)

      if persistence.executed_successfully():
        # Already ran successfully: skip execution, return the saved result
        return persistence.result()

      try:
        result = func(event, context)
        persistence.save_success(key, result)
        return result
      except Exception as e:
        persistence.save_error(key, e)
        raise

    return wrapper
  return decorator

Usage then would be similar to:

from aws_lambda_powertools.idempotent import idempotent

@idempotent(event_key='Event.UniqueId', persistence_config='dynamodb://lambda-idemp-table')
def handler(event, context):
  # Normal function code here
  return {'result': 'OK', 'message': 'working'}

The decorator would first extract the unique execution ID from the Lambda event using the provided JMESPath expression, then check the persistence layer for a previous successful execution of the function. If one is found, it would get the previously returned value, de-serialize it (using base64 or something else) and return it instead; otherwise, it would execute the function handler normally, catch the returned object, serialize and persist it, and finally return it.

The persistence layer could be implemented initially with DynamoDB, and either require the DDB table to exist before running the function or create it during the first execution. It should be designed to allow different backends in the future (e.g. Redis for VPC-enabled Lambdas).

Drawbacks

This solution could have noticeable performance impacts on the execution of Lambda functions: every execution would require at least one and at most two accesses to the persistence layer.

No additional dependencies are required - DynamoDB access is provided by boto3, and object serialisation can use Python's native base64 encode/decode.

Rationale and alternatives

  • What other designs have been considered? Why not them?
    No other designs considered at the moment. Open to suggestions.

  • What is the impact of not doing this?
    Implementation of idempotent Lambda functions will have to be done 'manually' in every function.

Unresolved questions

  • How to make the persistence layer access as fast as possible?
  • Which other persistence layers to consider (DynamoDB, ElasticSearch, Redis, MySQL)?
@michaelbrewer
Contributor

Some comments:

  • We should handle duplicate calls and short-circuit with an in-progress error. (This would mean one additional write to create the initial lock and a second write when the task has been completed.) For DynamoDB, we could use a conditional expression to "insert or get" the saved request (see the sketch after this list).
  • We should allow for composite keys, like a request id plus the authenticated user id (e.g. a Cognito sub). This would prevent users from accessing other users' event results.
  • Could we leverage Lambda extensions to offload the saving of the result to DynamoDB?
  • There should be an optional TTL on these records; for example, we cache the result for up to 24 hours.
  • In some cases what we store in DynamoDB might be sensitive, so an optional flag for field-level encryption would help.
  • We might also want to validate whether the same idempotency key has been used for different request parameters. This would prevent misuse of this feature by the end user: either the end user is not generating a unique key, or they are reusing the same key for a different request. To support this we would also optionally store the request parameters.
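As a rough sketch of that conditional "insert or get" locking write, assuming a hypothetical table lambda-idemp-table with partition key id (names and status values here are illustrative, not a final API):

import boto3
from botocore.exceptions import ClientError

# Hypothetical table ("lambda-idemp-table") with partition key "id"
table = boto3.resource("dynamodb").Table("lambda-idemp-table")

def acquire_lock(key):
  try:
    table.put_item(
      Item={"id": key, "status": "INPROGRESS"},
      # Succeeds only if no record exists yet for this key
      ConditionExpression="attribute_not_exists(id)",
    )
    return True   # first call wins the lock: run the handler
  except ClientError as e:
    if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
      return False  # duplicate call: short-circuit with an in-progress error
    raise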

@michaelbrewer
Contributor

Sorry one more thing to consider.

When there is an error, sometimes we might want to allow retrying with the same idempotency key. For example, if the Lambda is calling some downstream service that is temporarily unavailable, calling again with the same idempotency key could make another downstream request. In this case we would want to remove any lock record that might have been created.

However, if the end user passes in an invalid request, we should return a 400-type response and potentially not do a round trip to the persistence layer. This will prevent bad requests from taking up any additional bandwidth.

Otherwise, we could fall back to an in-memory local FIFO cache to serve back just the most recent retries.

I feel like if we can somehow offload the DynamoDB writes to an async task, then we can optimize for the common case of the first request going through.

And we should also consider using metrics or annotations to track cache hits, so that the developer can see how useful this feature really is.

This feature is particularly useful for financial transactions that can take some time to process, where you don't want to double-charge the end user.

@michaelbrewer
Contributor

A couple of features we could pair with this: an idempotent queue. There we want to do some validation and checks before putting a request into DynamoDB or some other queue. Idempotency keys would prevent double submits, and the heavier processing can be handled by another process or Lambda. Having the lock record at the beginning of the request also works well for when our Lambda execution can run much longer than the request timeout of the caller.

As a separate RFC we could have a nonce feature, which is similar to this, but instead we return a single-use token. (But that is for another occasion 😄 )

@igorlg
Author

igorlg commented Nov 25, 2020

Great comments, @michaelbrewer! Some inline replies.

  • We should handle duplicate calls and short-circuit with an in-progress error. (This would mean one additional write to create the initial lock and a second write when the task has been completed.) For DynamoDB, we could use a conditional expression to "insert or get" the saved request.

Yes! I'll amend the flow description; it should be something like: "Start -> Extract Key -> Check DB whether running or completed -> If running, return 'BUSY'; if complete, return saved result; else insert 'RUNNING' -> Continue"
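A minimal sketch of that amended flow, using an in-memory dict as a stand-in for the persistence layer (status values and names are illustrative):

class ExecutionInProgressError(Exception):
  pass

records = {}  # in-memory stand-in for the persistence layer

def run_idempotent(key, func, event, context):
  record = records.get(key)
  if record and record["status"] == "RUNNING":
    raise ExecutionInProgressError(key)   # duplicate in-flight call: return 'BUSY'
  if record and record["status"] == "COMPLETED":
    return record["result"]               # replay: return the saved result
  records[key] = {"status": "RUNNING"}    # insert the lock before executing
  result = func(event, context)
  records[key] = {"status": "COMPLETED", "result": result}
  return result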

  • We should allow for composite keys, like a request id plus the authenticated user id (e.g. a Cognito sub). This would prevent users from accessing other users' event results.

In general terms, the developer needs to provide a uniquely identifiable value for each execution. If that requires a composite key, by all means. I only worry about the complexity of this implementation for v0.1

  • Could we leverage Lambda extensions to offload the saving of the result to DynamoDB?
  • There should be an optional TTL on these records; for example, we cache the result for up to 24 hours.

Agree 100%. This factors into the table design, particularly if we allow the solution to 'create table if not exists'.

  • In some cases what we store in DynamoDB might be sensitive, so an optional flag for field-level encryption would help.

Agree 100% as well. Optional config flag to KMS encrypt fields. Worth mentioning the (possible) additional cost and time in the docs.

  • We might also want to validate whether the same idempotency key has been used for different request parameters. This would prevent misuse of this feature by the end user: either the end user is not generating a unique key, or they are reusing the same key for a different request. To support this we would also optionally store the request parameters.

Not sure I follow. The only requirement for the idempotence key is that it uniquely identifies each execution. If requests are made with different parameters, they should have different idemp. keys...

  • When there is an error, sometimes we might want to allow retrying with the same idempotency key. For example, if the Lambda is calling some downstream service that is temporarily unavailable, calling again with the same idempotency key could make another downstream request. In this case we would want to remove any lock record that might have been created.

Yes. A few options here. If the function throws an Exception, we can:

  • a) Record that information in the DB (e.g. key=xyz, status=FAILED) and then, on the next execution w/ key=xyz, we know to continue running, or
  • b) Delete the record w/ key=xyz, so next time it'll be like a first execution, or
  • c) Expose a few exception types that dictate the behaviour for next executions (retry or not).
  • However, if the end user passes in an invalid request, we should return a 400-type response and potentially not do a round trip to the persistence layer. This will prevent bad requests from taking up any additional bandwidth.

Absolutely, best to catch invalid requests as early as possible.

@michaelbrewer
Contributor

@igorlg - I agree that some of this feedback is based on more advanced use cases.

Use case:
Lambda is part of an API Gateway function used by external developers (or internal developers on a different team), so we would want to protect against usage errors, like reusing idempotency keys for different requests. For V0.01 we could just document that the request id has to be server-generated using a UUID, and that retries should be for exactly the same request.

Use case:
Lambda can be called via an AppSync resolver or API Gateway directly. So, to help prevent a bad user from reading a cached response for another user, use a compound key combining the user-provided fields with authenticated fields. (Again, we could defer this to the documentation.)

Another interesting feature when pairing this with API Gateway would be to return a header like Idempotent-Replayed: true to signify whether the response was a cache hit or a replay.
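A minimal sketch of such a response from a Lambda proxy integration (the helper is hypothetical; the header name follows Stripe's convention):

import json

def build_response(body, replayed):
  # API Gateway Lambda proxy response flagging whether this was a replay
  return {
    "statusCode": 200,
    "headers": {"Idempotent-Replayed": "true" if replayed else "false"},
    "body": json.dumps(body),
  }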

Stripe has a pretty good implementation of idempotency - https://stripe.com/docs/idempotency

@heitorlessa
Contributor

heitorlessa commented Nov 25, 2020

all great comments - my suggestion is to expand the RFC @igorlg to clarify on:

  • Responding to duplicate requests (e.g. a callback function like the batch utility, returning the last cached response, or responding differently depending on the event source)
  • Responding to exceptions (e.g. clear cache, retry/do-not-retry)
  • Standalone function for more control over failure mode
  • Persistence layer should follow the Parameters implementation of Bring Your Own - Implement one, DynamoDB, and leave others for post-launch
  • Separate what's must have vs nice to have
  • Sequence diagram or text (step->step) to quickly visualize steps
  • Don't worry about naming of the parameters or pseudo code - We can handle UX later after we get down to the must haves

As to comments by @michaelbrewer on sensitive fields, composite keys, Extensions and metrics, here's my initial take on:

  • Sensitive fields and Extensions - Customers can customize their persistent layer on inserting and querying the data. In the case of Extensions, they can query filesystem, use IPC, etc.
  • Composite keys - JMESPath allows for multiselect. If the return from the query selection is a Sequence or a Dict, we can serialize it and encode/hash it (see the sketch after this list)
  • Metrics - If we support a callback fn (e.g. idempotency_handler()), customers can use Metrics utility or their own mechanism to track that. We'd call this function when the request is idempotent and when using the decorator, so they'd have full control over the logic, how the response should be returned given their event source (e.g. API GW vs SFN)
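For the composite-key point, a minimal sketch of hashing a JMESPath multiselect result into a single idempotency key (the event shape and paths are illustrative assumptions):

import hashlib
import json

import jmespath

# Illustrative event shape; real events come from API Gateway, AppSync, etc.
event = {"requestId": "abc-123", "identity": {"sub": "user-42"}}

# A multiselect returns a list, which we serialize and hash into one key
selected = jmespath.search("[requestId, identity.sub]", event)
key = hashlib.sha256(json.dumps(selected).encode()).hexdigest()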

@michaelbrewer
Contributor

michaelbrewer commented Nov 26, 2020

@heitorlessa should we allow for an in-memory LRU cache?

Also, for the in-progress errors we might want an overall timeout (or permanent failure), but we can't always assume the original Lambda was able to run to completion, whether due to memory limits or the Lambda timeout.

I think there will be more docs than code for this? And a set of recommendations and warnings about cost.

Stripe wrote a couple of good articles, and I am sure there are AWS ones out there too.

@heitorlessa heitorlessa pinned this issue Nov 27, 2020
@heitorlessa
Contributor

@michaelbrewer quite possibly, along with a TTL and max LRU cache size - a dict fits the purpose here.
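A minimal sketch of such a dict-based LRU cache with TTL (the defaults are illustrative assumptions):

import time
from collections import OrderedDict

class LRUCacheWithTTL:
  def __init__(self, max_items=256, ttl_seconds=3600):
    self.max_items = max_items
    self.ttl_seconds = ttl_seconds
    self._store = OrderedDict()  # key -> (expires_at, value)

  def get(self, key):
    entry = self._store.get(key)
    if entry is None:
      return None
    expires_at, value = entry
    if time.time() > expires_at:
      del self._store[key]          # expired entry counts as a miss
      return None
    self._store.move_to_end(key)    # mark as most recently used
    return value

  def put(self, key, value):
    self._store[key] = (time.time() + self.ttl_seconds, value)
    self._store.move_to_end(key)
    if len(self._store) > self.max_items:
      self._store.popitem(last=False)  # evict the least recently used entry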

Not entirely sure yet, for the upon-exception behaviour, whether we want to bust the LRU cache and call a provided function, e.g. idempotency_handler. That is why I asked Igor to expand on responding to duplicate events as well as exceptions.

If you have some of those articles or ideas please bring it on ;)

@mwarkentin

One of the stripe devs has some good articles (somewhat PG related though):

https://brandur.org/idempotency-keys

https://stripe.com/blog/idempotency

@michaelbrewer
Contributor

@mwarkentin - yep, those are the articles I was referring to.

I know initially we don't want to bloat v0.01 of this feature, but I would either flag it as an alpha API with possible breaking changes, or factor in where this could go in the API design.

For example, the implementation of this feature varies based on how it is being integrated. In the context of API Gateway or AppSync we might want helper functions to return errors, HTTP headers, HTTP error codes, etc.

@michaelbrewer
Contributor

@heitorlessa maybe we should create a shared repo for RFCs and docs shared by the TypeScript/Python and Java implementations, so we keep the designs in sync and help other maintainers keep their implementations up to date.

@michaelbrewer
Contributor

@heitorlessa @nmoutschen should we collate these requirements and start on a prototype implementation?

@heitorlessa
Contributor

Sorry folks I'm unwell but this is being taken care of -- cc @cakepietoast to provide an answer

I'll reply to the shared RFC repo as soon as I'm recovered

@to-mc

to-mc commented Dec 9, 2020

Get better soon @heitorlessa! I'm working with @igorlg on this, we expect to have at least the starting point of an implementation added to the repo in the next week or so based on this RFC.

@to-mc

to-mc commented Jan 11, 2021

@igorlg @mwarkentin @michaelbrewer @heitorlessa curious to hear your thoughts on how you think exception handling should work. Given a lambda is invoked and raises an exception, should we:

  1. Ignore it completely in the idempotency logic, meaning the lambda code will not be "protected" from retries which cause an unhandled exception. This simplifies the code, and will apply idempotency logic only for successfully completed lambda executions.
  2. Pickle the exception and store it in the idempotency store. For the next execution with a matching idempotency key, we can unpickle and return the exact same exception.
  3. Find an alternative to pickle to serialize and deserialize the exception. I'm having a hard time thinking of a good solution for this that doesn't have the same security consequences as pickle. It would likely require either a third party lib or some relatively complex code to implement.
  4. Store a serialized representation of the exception. For the next execution with a matching idempotency key, return a generic exception with some details from the originally thrown exception. This will complicate error handling for the caller and wouldn't be truly idempotent as the response would not be the same as the original invocation.
  5. Something else entirely!

The current prototype implementation uses a combination of 1 and 2. We provide an option where a user can pass a list of exception classes that are not "retryable". Any subclass of those will then be pickled and stored as per 2. I'd rather avoid use of pickle here for security reasons, but getting rid of it involves some tradeoffs (points 1, 3 or 4). Personally I'm leaning towards 1, but want to get feedback from others before making any changes there.
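For reference, a pickle-free sketch of option 4 above (all names are hypothetical): persist only a plain representation of the exception and raise a generic wrapper on replay:

class IdempotencyReplayedError(Exception):
  """Raised on a replayed invocation whose original run failed."""

def serialize_exception(exc):
  # Safe to persist: plain strings only, no pickled bytecode
  return {"type": type(exc).__name__, "message": str(exc)}

def raise_from_record(record):
  # Not truly idempotent: the caller sees a wrapper, not the original class
  raise IdempotencyReplayedError(
    f"original invocation raised {record['type']}: {record['message']}"
  )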

@michaelbrewer
Contributor

I think we should separate errors that are due to bad input (missing a required parameter) from downstream errors (like failure to write to RDS).

  • For validation errors, we can handle them before any of the idempotency checks, and failures will result in a bad-input error of sorts. Retrying will produce the same result; a well-behaved client would correct the request and retry with a new idempotency key.

NOTE: I am assuming here that whenever we first handle an idempotent request, we create an empty locking record, which can later be used to short-circuit any retries with an already-in-progress type of error.

  • For downstream errors we should make this configurable.

REMOVAL - Remove the locking record and return the error. (Then retrying won't return the already-in-progress error, and if the downstream recovers, things continue as normal.)
FLAG_IN_ERROR - We update the locking record from in-progress to error with a JSON-stringified user error (note: this should be something we can safely persist) and then throw or return the error. Here retries would always fail, and even trying with a different idempotency key might also fail (if it is a downstream error).
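Sketched as a configuration option (names are assumptions, not a shipped API):

from enum import Enum

class OnDownstreamError(Enum):
  REMOVAL = "removal"              # delete the lock record; retries start fresh
  FLAG_IN_ERROR = "flag_in_error"  # persist the error; retries replay the failure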

@michaelbrewer
Contributor

@cakepietoast @heitorlessa I don't mind doing a session on Slack if it is easier to discuss the options.

@heitorlessa
Contributor

I'd suggest not handling exceptions in the idempotency utility, as you might enter the land of Circuit Breaker here - which is something we want to work on next.

Caching the exception and returning the same exception would require Pickle to do this cleanly, and yet it would create security concerns. It would also not give the function an opportunity to succeed, as the error might be related to something else (downstream, as Mike said) -- this can easily get convoluted, especially when Lambda might kill the container when it errors out.

This can also make troubleshooting difficult - try using a Circuit Breaker with the Idempotency utility in the wrong order, and you won't give retries a chance to succeed, while also increasing the number of calls to the persistent store (if Lambda kills the container).

If we're trying to be a good citizen for downstream I'd argue this is the land of Circuit Breaker, not within the idempotency scope.

That said, we'd be happy to discuss this on a call if need be.

@sthulb
Contributor

sthulb commented Jan 14, 2021

I agree with @michaelbrewer's observations: there should be a clear separation between downstream problems and ones caused by bad code in the function.

Any new invoke that happens because of local issues (bad function code) should be allowed to run as if it were the first time.

Idempotency should be about what happens when there is duplication of the event and how to handle that. One could argue that a function is idempotent if, when run, it doesn't change any stored state elsewhere, i.e. when running an insert you don't end up with two rows that look the same.

I'm not sure what could be offered by the powertools projects other than checking for a message ID and deduping it via something like Redis.

@pcolazurdo

Hi all, I think this is a very interesting RFC but, at first glance, there are some important challenges that don't seem to be addressed:

  • In the example above, the time that passes between persistence.find(key) and persistence.save_*(key, result) (including the time spent in the actual processing) allows concurrent Lambdas to be invoked for the same event on a different worker, which won't know anything about this execution until that time. Without a locking strategy on find, I don't think this code does what it intends to.
  • Also, failure to call or execute save_* (i.e. throttling) will allow non-idempotent executions. Adding a locking mechanism on find may help, but it will add complexity, as it has to handle more calls to the persistence layer and also needs a recovery mechanism for when the execution is aborted somehow, something like a short TTL. This TTL could be managed via Lambda Destinations, but that would restrict the types of invocation that can be handled, making it harder for the user to understand.
  • Using extensions adds additional complexity, as this will create an out-of-process call which may fail or be delayed, so the whole purpose of this utility would be voided.

Hope this helps

@to-mc

to-mc commented Jan 25, 2021

Thanks for the feedback Pablo! The first two points have been taken into account (and hopefully solved) in the concrete implementation which we're working on in the PR: aws-powertools/powertools-lambda-python#245. Your feedback is very welcome on that! For extensions, indeed we'll have issues if someone has an extension which can cause side effects. I don't think that's something we can guard against in the implementation, so I'll make a note to cover it in the documentation.

@pcolazurdo

@cakepietoast Thanks for pointing me to the implementation. My bad for not checking there first.

@to-mc

to-mc commented Jan 28, 2021

Just want to capture some additional suggestions I got from @pcolazurdo here:

  1. Emit metrics for ItemAlreadyExists, IdempotencyInconsistentStateError, ItemNotFoundError etc. This should be opt-in behaviour.
  2. Different TTLs for successful and in-progress (locked) records. If a record is stuck in progress for longer than the Lambda timeout period, it indicates that something went wrong during processing, and users may want to retry. However, for successfully completed operations it is more likely that users will want to avoid a duplicate operation for a longer period of time (see the sketch below).
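A minimal sketch of point 2, with illustrative values - the in-progress lock expires a little after the 15-minute Lambda maximum, while successful results are kept for 24 hours:

import time

LOCK_TTL_SECONDS = 20 * 60          # a little over the 15-minute Lambda maximum
RESULT_TTL_SECONDS = 24 * 60 * 60   # keep successful results for 24 hours

def expiry_timestamp(in_progress):
  # Epoch seconds to store in the record's DynamoDB TTL attribute
  ttl = LOCK_TTL_SECONDS if in_progress else RESULT_TTL_SECONDS
  return int(time.time()) + ttl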

@to-mc to-mc self-assigned this Feb 9, 2021
@to-mc

to-mc commented Feb 19, 2021

Merged PR aws-powertools/powertools-lambda-python#245. Will release in v1.11.0 next week!

@heitorlessa
Contributor

This is now available in 1.11.0 - Literally just launched ;) 🎉

@heitorlessa heitorlessa unpinned this issue Mar 6, 2021
@heitorlessa heitorlessa transferred this issue from aws-powertools/powertools-lambda-python Jul 27, 2021