RFC: Function Idempotency Helper #28
Comments
Some comments:
Sorry, one more thing to consider. When there is an error, sometimes we might want to allow retrying with the same idempotency key. For example, if the lambda is calling a downstream service that is temporarily unavailable, calling again with the same idempotency key could make another downstream request; in this case we would want to remove any lock record that might have been created. However, if the end user passes in an invalid request, we should return a 400-type response and potentially not do a round trip to the persistence layer at all. This prevents bad requests from taking up any additional bandwidth. Otherwise, we could fall back to an in-memory local FIFO cache to serve back just the most recent retries. If we can somehow offload the DynamoDB writes to an async task, we can optimize for the common case of the first request going through. We should also consider using metrics or annotations to track cache hits, so that the developer can see how useful this feature really is. This feature is particularly useful for financial transactions that can take some time to process, where you don't want to double charge the end user.
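A minimal sketch of the two error paths described above, assuming a hypothetical `ValidationError` for bad input and an in-memory dict standing in for the DynamoDB persistence layer (any other exception is treated as retryable):

```python
records = {}  # stand-in for the DynamoDB persistence layer


class ValidationError(Exception):
    """Bad input from the caller; maps to a 400-style response."""


def validate(event):
    # Cheap local check, no round trip to the persistence layer.
    if "body" not in event:
        raise ValidationError("missing body")


def process(event, key, handler):
    try:
        validate(event)
    except ValidationError:
        return {"statusCode": 400, "body": "invalid request"}

    records[key] = {"status": "RUNNING"}   # lock record
    try:
        result = handler(event)
    except Exception:
        records.pop(key, None)             # downstream failure: free the key so a retry can succeed
        raise
    records[key] = {"status": "COMPLETED", "result": result}
    return result
```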
One feature we could pair with this is an idempotent queue: do some validation and checks before putting the request into DynamoDB or some other queue. Idempotency keys would prevent double submits, and the heavier processing can be handled by another process or lambda. Having the lock record at the beginning of the request also works when our lambda execution can run much longer than the caller's request timeout. As a separate RFC we could have a nonce feature, which is similar to this, but instead returns a single-use token. (But that is for another occasion 😄)
Great comments, @michaelbrewer ! Some inline replies.
Yes! I'll amend the flow description, it should be something like "Start -> Extract Key -> Check in DB if running or completed -> If running, return 'BUSY'; if complete, return saved result; else Insert 'RUNNING' -> Continue"
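A rough sketch of that amended flow, with an in-memory dict standing in for the persistence layer (`extract_idempotency_key` and the record shape are illustrative, not the final API):

```python
_records = {}  # key -> {"status": ..., "result": ...}


def extract_idempotency_key(event):
    # A real helper would evaluate a developer-supplied JMESPath expression.
    return event["requestId"]


def idempotent_call(event, handler):
    key = extract_idempotency_key(event)
    record = _records.get(key)

    if record and record["status"] == "RUNNING":
        return {"status": "BUSY"}          # another execution is in flight
    if record and record["status"] == "COMPLETED":
        return record["result"]            # replay the saved result

    _records[key] = {"status": "RUNNING"}  # insert the lock record
    result = handler(event)
    _records[key] = {"status": "COMPLETED", "result": result}
    return result
```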
In general terms, the developer needs to provide a uniquely identifiable value for each execution. If that requires a composite key, by all means. I only worry about the complexity of this implementation for v0.1
Agree 100%. This factors in the table design, particularly if we allow the solution to 'create table if not exists'
Agree 100% as well. Optional config flag to KMS encrypt fields. Worth mentioning the (possible) additional cost and time in the docs.
Not sure I follow. The only requirement for the idempotence key is that it uniquely identifies each execution. If requests are made with different parameters, they should have different idemp. keys...
Yes. A few options here. If the function throws an Exception, we can:
Absolutely, best to catch invalid requests as early as possible.
@igorlg - I agree that some of this feedback is based on more advanced use cases. Another interesting feature when pairing this with API Gateway would be to return a header parameter indicating a replayed response. Stripe has a pretty good implementation of idempotency - https://stripe.com/docs/idempotency
all great comments - my suggestion is to expand the RFC @igorlg to clarify on:
As to comments by @michaelbrewer on sensitive fields, composite keys, Extensions and metrics, here's my initial take on:
@heitorlessa should we allow for an in-memory LRU cache? Also, for in-progress errors we might want an overall timeout (or permanent failure), but we can't always assume the original lambda was able to run to completion, whether due to memory limits or lambda timeout. I think there will be more docs than code for this, plus a set of recommendations and warnings about cost. Stripe wrote a couple of good articles, and I am sure there are AWS ones out there too.
@michaelbrewer quite possibly, along with a TTL and max LRU cache size - a dict fits the purpose here. Not entirely sure yet on the upon-exception behaviour, i.e. whether we want to bust the LRU cache and call a provided function. If you have some of those articles or ideas please bring it on ;)
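For illustration, a dict-based local cache with both a TTL and a max LRU size might look roughly like this (a sketch only, not the proposed API):

```python
import time
from collections import OrderedDict


class LocalCache:
    """In-memory LRU cache with TTL, backed by an OrderedDict."""

    def __init__(self, max_items=256, ttl_seconds=300):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if time.time() > expires_at:
            del self._data[key]          # drop expired entries lazily
            return None
        self._data.move_to_end(key)      # mark as recently used
        return value

    def set(self, key, value):
        self._data[key] = (time.time() + self.ttl_seconds, value)
        self._data.move_to_end(key)
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used entry
```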
One of the Stripe devs has some good articles (somewhat Postgres-related though):
@mwarkentin - yep, those are the articles I was referring to. I know initially we don't want to bloat v0.01 of this feature, but I would either flag it as an alpha API with possible breaking changes, or factor in where this could go in the API design. For example, the implementation of this feature varies based on how it is being integrated: in the context of API Gateway or AppSync we might want helper functions to return errors, HTTP headers, HTTP error codes, etc.
@heitorlessa maybe we should create a shared repo for RFCs and docs shared by the TypeScript, Python, and Java implementations, so we keep the design in sync and help other maintainers keep their implementations up to date.
@heitorlessa @nmoutschen should we collate these requirements and start on a prototype implementation?
Sorry folks, I'm unwell but this is being taken care of -- cc @cakepietoast to provide an answer. I'll reply to the shared RFC repo as soon as I'm recovered.
Get better soon @heitorlessa! I'm working with @igorlg on this, we expect to have at least the starting point of an implementation added to the repo in the next week or so based on this RFC.
@igorlg @mwarkentin @michaelbrewer @heitorlessa curious to hear your thoughts on how you think exception handling should work. Given a lambda is invoked and raises an exception, should we:
The current prototype implementation uses a combination of 1 and 2. We provide an option where a user can pass a list of exception classes that are not "retryable". Any subclass of those will then be pickled and stored as per 2. I'd rather avoid use of pickle here for security reasons, but getting rid of it involves some tradeoffs (points 1, 3 or 4). Personally I'm leaning towards 1, but want to get feedback from others before making any changes there.
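To make the options concrete, a hypothetical sketch of combining 1 and 2 with a caller-supplied tuple of non-retryable exception classes (illustrative only, not the prototype's actual interface):

```python
import pickle

records = {}  # stand-in persistence layer


def run_with_idempotency(key, handler, event, non_retryable=()):
    try:
        result = handler(event)
    except non_retryable as exc:
        # Option 2: persist the exception so replays re-raise the same error.
        records[key] = {"status": "FAILED", "error": pickle.dumps(exc)}
        raise
    except Exception:
        # Option 1: treat everything else as retryable and drop the lock record.
        records.pop(key, None)
        raise
    records[key] = {"status": "COMPLETED", "result": result}
    return result
```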
I think we should separate errors that are due to bad input (missing a required parameter) from downstream errors (like a failure to write to RDS).
@cakepietoast @heitorlessa I don't mind doing a session on Slack if it is easier to discuss the options.
I'd suggest not handling exceptions in the idempotency utility, as you might enter the land of Circuit Breaker here - which is something we want to work on next. Caching the exception and returning the same exception would require Pickle to do cleanly, and would create security concerns. It also won't give the function an opportunity to succeed, as the error might be related to something else (downstream, as Mike said) -- this can easily get convoluted, especially since Lambda might kill the container when it errors out. It can also make troubleshooting difficult: try using the Circuit Breaker and Idempotency utilities in the wrong order and you won't give retries a chance to succeed, while also increasing the number of calls to the persistent store (if Lambda kills the container). If we're trying to be a good citizen to downstream services, I'd argue this is the land of Circuit Breaker, not within the idempotency scope. That said, we'd be happy to discuss this on a call if need be.
I agree with @michaelbrewer's observations: there should be a clear distinction between downstream problems and ones caused by bad code in the function. Any new invoke that happens because of local issues (bad function code) should be allowed to run as if it was the first time. Idempotency should be about what happens when there is duplication of the event and how to handle that. One could argue that a function is considered idempotent if, when run repeatedly, it doesn't change any stored state elsewhere, i.e. when running an insert you don't end up with two rows that look the same. I'm not sure what could be offered from the powertools projects other than checking for a Message ID and deduping it against something like Redis.
Hi all, I think this is a very interesting RFC but, at first glance, there are some important challenges that don't seem to be addressed:
Hope this helps
Thanks for the feedback Pablo! The first two points have been taken into account (and hopefully solved) in the concrete implementation which we're working on in the PR: aws-powertools/powertools-lambda-python#245. Your feedback is very welcome on that! For extensions, indeed we'll have issues if someone has an extension which can cause side effects. I don't think that's something we can guard against in the implementation, so I'll make a note to cover it in the documentation.
@cakepietoast Thanks for pointing me to the implementation. My bad for not checking there first.
Just want to capture some additional suggestions I got from @pcolazurdo here:
Merged PR aws-powertools/powertools-lambda-python#245. Will release in v1.11.0 next week!
This is now available in 1.11.0 - Literally just launched ;) 🎉 |
Key information
Summary
Helper to facilitate writing Idempotent Lambda functions.
The developer would specify (via JMESPath) which value from the event to use as a unique execution identifier; the helper would then search a persistence layer (e.g. DynamoDB) for that ID. If present, it returns the stored value and skips the function execution; otherwise it runs the function normally and persists the return value together with the execution ID.
Motivation
Idempotency is a very useful design characteristic of any system. It enables the seamless separation of successful and failed executions, and is particularly useful for Lambdas invoked by AWS Step Functions. It is also a design principle in the AWS Well-Architected Framework - Serverless Lens.
A broader description of this idea can be found here.
Proposal
Define a Python Decorator
@idempotent
which would receive as arguments a) the JMESPath of the event key to use as the execution ID, and b) optional storage backend configuration (e.g. DynamoDB table name, or Elasticsearch URL + index). This decorator would wrap the function execution in the following way (pseudo-python):
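A sketch of what that wrapping might look like, assuming the `jmespath` library for key extraction and a hypothetical persistence object exposing `get`/`save`:

```python
import base64
import functools
import json

import jmespath


def idempotent(key_jmespath, persistence):
    """Sketch of the proposed decorator; the persistence backend API is illustrative."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            key = jmespath.search(key_jmespath, event)
            previous = persistence.get(key)   # hypothetical lookup by execution ID
            if previous is not None:
                # A previous successful execution exists: return its saved result.
                return json.loads(base64.b64decode(previous))
            result = handler(event, context)
            persistence.save(key, base64.b64encode(json.dumps(result).encode()))
            return result
        return wrapper
    return decorator
```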
Usage then would be similar to:
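For example (illustrative only; `DynamoDBPersistence` and `charge_customer` are made-up names):

```python
# Hypothetical usage; "body.orderId" is just an example JMESPath expression.
@idempotent(key_jmespath="body.orderId",
            persistence=DynamoDBPersistence(table_name="idempotency"))
def lambda_handler(event, context):
    charge_customer(event["body"])  # side effect we only want to happen once
    return {"statusCode": 200, "body": "charged"}
```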
The decorator would first extract the unique execution ID from the Lambda event using the JMESPath provided, then check the persistence layer for a previous successful execution of the function. If found, it would fetch the previously returned value, de-serialize it (using base64 or something else) and return it instead; otherwise, it would execute the function handler normally, capture the returned object, serialize and persist it, and finally return it.
The persistence layer could be implemented initially with DynamoDB, either requiring the DDB table to exist before running the function or creating it during the first execution. It should be designed in such a way as to allow different backends in the future (e.g. Redis for VPC-enabled lambdas).
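One possible shape for that backend abstraction (class, method and attribute names here are assumptions, not a committed interface):

```python
from abc import ABC, abstractmethod

import boto3


class PersistenceLayer(ABC):
    """Illustrative backend interface so DynamoDB, Redis, etc. remain swappable."""

    @abstractmethod
    def get(self, key: str):
        """Return the saved result for key, or None if it has not completed yet."""

    @abstractmethod
    def save(self, key: str, result) -> None:
        """Persist the serialized result under key."""


class DynamoDBPersistence(PersistenceLayer):
    """Sketch only: the table schema (id/result attributes) is an assumption."""

    def __init__(self, table_name: str):
        self.table = boto3.resource("dynamodb").Table(table_name)

    def get(self, key: str):
        item = self.table.get_item(Key={"id": key}).get("Item")
        return item["result"] if item else None

    def save(self, key: str, result) -> None:
        self.table.put_item(Item={"id": key, "result": result})
```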
Drawbacks
This solution could have a noticeable performance impact on the execution of Lambda functions: every execution would require at least 1 and at most 2 accesses to the persistence layer.
No additional dependencies are required - DynamoDB access is provided by boto3, and object serialisation can use Python's native base64 encode/decode.
Rationale and alternatives
What other designs have been considered? Why not them?
No other designs considered at the moment. Open to suggestions.
What is the impact of not doing this?
Implementation of idempotent Lambda functions will have to be done 'manually' in every function.
Unresolved questions