Tuning AWS Rust SDK HTTP request settings for latency-aware Amazon DynamoDB applications #558
Comments
Interesting! What operation are you running and what error are you seeing? The SDK doesn't consider every error to be retryable, so it's possible that's the issue. Seeing more logs would also be helpful; could you include the output?
I am using GetItem and BatchGetItem. Where should I put that? I intercept the DynamoDB error in this way:
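The snippet itself isn't reproduced above; as an illustration only, intercepting the error with a recent version of the SDK looks roughly like this (the table name and key schema are placeholders, not the ones from the original code):

```rust
use aws_sdk_dynamodb::{error::SdkError, types::AttributeValue, Client};

async fn fetch_item(
    client: &Client,
    id: String,
) -> Option<std::collections::HashMap<String, AttributeValue>> {
    match client
        .get_item()
        .table_name("my-table") // placeholder table name
        .key("pk", AttributeValue::S(id)) // placeholder key schema
        .send()
        .await
    {
        Ok(output) => output.item,
        // Client-side timeouts (e.g. the call attempt timeout firing) surface as
        // a timeout variant of SdkError rather than as a service error from DynamoDB.
        Err(SdkError::TimeoutError(_)) => {
            eprintln!("GetItem timed out on the client side");
            None
        }
        Err(other) => {
            eprintln!("GetItem failed: {other:?}");
            None
        }
    }
}
```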
So the message that comes out is:
Ah, I see: timeouts are not currently considered retryable. Let me follow up and see whether that is the intended behavior. Generally we are very hesitant to retry something that could introduce widespread service degradation by increasing load.
Other SDKs do it; Go and Node, I think, both support these retries. Could you tell me what kinds of retries are supported?
The SDK will retry if DynamoDB responds with certain retryable errors, e.g. something like
The best practice from AWS is to set a lower timeout to force the SDK to retry, to avoid the performance degradation you get out of the box with the default timeout. It is one of the main principles for getting better performance, and many SDKs for other runtimes ship with default timeouts and retries.

Node: I am sure this retry behavior works out of the box; I used it all the time :) I expect timeouts to be included as retryable because it is pretty logical: the AWS network has variable latency, and I know that DynamoDB should respond in under 20ms, so instead of waiting 300ms I force a timeout on the HTTP request the SDK makes, to get a quick response. I don't think this would surprise anyone, but if you are worried about it, maybe you can add something new:

Currently, you do not time out any call; you just let it run until maybe the Lambda itself times out (which hides the real problem), so this is worse than having nothing at all. In Node, for example, it is like this:
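For the Rust SDK, a rough sketch of the same idea might look like the following. This is written against a recent version of the SDK (the builder names have changed since the version discussed in this thread), and the 20ms attempt budget and 3 attempts are illustrative values rather than a recommendation:

```rust
use std::time::Duration;

use aws_config::{retry::RetryConfig, timeout::TimeoutConfig, BehaviorVersion};

#[tokio::main]
async fn main() {
    // Give each HTTP attempt a short deadline and let the SDK retry,
    // instead of waiting on a single slow attempt.
    let timeouts = TimeoutConfig::builder()
        .operation_attempt_timeout(Duration::from_millis(20)) // per-attempt budget (illustrative)
        .build();

    let shared_config = aws_config::defaults(BehaviorVersion::latest())
        .timeout_config(timeouts)
        .retry_config(RetryConfig::standard().with_max_attempts(3)) // illustrative attempt count
        .load()
        .await;

    let client = aws_sdk_dynamodb::Client::new(&shared_config);
    // ... use `client` for GetItem / BatchGetItem as usual ...
    let _ = client;
}
```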
You're right; I think this is exactly how the Java SDK behaves today as well. This is definitely a bug, and one I hope to get a fix out for shortly.
I just tested it out and it is still failing (aws-sdk-dynamodb = "0.14.0"). Notice the Lambda duration:
I think you need to remove the call timeout (`with_call_timeout`) and only keep the call attempt timeout. The intention is that the timeout on the entire operation, retries included, is not itself retryable; it is a hard limit covering all retries.
(You'll note the error message has changed as well vs. the original.) We could improve these errors, though, to clearly describe the action you'd need to take.
Hi @rcoh, after a good 9 hours of testing, I can tell you this:
But the overall results are much worse now because, for example, it retries 10 times at 30ms each. I have my doubts about `with_call_timeout`. You said that this is the hard limit and it is not retryable. The documentation says:
If I have a batch operation like a Get, it can retry ten times for each Get, but what I would like is this: if several HTTP requests together (`with_call_timeout`) time out at 100ms overall, retry. It seems to me that you should implement retries for `with_call_timeout` as well. Moreover, there is no way to track the retries right now. It would be helpful to have some parameter like `DEBUG: true` to enable logs for the retries. For example:
I would be curious to know how many retries there are when I have high latency, and the timestamp of each request, because I think retries with exponential backoff are also not implemented in the SDK.
Retry information is currently logged via `tracing`. An easy way to observe this is via the `RUST_LOG` environment variable. The call timeout is intended to be a non-retryable timeout for the overall operation; basically, it means "stop everything after this time, no matter what." The attempt timeout is intended to be retried, and there may be several attempts for a single call. Retry with exponential backoff is implemented, and I've observed it working in a real-world project (I wasn't using timeouts, though).
Can you explain what you mean by this a little more?
First of all, thank you for the answer. I already had RUST_LOG in my Lambda, so it is good to know that I can use it for debugging. I will let it run and see whether I see something in CloudWatch.
The idea behind setting the timeout is to avoid letting DynamoDB run for seconds before returning something (bytes). Without `with_call_timeout`, when DynamoDB does not respond quickly (which happens often), I saw the Lambda duration go very high, like > 5s, just to load bytes from a query made with PK and SK. The idea is for me to return as fast as possible, so I need `with_call_timeout`, and in fact I end up having:
After the 95ms mark I time out and retry manually (a while loop of 3), and this is faster overall (285ms max) than letting DynamoDB run as long as it wants. Speaking from the height of my ignorance, I would say that all the timeout functions should be retryable. Still, if you say otherwise and this is exactly how the other AWS SDKs behave, I will accept it, though I would suggest adding better comments or examples so that this is clear to the next person.
I expect the difference is that your retry is possibly without backoff?
I tested it in two ways:
For the backoff I use `thread::sleep`, and I multiply `retry_count * 20ms`.
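In other words, something along these lines (a sketch of the manual retry described here, not the exact test code; the table name and key are placeholders, and `tokio::time::sleep` stands in for `thread::sleep` so the async executor isn't blocked):

```rust
use std::time::Duration;

use aws_sdk_dynamodb::{types::AttributeValue, Client};

async fn get_with_manual_retry(
    client: &Client,
    id: String,
) -> Option<std::collections::HashMap<String, AttributeValue>> {
    let max_retries: u64 = 3;
    for retry_count in 0..=max_retries {
        // Linear backoff: sleep retry_count * 20ms before each retry (no sleep before the first attempt).
        if retry_count > 0 {
            tokio::time::sleep(Duration::from_millis(retry_count * 20)).await;
        }
        match client
            .get_item()
            .table_name("my-table") // placeholder
            .key("pk", AttributeValue::S(id.clone())) // placeholder key schema
            .send()
            .await
        {
            Ok(output) => return output.item,
            Err(err) => eprintln!("attempt {retry_count} failed: {err}"),
        }
    }
    None
}
```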
I am sorry to keep going, but I am unsure whether retries are happening or whether the debug logs are present.
In my Environment variables I have:
Inside CloudWatch, I see:
I infer that the debug logs for the retries are not showing up. The GetItem operation perhaps worked:
This result is 90ms, and usually the Lambda runs in 8ms, so the total 98ms duration matches, unless it is one of the BatchGetItem operations, but the INFO logs do not say. Am I missing something, or have I found something else?
Ah, I keep forgetting they moved the environment variable functionality into a separate crate feature. Sorry about that. It sounds like you're seeing the retries occur now?
So for Lambda, if you want to write it down somewhere, the correct configuration is:
RUST_LOG is not needed at all. |
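A sketch of that kind of setup (not the exact configuration from the thread): it assumes `tracing-subscriber` with its `env-filter` feature enabled and a hard-coded filter directive, which is why `RUST_LOG` is not required:

```rust
// Cargo.toml needs something like:
// tracing-subscriber = { version = "0.3", features = ["env-filter"] }
use tracing_subscriber::EnvFilter;

fn init_logging() {
    // A hard-coded filter directive means no RUST_LOG environment variable is
    // needed; raise specific targets to `debug` to surface the SDK's retry logs.
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::new("info,aws_smithy_http=debug")) // target names are illustrative
        .init();
}
```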
When `call_timeout` is triggered, the request will fail irrespective of the retry setting: `call_timeout` defines the maximum time the request can take, retries included. It can be solved either by setting `call_timeout` to at least retries * `call_attempt_timeout` or by not setting `call_timeout` at all. As per the thread, the `call_attempt` and retry settings are enough. awslabs/aws-sdk-rust#558
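A small sketch of that rule with illustrative values; the point is only the relationship between the two timeouts:

```rust
use std::time::Duration;

fn main() {
    // Illustrative per-attempt budget and attempt count, not a recommendation.
    let call_attempt_timeout = Duration::from_millis(30);
    let max_attempts: u32 = 5;

    // If an overall call timeout is set at all, it must cover every attempt
    // (plus some headroom for backoff); otherwise it cuts the retries short.
    let call_timeout = call_attempt_timeout * max_attempts + Duration::from_millis(100);
    assert!(call_timeout >= call_attempt_timeout * max_attempts);
    println!("call_timeout would need to be at least {call_timeout:?}");
}
```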
Describe the bug
As per the discussion in #557: the DynamoDB SDK does not retry based on `with_max_attempts`; it exits at the first failure.
The workaround is to wrap the queries in custom retry code:
Expected Behavior
If I set a timeout of 30ms and 5 retries, for example, I expect the Lambda duration to be somewhere between 30ms and 150ms depending on the retries, not an immediate exit.
Current Behavior
The SDK does not retry at all. I think this applies to all the clients, not just DynamoDB.
Reproduction Steps
Set the following configuration:
You will get this message:
request has timed out: API call (single attempt) timeout occurred after 30ms.
But you will notice that the Lambda Duration is 32ms, so the SDK did not retry.
Possible Solution
Implement retries with exponential backoff like the other AWS SDKs.
Additional Information/Context
Libs:
Version
Environment details (OS name and version, etc.)
Mac
Logs
ERROR lambda_runtime: SdkError("request has timed out: API call (single attempt) timeout occurred after 30ms")