25 max_retries is really long for a default #1209
Labels
enhancement
Requests to existing resources that expand the functionality or scope.
provider
Pertains to the provider itself, rather than any interaction with AWS.
I've been debugging an issue for the past few days where operations on Lambda functions were hanging forever. The real root cause eventually turned out to be a crappy corpware DNS resolver returning non-RFC-compliant replies that the golang net library didn't like. However, this still turned up Terraform issues that significantly delayed me in getting to the root of the problem.
`max_retries` of 25 as the default seems like way too many. If the request itself takes any significant amount of time to fail, you can have a resource stuck spinning for over an hour, especially because the default retry logic in the SDK is exponential.

The retry logic from `aws-sdk-go/aws/client/default_retryer` is effectively `(2**min(retryCount,13) * randint(30) + 30)` (in milliseconds). (Some of the numbers change to scale up faster if the error is AWS telling the client to throttle.)

The start of that function is pretty gentle. Assuming an average of 15 on the rand call, it's not until the 8th attempt that you are spending more than a second between calls. But after that, it scales quickly, as exponentials are wont to do. And by the 13th call, you've hit the scaling cap and spend an average of 2 minutes between calls. Summing up over the 25 retries, you can expect to wait an average of 26 mins on a failing request, plus whatever time it takes all 25 requests to fail (which can be substantial when the failure involves a request timeout).
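The arithmetic above can be checked with a short sketch. It assumes the approximate formula quoted in this issue (not the SDK's actual source) and an average of 15 for the random term. The function and constant names are made up for illustration, and the time the failing requests themselves take is ignored.

```python
# Rough model of the average backoff described in this issue.
# Assumption: delay_ms = 2**min(retry_count, 13) * randint(30) + 30,
# with randint(30) replaced by its assumed average of 15.
EXPONENT_CAP = 13   # exponent stops growing here
AVG_RAND = 15       # assumed average of randint(30)
BASE_MS = 30        # constant term in the quoted formula


def avg_delay_ms(retry_count):
    """Average sleep in milliseconds before retry `retry_count` (0-based)."""
    return 2 ** min(retry_count, EXPONENT_CAP) * AVG_RAND + BASE_MS


def total_sleep_s(max_retries):
    """Average total seconds spent sleeping across `max_retries` retries."""
    return sum(avg_delay_ms(n) for n in range(max_retries)) / 1000


if __name__ == "__main__":
    for retries in (10, 12, 14, 25):
        total = total_sleep_s(retries)
        print(f"max_retries={retries}: {total:.1f}s (~{total / 60:.1f} min) sleeping")
```

Under these assumptions the totals land close to the figures quoted here: a bit over 15 seconds for 10 retries, about a minute for 12, about 4 minutes for 14, and between 26 and 27 minutes for 25.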
Full sequence of average wait times between each request (in seconds):
Somewhere between 10 (total ~15s sleeping) and 12 (total ~60s sleeping) seems like a better default for `max_retries`. Maybe 14 tops, where we can expect Terraform to wait an average of 4 minutes before bailing.
Terraform Version
0.9.11 and 0.10-1rc1 (w/ aws 1.2)
Example HCL
I used the verbatim HCL from https://www.terraform.io/docs/providers/aws/r/lambda_function.html, but any old HCL will do.
Steps to Reproduce
terraform refresh/apply