extend Hasura service boot grace period from 0 to 180s #4051
Related to this ticket.
This PR introduces a 180s grace period for the Hasura service in an attempt to deal with a boot loop during scale up.
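As a rough illustration of why the grace period matters, the sketch below models a health check that replaces a task after several consecutive failures, with failures inside the grace period ignored. It is a simplified model and not AWS's actual scheduling logic: `TaskModel`, `simulateBoot`, and the boot time, check interval, and failure threshold are all hypothetical values chosen for the example. Only the 0s and 180s grace periods come from this PR.

```typescript
// Simplified, hypothetical model of a health check with a grace period.
// Not AWS's real algorithm; all names and timings here are assumptions.

interface TaskModel {
  bootSeconds: number;          // time the container needs before it passes health checks
  checkIntervalSeconds: number; // how often the health check runs (must be > 0)
  unhealthyThreshold: number;   // consecutive counted failures before the task is replaced
}

type Outcome = "healthy" | "replaced";

function simulateBoot(task: TaskModel, gracePeriodSeconds: number): Outcome {
  let consecutiveFailures = 0;
  for (let t = task.checkIntervalSeconds; ; t += task.checkIntervalSeconds) {
    // The task passes its first check once it has finished booting.
    if (t >= task.bootSeconds) return "healthy";
    // Failed checks inside the grace period are ignored.
    if (t <= gracePeriodSeconds) continue;
    consecutiveFailures += 1;
    if (consecutiveFailures >= task.unhealthyThreshold) return "replaced";
  }
}

// Assumed example: a task that needs 120s to boot, checked every 30s,
// replaced after 3 counted failures.
const exampleTask: TaskModel = {
  bootSeconds: 120,
  checkIntervalSeconds: 30,
  unhealthyThreshold: 3,
};

console.log(simulateBoot(exampleTask, 0));   // "replaced": failures at 30s, 60s, 90s all count
console.log(simulateBoot(exampleTask, 180)); // "healthy": early failures are ignored until boot completes
```

In this model the grace period only helps if the task finishes booting before the grace period ends and the failure threshold is then reached; a task that never becomes healthy is still replaced eventually.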
Context
During load testing, even with relatively small numbers of users, we immediately hit issues with Hasura.
When the service tries to scale up with a second task to deal with the load, the new task fails to boot to a healthy state (which the load balancer checks for before directing traffic to it), and AWS then has to deprovision it (usually within 2 minutes or so of boot).
The issue seems to be with the `Hasura` container itself, since the `HasuraProxy` container reports the following in the logs for one of these failed tasks:
Note the `502 Bad Gateway` status code, the `reverseproxy.statusError (reverseproxy.go:1299)` error trace (source), and the very short duration of the request.
See Notion doc for more detail.