Double the initial delay and timeout for worker container liveness probes #20323
Conversation
Some environments we deploy to need quite a bit of extra time for the worker pods to come up.
Checked commit carbonin@4e43752 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint.
¯\_(ツ)_/¯ I wonder how high the timeout can go before the app is completely unusable
Ask @jrafanie, I think he was dealing with multi-minute API requests the other day...
At some point I think it's better to have the pods killed and recognize that the environment is the issue rather than chasing down multi-minute API/UI requests.
This is all the logging I managed to capture before we increased the web service worker pod count.
The problem is that individual worker pod configurations could be wrong while the rest of the app is fine. If any worker pod is exceeding these new numbers, it should definitely go down. It might come back and keep restarting, and I guess that constant recycling would be a good way to track when there's an environmental/configuration problem.
Sorry, that comment was mostly sarcasm. In this instance the app couldn't come up at all, I believe, so we at least need that to succeed. I agree that we still need reasonable thresholds for liveness checks once the app is up, and that ideally a worker would only fail those checks if it was having a problem. My guess is that this change will satisfy those requirements for environments where performance is generally not very good, but it might make these kinds of issues slightly harder to detect in a very high performance environment. For example, if you are typically able to run the liveness check in 0.5 seconds but it's now taking 7 seconds, you probably have an issue worth investigating; this patch will prevent us from seeing that problem.
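To illustrate that tradeoff, here is a hedged sketch of the kind of tighter probe a very fast environment could use; the probe command and the numbers are assumptions for the example, not settings from this PR:

```yaml
# Hypothetical tighter probe for a high-performance environment:
# a check that normally takes ~0.5s but suddenly takes ~7s would time out here
# and surface as container restarts, while a 10s timeout would hide the slowdown.
livenessProbe:
  exec:
    command: ["/usr/local/bin/worker_liveness_check"]  # assumed probe script, for illustration only
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3   # Kubernetes default: restart after three consecutive failures
```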
Double the initial delay and timeout for worker container liveness probes (cherry picked from commit 666a4a4)
Jansa backport details:
Some environments we deploy to need quite a bit of extra time for the worker pods to come up.
This required adding a value for periodSeconds. The default is to check every 10 seconds, but if we also have the timeout at 10 seconds, we really want to leave some room between the checks.
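As a rough illustration of how the three settings interact, here is a minimal sketch of a worker container liveness probe in a Kubernetes deployment spec. The probe command and the specific second values are assumptions for the example, not necessarily the exact settings introduced by this PR:

```yaml
# Minimal sketch of a worker liveness probe (command and values are illustrative).
livenessProbe:
  exec:
    command: ["/usr/local/bin/worker_liveness_check"]  # assumed probe script, for illustration only
  initialDelaySeconds: 240  # doubled: give slow environments more time before the first check
  timeoutSeconds: 10        # doubled: a slow-but-healthy check should not fail immediately
  periodSeconds: 15         # added: keep the gap between checks larger than the timeout
  failureThreshold: 3       # Kubernetes default: restart after three consecutive failures
```

With periodSeconds left at the Kubernetes default of 10, a check that used the full 10-second timeout would leave no gap before the next one starts; setting the period above the timeout leaves some room between checks.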