
Double the initial delay and timeout for worker container liveness probes #20323

Merged: 1 commit into ManageIQ:master on Jul 1, 2020

Conversation

carbonin (Member) commented Jul 1, 2020

Some environments we deploy to need quite a bit of extra time for the worker pods to come up.

This also required adding a value for periodSeconds. The default is to check every 10 seconds, but with the timeout also at 10 seconds we really want to leave some room between checks.
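For context, the settings in question are fields on the worker pods' Kubernetes liveness probe. A minimal sketch of what such a probe looks like, with a hypothetical check command and hypothetical numbers rather than the exact values changed in this PR:

livenessProbe:
  exec:
    command: ["/usr/local/bin/worker_liveness_check"]  # hypothetical check script
  initialDelaySeconds: 480  # grace period after container start; doubled so slow environments can finish booting
  timeoutSeconds: 10        # each individual check may take up to 10s before it counts as a failure
  periodSeconds: 15         # set explicitly; the 10s default would leave no gap after a 10s timeout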

miq-bot (Member) commented Jul 1, 2020

Checked commit carbonin@4e43752 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
1 file checked, 0 offenses detected
Everything looks fine. ⭐

bdunne (Member) left a comment


¯\_(ツ)_/¯ I wonder how high the timeout can go before the app is completely unusable

@bdunne bdunne merged commit 666a4a4 into ManageIQ:master Jul 1, 2020
carbonin (Member, Author) commented Jul 1, 2020

> I wonder how high the timeout can go before the app is completely unusable

Ask @jrafanie, I think he was dealing with multi-minute API requests the other day ...

@carbonin carbonin deleted the double_worker_container_timeouts branch July 1, 2020 15:36
bdunne (Member) commented Jul 1, 2020

>> I wonder how high the timeout can go before the app is completely unusable
>
> Ask @jrafanie, I think he was dealing with multi-minute API requests the other day ...

At some point I think it's better to have the pods killed and recognize that the environment is the issue rather than chasing down multi-minute API/UI requests.

jrafanie (Member) commented Jul 1, 2020

>>> I wonder how high the timeout can go before the app is completely unusable
>>
>> Ask @jrafanie, I think he was dealing with multi-minute API requests the other day ...
>
> At some point I think it's better to have the pods killed and recognize that the environment is the issue rather than chasing down multi-minute API/UI requests.

This is all the logging I managed to capture before we increased the web service worker pod count.

[----] I, [2020-06-25T16:52:12.895488 #8:2ae4b0e11968]  INFO -- : Completed 200 OK in 102841ms (Views: 0.2ms | ActiveRecord: 77764.4ms)
[----] I, [2020-06-25T16:52:33.699402 #8:2ae4b0e11968]  INFO -- : Completed 200 OK in 20638ms (Views: 0.4ms | ActiveRecord: 4958.9ms)
[----] I, [2020-06-25T16:52:36.872538 #8:2ae4b0e11968]  INFO -- : Completed 200 OK in 3003ms (Views: 0.2ms | ActiveRecord: 2192.1ms)
[----] I, [2020-06-25T16:52:41.689977 #8:2ae4b0e11968]  INFO -- : Completed 200 OK in 4623ms (Views: 0.2ms | ActiveRecord: 3285.5ms)
[----] I, [2020-06-25T16:52:51.529104 #8:2ae4b0e10644]  INFO -- : Completed 200 OK in 108590ms (Views: 0.5ms | ActiveRecord: 81136.6ms)
[----] I, [2020-06-25T16:53:10.784943 #8:2ae4b0e10644]  INFO -- : Completed 200 OK in 18994ms (Views: 0.4ms | ActiveRecord: 5888.3ms)
[----] I, [2020-06-25T16:53:13.898550 #8:2ae4b0e10644]  INFO -- : Completed 200 OK in 2956ms (Views: 0.2ms | ActiveRecord: 2402.0ms)
[----] I, [2020-06-25T16:53:17.898234 #8:2ae4b0e11738]  INFO -- : Completed 200 OK in 78666ms (Views: 0.3ms | ActiveRecord: 57311.7ms)
[----] I, [2020-06-25T16:53:20.299002 #8:2ae4b0e11738]  INFO -- : Completed 200 OK in 2215ms (Views: 0.2ms | ActiveRecord: 1868.4ms)
[----] I, [2020-06-25T16:53:22.241667 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 253713ms (Views: 0.4ms | ActiveRecord: 207443.3ms)
[----] I, [2020-06-25T16:53:26.302834 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 3989ms (Views: 7.6ms | ActiveRecord: 2427.3ms)
[----] I, [2020-06-25T16:53:29.965678 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 3580ms (Views: 0.3ms | ActiveRecord: 3073.3ms)
[----] I, [2020-06-25T16:53:33.849383 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 3680ms (Views: 0.2ms | ActiveRecord: 2754.4ms)
[----] I, [2020-06-25T16:53:37.346233 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 3261ms (Views: 0.4ms | ActiveRecord: 2845.5ms)
[----] I, [2020-06-25T16:53:55.515446 #8:2ae4b0e11648]  INFO -- : Completed 200 OK in 117064ms (Views: 0.3ms | ActiveRecord: 91285.7ms)

The problem is that an individual worker pod's configuration could be wrong while the rest of the app is fine. If any worker pod is exceeding these new numbers, it should definitely go down... it might come back and may keep restarting, and I guess that constant recycling would be a good way to track when there's an environmental/configuration problem.
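As a rough guide (assuming the Kubernetes default failureThreshold of 3 and the hypothetical numbers from the sketch above, not the values in this PR), that recycling happens on a fairly predictable schedule:

failureThreshold: 3  # Kubernetes default: three consecutive failed checks trigger a restart
# With periodSeconds: 15 and timeoutSeconds: 10 as in the sketch above, a worker that
# stops responding is restarted after roughly 3 * 15s = 45s, plus up to 10s for the
# last in-flight check. Each restart increments the pod's restart count, which is the
# recycling signal described above.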

carbonin (Member, Author) commented Jul 1, 2020

Sorry, that comment was mostly sarcasm. In this instance I believe the app couldn't come up at all, so we at least need that to succeed.

I agree that we still need reasonable thresholds for liveness checks once the app is up and that ideally a worker would only fail those checks if it was having a problem.

My guess is that this change will satisfy those requirements for environments where performance is generally not very good, but it might make these kinds of issues slightly harder to detect in a very high-performance environment. For example, if the liveness check typically completes in 0.5 seconds but is now taking 7 seconds, you probably have an issue worth investigating, and this patch will prevent us from seeing that problem.

simaishi pushed a commit that referenced this pull request Jul 1, 2020
Double the initial delay and timeout for worker container liveness probes

(cherry picked from commit 666a4a4)
simaishi (Contributor) commented Jul 1, 2020

Jansa backport details:

$ git log -1
commit 3859794ef0fb2a7ccaadccec3c175cfbe1c556e4
Author: Brandon Dunne <[email protected]>
Date:   Wed Jul 1 11:33:33 2020 -0400

    Merge pull request #20323 from carbonin/double_worker_container_timeouts

    Double the initial delay and timeout for worker container liveness probes

    (cherry picked from commit 666a4a4c60f095dedf18d5af2eeaded155b24bde)
