
Double the initial delay and timeout for worker container liveness probes #20323

Merged: 1 commit into ManageIQ:master on Jul 1, 2020

Conversation

carbonin (Member) commented Jul 1, 2020

Some environments we deploy to need quite a bit of extra time for the worker pods to come up.

This also required adding a value for periodSeconds. The default is to check every 10 seconds, but with the timeout also at 10 seconds we really want to leave some room between checks.
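For context, the settings in question are fields on the worker pods' Kubernetes liveness probe. A minimal sketch of what such a probe looks like, with a hypothetical check command and hypothetical numbers rather than the exact values changed in this PR:

livenessProbe:
  exec:
    command: ["/usr/local/bin/worker_liveness_check"]  # hypothetical check script
  initialDelaySeconds: 480  # grace period after container start; doubled so slow environments can finish booting
  timeoutSeconds: 10        # each individual check may take up to 10s before it counts as a failure
  periodSeconds: 15         # set explicitly; the 10s default would leave no gap after a 10s timeout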

miq-bot (Member) commented Jul 1, 2020

Checked commit carbonin@4e43752 with ruby 2.5.7, rubocop 0.69.0, haml-lint 0.28.0, and yamllint
1 file checked, 0 offenses detected
Everything looks fine. ⭐

bdunne (Member) left a comment


¯\_(ツ)_/¯ I wonder how high the timeout can go before the app is completely unusable

@bdunne bdunne merged commit 666a4a4 into ManageIQ:master Jul 1, 2020
carbonin (Member, Author) commented Jul 1, 2020

> I wonder how high the timeout can go before the app is completely unusable

Ask @jrafanie, I think he was dealing with multi-minute API requests the other day ...

@carbonin carbonin deleted the double_worker_container_timeouts branch July 1, 2020 15:36
bdunne (Member) commented Jul 1, 2020

>> I wonder how high the timeout can go before the app is completely unusable
>
> Ask @jrafanie, I think he was dealing with multi-minute API requests the other day ...

At some point I think it's better to have the pods killed and recognize that the environment is the issue rather than chasing down multi-minute API/UI requests.

jrafanie (Member) commented Jul 1, 2020

>>> I wonder how high the timeout can go before the app is completely unusable
>>
>> Ask @jrafanie, I think he was dealing with multi-minute API requests the other day ...
>
> At some point I think it's better to have the pods killed and recognize that the environment is the issue rather than chasing down multi-minute API/UI requests.

This is all the logging I managed to capture before we increased the web service worker pod count.

[----] I, [2020-06-25T16:52:12.895488 #8:2ae4b0e11968]  INFO -- : Completed 200 OK in 102841ms (Views: 0.2ms | ActiveRecord: 77764.4ms)
[----] I, [2020-06-25T16:52:33.699402 #8:2ae4b0e11968]  INFO -- : Completed 200 OK in 20638ms (Views: 0.4ms | ActiveRecord: 4958.9ms)
[----] I, [2020-06-25T16:52:36.872538 #8:2ae4b0e11968]  INFO -- : Completed 200 OK in 3003ms (Views: 0.2ms | ActiveRecord: 2192.1ms)
[----] I, [2020-06-25T16:52:41.689977 #8:2ae4b0e11968]  INFO -- : Completed 200 OK in 4623ms (Views: 0.2ms | ActiveRecord: 3285.5ms)
[----] I, [2020-06-25T16:52:51.529104 #8:2ae4b0e10644]  INFO -- : Completed 200 OK in 108590ms (Views: 0.5ms | ActiveRecord: 81136.6ms)
[----] I, [2020-06-25T16:53:10.784943 #8:2ae4b0e10644]  INFO -- : Completed 200 OK in 18994ms (Views: 0.4ms | ActiveRecord: 5888.3ms)
[----] I, [2020-06-25T16:53:13.898550 #8:2ae4b0e10644]  INFO -- : Completed 200 OK in 2956ms (Views: 0.2ms | ActiveRecord: 2402.0ms)
[----] I, [2020-06-25T16:53:17.898234 #8:2ae4b0e11738]  INFO -- : Completed 200 OK in 78666ms (Views: 0.3ms | ActiveRecord: 57311.7ms)
[----] I, [2020-06-25T16:53:20.299002 #8:2ae4b0e11738]  INFO -- : Completed 200 OK in 2215ms (Views: 0.2ms | ActiveRecord: 1868.4ms)
[----] I, [2020-06-25T16:53:22.241667 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 253713ms (Views: 0.4ms | ActiveRecord: 207443.3ms)
[----] I, [2020-06-25T16:53:26.302834 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 3989ms (Views: 7.6ms | ActiveRecord: 2427.3ms)
[----] I, [2020-06-25T16:53:29.965678 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 3580ms (Views: 0.3ms | ActiveRecord: 3073.3ms)
[----] I, [2020-06-25T16:53:33.849383 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 3680ms (Views: 0.2ms | ActiveRecord: 2754.4ms)
[----] I, [2020-06-25T16:53:37.346233 #8:2ae4b0e108c4]  INFO -- : Completed 200 OK in 3261ms (Views: 0.4ms | ActiveRecord: 2845.5ms)
[----] I, [2020-06-25T16:53:55.515446 #8:2ae4b0e11648]  INFO -- : Completed 200 OK in 117064ms (Views: 0.3ms | ActiveRecord: 91285.7ms)

The problem is that an individual worker pod's configuration could be wrong while the rest of the app is fine. If any worker pod is exceeding these new numbers, it should definitely go down... it might come back and may keep restarting, and I guess that constant recycling would be a good way to track when there's an environmental/configuration problem.
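As a rough guide (assuming the Kubernetes default failureThreshold of 3 and the hypothetical numbers from the sketch above, not the values in this PR), that recycling happens on a fairly predictable schedule:

failureThreshold: 3  # Kubernetes default: three consecutive failed checks trigger a restart
# With periodSeconds: 15 and timeoutSeconds: 10 as in the sketch above, a worker that
# stops responding is restarted after roughly 3 * 15s = 45s, plus up to 10s for the
# last in-flight check. Each restart increments the pod's restart count, which is the
# recycling signal described above.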

carbonin (Member, Author) commented Jul 1, 2020

Sorry, that comment was mostly sarcasm. In this instance I believe the app couldn't come up at all, so we at least need that to succeed.

I agree that we still need reasonable thresholds for liveness checks once the app is up and that ideally a worker would only fail those checks if it was having a problem.

My guess is that this change will satisfy those requirements for environments where performance is generally not very good, but it might make these kinds of issues slightly harder to detect in a very high-performance environment. For example, if the liveness check typically completes in 0.5 seconds but is now taking 7 seconds, you probably have an issue worth investigating, and this patch will prevent us from seeing that problem.

simaishi pushed a commit that referenced this pull request Jul 1, 2020
Double the initial delay and timeout for worker container liveness probes

(cherry picked from commit 666a4a4)
simaishi (Contributor) commented Jul 1, 2020

Jansa backport details:

$ git log -1
commit 3859794ef0fb2a7ccaadccec3c175cfbe1c556e4
Author: Brandon Dunne <[email protected]>
Date:   Wed Jul 1 11:33:33 2020 -0400

    Merge pull request #20323 from carbonin/double_worker_container_timeouts

    Double the initial delay and timeout for worker container liveness probes

    (cherry picked from commit 666a4a4c60f095dedf18d5af2eeaded155b24bde)
