[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288

Merged: 1 commit merged into sonic-net:master on Apr 21, 2022

Conversation

@yozhao101 (Contributor) commented Mar 20, 2022

Signed-off-by: Yong Zhao [email protected]

Why I did it

This PR fixes an issue where Monit cannot reset its internal counter while monitoring the memory usage of the telemetry container.

Specifically, the Monit configuration for monitoring the memory usage of the telemetry container is as follows:

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"

If the memory usage of the telemetry container exceeds 400MB 10 times within 20 cycles (minutes), the container is restarted.
Recently we observed that, after the telemetry container was restarted, its memory usage continuously increased from 400MB to 11GB within 1 hour, yet it was not restarted again during that 1-hour sliding window.

The reason is that Monit resets its counter if and only if the status of the monitored service changes from "Status failed" back to "Status ok". During this 1-hour sliding window, however, the status never changed back to "Status ok", so the counter was never reset and the restart action never fired again.

For each service monitored by Monit, there is an entry showing the monitoring status, monitoring mode, etc. For example, the following output from the command sudo monit status shows the status of the service that monitors the memory usage of telemetry:

    Program 'container_memory_telemetry'
        status               Status ok
        monitoring status    Monitored
        monitoring mode      active
        on reboot            start
        last exit value      0
        last output          -
        data collected       Sat, 19 Mar 2022 19:56:26

Every minute, Monit runs the script to check the memory usage of telemetry and updates the counter if the usage is larger than 400MB. Once Monit finds that the usage has exceeded 400MB 10 times within 20 minutes, the telemetry container is restarted. Following is an example status of the monitored service in that state:

    Program 'container_memory_telemetry'
        status               Status failed
        monitoring status    Monitored
        monitoring mode      active
        on reboot            start
        last exit value      0
        last output          -
        data collected       Tue, 01 Feb 2022 22:52:55

After the telemetry container was restarted, we found that its memory usage increased rapidly from around 100MB to more than 400MB within 1 minute, so the status of the monitored service never had a chance to change from "Status failed" back to "Status ok".
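
For context on the mechanics: the checker that Monit invokes every minute essentially reads the container's memory usage, compares it against the threshold passed on the command line (419430400 bytes for telemetry), and exits with status 3 on a violation. Below is a minimal sketch, not the actual /usr/bin/memory_checker script; in particular, resolving usage through docker inspect and the cgroup v1 filesystem is an assumption:

    #!/usr/bin/env python3
    # Minimal sketch of a container memory checker (illustrative only, not
    # the actual /usr/bin/memory_checker). Resolving the container ID via
    # 'docker inspect' and reading the cgroup v1 usage file are assumptions.
    import subprocess
    import sys

    def check_memory_usage(container_name, threshold_value):
        # Resolve the container name to its full ID to locate its cgroup.
        container_id = subprocess.check_output(
            ["docker", "inspect", "--format", "{{.Id}}", container_name],
            universal_newlines=True).strip()
        with open("/sys/fs/cgroup/memory/docker/{}/memory.usage_in_bytes"
                  .format(container_id)) as usage_file:
            mem_usage_bytes = int(usage_file.read().strip())
        if mem_usage_bytes > threshold_value:
            print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!"
                  .format(container_name, mem_usage_bytes, threshold_value))
            # Monit's rule matches on this exit code: 'if status == 3 ...'.
            sys.exit(3)
        sys.exit(0)

    if __name__ == "__main__":
        check_memory_usage(sys.argv[1], int(sys.argv[2]))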

How I did it

To work around this issue, this PR uses a syntax format recently introduced by Monit: a repeat every <n> cycles qualifier on exec. With this qualifier, Monit re-executes the background script every <n> cycles for as long as the error persists.
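
Concretely, the telemetry rule becomes (this is exactly the change shown in the diff below):

  check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
      if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles

With this clause, once the 10-in-20 condition has been met, Monit re-runs the restart script every 2 cycles while status == 3 persists, even though its counter is never reset.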

How to verify it

I verified this change on the lab device str-s6000-acs-12. A companion pytest PR (sonic-net/sonic-mgmt#5492) has been submitted in the sonic-mgmt repo for review.

Which release branch to backport (provide reason below if selected)

  • [ ] 201811
  • [ ] 201911
  • [x] 202006
  • [x] 202012
  • [x] 202106
  • [x] 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@yozhao101 yozhao101 changed the title [Monit] Restart culprit container if Monit can't reset its counter [Monit] Repeat restarting culprit container if Monit can't reset its counter Mar 20, 2022
@yozhao101 yozhao101 marked this pull request as ready for review March 21, 2022 18:14
@yozhao101 yozhao101 requested a review from lguohan as a code owner March 21, 2022 18:14
@yozhao101 yozhao101 requested a review from qiluo-msft March 21, 2022 18:25
@@ -85,6 +85,8 @@ def check_memory_usage(container_name, threshold_value):
    if mem_usage_bytes > threshold_value:
        print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!"
              .format(container_name, mem_usage_bytes, threshold_value))
@qiluo-msft (Collaborator) commented Apr 6, 2022

> print

Do you really need print? #Closed

@yozhao101 (Contributor, Author) replied Apr 6, 2022

The message text from the print(...) statement is appended to the end of the Monit alert message in syslog. Please review the following example:

    Jan 2 02:34:56.029949 DSM07-0101-0613-10BT0 ERR monit[496]: 'container_memory_telemetry' status failed (3) -- [telemetry]: Memory usage (489475276.8 Bytes) is larger than the threshold (419430400 Bytes)!

@@ -2,4 +2,4 @@
 ## Monit configuration for telemetry container
 ###############################################################################
 check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
-    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
+    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles
@qiluo-msft (Collaborator) commented Apr 6, 2022

> 2

How did you choose the magic number 2? #Closed

@yozhao101 (Contributor, Author) replied

repeat every 2 cycles here means that if Monit fails to reset its counter and the memory usage of telemetry stays above the threshold (here 400MB) for another 2 minutes, the telemetry container will be restarted again.

I selected the number 2 since the memory usage of telemetry can grow very quickly, from MBs to around a GB within 2 minutes, so telemetry should be restarted ASAP.

Other numbers could be selected as well; do you have any suggestions?

@@ -2,4 +2,4 @@
 ## Monit configuration for telemetry container
 ###############################################################################
 check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
-    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
+    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles
@qiluo-msft (Collaborator) commented Apr 6, 2022

> exec

Please explain how you tested this PR. The test could be manual or, ideally, automated. You need to make sure the old code fails the test and the new code passes it. #Closed

@yozhao101 (Contributor, Author) replied

As we discussed at last week's meeting, I am working on a pytest test case PR to make sure the old code fails while the new code succeeds.

@yozhao101 (Contributor, Author) replied

The pytest PR has been submitted for review: sonic-net/sonic-mgmt#5492.

@@ -2,4 +2,4 @@
 ## Monit configuration for telemetry container
 ###############################################################################
 check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
-    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
+    if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles
@qiluo-msft (Collaborator) commented Apr 6, 2022

> repeat

You mentioned this is a Monit feature. Please make sure Monit on all the backport branches has this feature. #Closed

@yozhao101 (Contributor, Author) replied

I will double-check whether this new syntax exists in the 201811, 201911, and 202012 images.

@yozhao101 (Contributor, Author) replied

In the 201811, 201911, and 202012 images, Monit has the same version, 5.20.0. I checked this Monit syntax with the 20191130.83 image on a lab device and it did work.

@yozhao101 yozhao101 merged commit e24fe9b into sonic-net:master Apr 21, 2022
@yozhao101 yozhao101 deleted the fix_monitor_telemetry_memory branch April 21, 2022 01:08
qiluo-msft pushed a commit that referenced this pull request Apr 21, 2022
…10288)

yozhao101 added a commit to sonic-net/sonic-mgmt that referenced this pull request Apr 27, 2022
…w syntax as a workaround (#5492)

What is the motivation for this PR?
This PR adds two new test cases covering a Monit issue fixed by the PR (sonic-net/sonic-buildimage#10288) in the SONiC image repo.

How did you do it?
Specifically, the 'stress' utility is leveraged to continuously increase the memory usage of a container, then:

Test whether that container can be restarted by the script run by Monit.

Test whether that container can be restarted by the script run by Monit; if that container was restarted, then test that the script run by Monit was unable to restart the container anymore because Monit failed to reset its internal counter.

Test whether that container can be restarted by the script run by Monit; if that container was restarted, then test that the script run by Monit was able to restart the container with the help of the new Monit syntax even though Monit failed to reset its internal counter.

How did you verify/test it?
I tested manually against the DUT str-s6000-0n-5 with the 20201231.54 and 20191130.83 images; following are the test results with the 20201231.54 image.
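
For a sense of how such a test can be driven, here is a rough sketch only; the duthost helper usage, container name handling, and timings are assumptions, not the actual sonic-mgmt test code:

    import time

    def test_monit_repeats_restart(duthost):
        # Sketch only: assumes 'stress' is available inside the container and
        # that a plain sleep covers the Monit window (the real test would poll).
        # Record when the telemetry container was last (re)started.
        started_at = duthost.shell(
            "docker inspect -f '{{.State.StartedAt}}' telemetry")["stdout"]

        # Keep memory usage above the 400MB threshold using 'stress'.
        duthost.shell("docker exec -d telemetry stress --vm 1 --vm-bytes 1000M")

        # Wait past the '10 times within 20 cycles' window plus the
        # 'repeat every 2 cycles' grace period, then check for a restart.
        time.sleep(25 * 60)
        restarted_at = duthost.shell(
            "docker inspect -f '{{.State.StartedAt}}' telemetry")["stdout"]
        assert restarted_at != started_at, "telemetry was not restarted by Monit"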
wangxin pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Apr 28, 2022
…w syntax as a workaround (#5492)

liushilongbuaa pushed a commit to liushilongbuaa/sonic-buildimage that referenced this pull request Jun 20, 2022