Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288

Merged
merged 1 commit into from
Apr 21, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@
## Monit configuration for telemetry container
###############################################################################
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400"
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry"
if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles
Copy link
Collaborator

@qiluo-msft qiluo-msft Apr 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2

How do you choose the magic number 2? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeat every 2 cycles at here means if Monit failed to reset its counter and memory usage of telemetry is still larger than the threshold (at here 400MB) for 2 minutes, then telemetry container will be restarted.

I selected the number 2 since memory usage of telemetry possibly increased very quickly from MB to around GB within 2 minutes and telemetry should be restarted ASAP.

Other numbers can be selected as well and do you have any suggestion?

Copy link
Collaborator

@qiluo-msft qiluo-msft Apr 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exec

Please explain how do you test this PR? The test could be either manual or ideally automated. You need to make sure the old code fails the test and new code passes the test. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed at last week's meeting, I am working on the PR of pytest testcase to make sure the old code will fail while new one will succeed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pytest PR is submitted for review: sonic-net/sonic-mgmt#5492.

Copy link
Collaborator

@qiluo-msft qiluo-msft Apr 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeat

You mentioned this is a monit feature. Please make sure monit on all the backport branches have this feature. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will double check whether this new syntax existed on 201811, 201911 and 202012 images or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 201811, 201911 and 202012 images, Monit has the same version 5.20.0. I checked this syntax of Monit with 20191130.83 image on lab device and it did work.

2 changes: 2 additions & 0 deletions files/image_config/monit/memory_checker
Original file line number Diff line number Diff line change
Expand Up @@ -85,6 +85,8 @@ def check_memory_usage(container_name, threshold_value):
if mem_usage_bytes > threshold_value:
print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!"
Copy link
Collaborator

@qiluo-msft qiluo-msft Apr 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print

Do you really need print? #Closed

Copy link
Contributor Author

@yozhao101 yozhao101 Apr 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The message text from statement print(...) will be appended to the end of Monit alerting message in syslog. Please review the following example:

    Jan 2 02:34:56.029949 DSM07-0101-0613-10BT0 ERR monit[496]: 'container_memory_telemetry' status failed (3) -- [telemetry]: Memory usage (489475276.8 Bytes) is larger than the threshold (419430400 Bytes)!

.format(container_name, mem_usage_bytes, threshold_value))
syslog.syslog(syslog.LOG_INFO, "[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!"
.format(container_name, mem_usage_bytes, threshold_value))
sys.exit(3)
else:
syslog.syslog(syslog.LOG_ERR, "[memory_checker] Failed to retrieve memory value from '{}'"
Expand Down