[Monit] Repeat restarting culprit container if Monit can't reset its counter #10288
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,4 +2,4 @@ | |
## Monit configuration for telemetry container | ||
############################################################################### | ||
check program container_memory_telemetry with path "/usr/bin/memory_checker telemetry 419430400" | ||
- if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" | ||
+ if status == 3 for 10 times within 20 cycles then exec "/usr/bin/restart_service telemetry" repeat every 2 cycles | ||
As we discussed at last week's meeting, I am working on a PR with a pytest test case to make sure the old code fails while the new one succeeds.
The pytest PR is submitted for review: sonic-net/sonic-mgmt#5492.
I will double-check whether this new syntax exists on the 201811, 201911 and 202012 images.
In 201811, 201911 and 202012 images, Monit has the same version
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -85,6 +85,8 @@ def check_memory_usage(container_name, threshold_value): | |
if mem_usage_bytes > threshold_value: | ||
print("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!" | ||
The message text from statement
.format(container_name, mem_usage_bytes, threshold_value)) | ||
+ syslog.syslog(syslog.LOG_INFO, "[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!" | ||
+ .format(container_name, mem_usage_bytes, threshold_value)) | ||
sys.exit(3) | ||
else: | ||
syslog.syslog(syslog.LOG_ERR, "[memory_checker] Failed to retrieve memory value from '{}'" | ||
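The hunk above writes the threshold message to syslog as well as stdout before exiting with status 3, which is the value the Monit rule matches on. A minimal sketch of that pattern follows. It is not the actual memory_checker script: the function name `check_threshold`, the constant `EXIT_THRESHOLD_EXCEEDED` and the injectable `log` parameter are assumptions made for illustration.

```python
import syslog

# Exit status that the Monit rule in this PR tests for ("if status == 3").
EXIT_THRESHOLD_EXCEEDED = 3


def check_threshold(container_name, mem_usage_bytes, threshold_value, log=syslog.syslog):
    """Return the exit status a checker would report for one reading.

    Hypothetical helper: `log` is injectable so the function can be
    exercised without writing to the system log.
    """
    if mem_usage_bytes > threshold_value:
        message = ("[{}]: Memory usage ({} Bytes) is larger than the threshold ({} Bytes)!"
                   .format(container_name, mem_usage_bytes, threshold_value))
        print(message)                 # visible in Monit's program output
        log(syslog.LOG_INFO, message)  # also recorded in syslog
        return EXIT_THRESHOLD_EXCEEDED
    return 0
```

A caller would pass the result to `sys.exit()`, so that Monit sees status 3 only when the threshold is exceeded.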
How do you choose the magic number 2? #Closed
"repeat every 2 cycles" here means that if Monit fails to reset its counter and the memory usage of telemetry stays above the threshold (here 400 MB) for 2 minutes, the telemetry container will be restarted.
I selected the number 2 because the memory usage of telemetry can grow very quickly, from MB to around a GB within 2 minutes, and telemetry should be restarted as soon as possible.
Other numbers could be selected as well. Do you have any suggestions?
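To make the effect of the repeat clause concrete, here is a small simulation of the rule "if status == 3 for 10 times within 20 cycles then exec ... repeat every 2 cycles". It is a simplified model under stated assumptions, not Monit's actual event engine: the action fires once when the rule enters the failed state, and, if a repeat interval is given, fires again every N cycles while the rule stays failed. The function name `simulate_exec_cycles` and its parameters are hypothetical.

```python
from collections import deque


def simulate_exec_cycles(statuses, failures_needed=10, window=20, repeat_every=None):
    """Return the 1-based cycle numbers at which the exec action would fire.

    Simplified model (an assumption, not Monit's implementation):
    the rule is 'failed' when at least `failures_needed` of the last
    `window` cycles returned status 3.
    """
    recent = deque(maxlen=window)   # sliding window of per-cycle match flags
    fired = []
    failed = False
    cycles_since_exec = 0
    for cycle, status in enumerate(statuses, start=1):
        recent.append(status == 3)
        now_failed = sum(recent) >= failures_needed
        if now_failed and not failed:
            # Transition into the failed state: run the action once.
            fired.append(cycle)
            cycles_since_exec = 0
        elif now_failed and repeat_every is not None:
            # Still failed: re-run the action every `repeat_every` cycles.
            cycles_since_exec += 1
            if cycles_since_exec >= repeat_every:
                fired.append(cycle)
                cycles_since_exec = 0
        failed = now_failed
    return fired


always_over = [3] * 16  # memory stays above the threshold for 16 cycles
print(simulate_exec_cycles(always_over))                  # [10]
print(simulate_exec_cycles(always_over, repeat_every=2))  # [10, 12, 14, 16]
```

In this model the old rule restarts the container only once while the condition persists, whereas the new rule keeps retrying every 2 cycles, which the comment above equates with 2 minutes.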