Add orchagent heart beat message for watchdog. #2737
orchagent/orchdaemon.cpp
Outdated
@@ -958,6 +963,20 @@ void OrchDaemon::addOrchList(Orch *o)
    m_orchList.push_back(o);
}

void OrchDaemon::heartBeat(std::chrono::time_point<std::chrono::high_resolution_clock> tcurrent)
{
    static auto tlast = std::chrono::high_resolution_clock::now();
Fixed, changed to a static member variable.
Sorry for the misleading comment.
To be extra safe, you can use a static member variable instead of a function-local static variable.
Fixed, changed to a non-static member.
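For readers following the thread, here is a minimal sketch of a throttled heartbeat that keeps its last-emission timestamp in a non-static member, as the final reply describes. It is not the code from this PR: `HeartbeatEmitter`, the interval parameter, and the use of `std::chrono::steady_clock` (instead of the `high_resolution_clock` in the diff) are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for OrchDaemon; only the heartbeat throttling is modeled.
class HeartbeatEmitter
{
public:
    // Assumption: a monotonic clock, so wall-clock adjustments cannot skew intervals.
    using Clock = std::chrono::steady_clock;

    HeartbeatEmitter(Clock::time_point start, std::chrono::milliseconds interval)
        : m_lastHeartBeat(start), m_interval(interval)
    {
        assert(interval.count() > 0);
    }

    // Returns true when a heartbeat is due, and records the emission time.
    bool heartBeat(Clock::time_point tcurrent)
    {
        if (tcurrent - m_lastHeartBeat < m_interval)
        {
            return false; // too soon since the last heartbeat
        }
        m_lastHeartBeat = tcurrent;
        return true;
    }

private:
    Clock::time_point m_lastHeartBeat;    // per-instance state, not shared
    std::chrono::milliseconds m_interval; // minimum gap between heartbeats
};
```

With a function-local `static`, every instance would share one timestamp. With a non-static member, two emitters throttle independently, which also makes the logic easier to unit test.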
orchagent/orchdaemon.cpp
Outdated
@@ -64,6 +68,8 @@ event_handle_t g_events_handle;
#define DEFAULT_MAX_BULK_SIZE 1000
size_t gMaxBulkSize = DEFAULT_MAX_BULK_SIZE;

std::chrono::time_point<std::chrono::high_resolution_clock> OrchDaemon::m_lastHeartBeat = std::chrono::high_resolution_clock::now();
Fixed, it is now initialized in the ctor.
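The reviewer's comment on this hunk is not shown here, so the following only illustrates one observable difference between the two placements, which may or may not be what the reviewer had in mind. A static data member defined at namespace scope gets its timestamp during program start-up (before `main()` on common toolchains), while a member initialized in the constructor gets it when the object is created. `StaticStamp` and `CtorStamp` are hypothetical names, and `steady_clock` is an assumption.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical class: the static member is defined at namespace scope,
// so it is stamped during start-up, typically before main() runs.
struct StaticStamp
{
    static Clock::time_point s_last;
};
Clock::time_point StaticStamp::s_last = Clock::now();

// Hypothetical class: the member is stamped when the object is constructed.
struct CtorStamp
{
    CtorStamp() : m_last(Clock::now()) {}
    Clock::time_point m_last;
};
```

If the object is constructed long after start-up, the first interval measured from the static timestamp would include that start-up delay. The constructor-initialized timestamp does not.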
4af1876 to 7f17fd4
…ave issue. (#14686)
This PR depends on sonic-net/sonic-swss#2737 being merged first.
**What I did** Add an orchagent watchdog to monitor and alert on orchagent stuck issues.
**Why I did it** Currently the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.
**How I verified it** All UTs pass. Added new UT sonic-net/sonic-mgmt#8306 to check that the watchdog works correctly. Manual test: after pausing orchagent with 'kill -STOP <pid>', check that this message appears in the log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
**Details if related** Heartbeat message PR: sonic-net/sonic-swss#2737 UT PR: sonic-net/sonic-mgmt#8306
…ave issue. (#15429)
Add a watchdog mechanism to the swss service and generate an alert when swss has an issue.
**Work item tracking** Microsoft ADO (number only): 16578912
**What I did** Add an orchagent watchdog to monitor and alert on orchagent stuck issues.
**Why I did it** Currently the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.
**How I verified it** All UTs pass. Manually verified that process_monitoring/test_critical_process_monitoring.py passes. Added new UT sonic-net/sonic-mgmt#8306 to check that the watchdog works correctly. Manual test: after pausing orchagent with 'kill -STOP <pid>', check that this message appears in the log:
Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
**Details if related** Heartbeat message PR: sonic-net/sonic-swss#2737 UT PR: sonic-net/sonic-mgmt#8306
What I did
Improve orchagent: output a heartbeat message to systemd.
Why I did it
Currently the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.
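To make the gap concrete: a process-exists check keeps passing while orchagent is paused or stuck, whereas a heartbeat deadline does not. The sketch below shows only the deadline comparison a watchdog could apply. `isStuck` and its parameters are hypothetical and do not come from this PR or from the watchdog PR.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical helper: a process counts as stuck when no heartbeat
// has been seen for longer than the allowed timeout.
inline bool isStuck(Clock::time_point lastHeartBeat,
                    Clock::time_point now,
                    std::chrono::seconds timeout)
{
    return now - lastHeartBeat > timeout;
}
```

With an assumed 60-second timeout, a heartbeat seen 30 seconds ago is healthy and one seen 61 seconds ago is reported as stuck, even though the process still exists in both cases.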
How I verified it
All UTs pass.
Manually validated that the heartbeat message works correctly.
Details if related
Another in-progress PR will add a watchdog for this heartbeat message:
sonic-net/sonic-buildimage#14686
sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306