Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[201911] Fix snmp subagent errors in shutdown path #259

Merged
merged 1 commit into from
May 13, 2022

Conversation

vaibhavhd
Copy link

- What I did
Fix an error raised in snmp container stop path. The issue is seen when SIGTERM is send to SNMP supervisord, and it attepmts to stop snmp-subagent.
There is a bug in subagent's shutdown path where below erros is seen upon receiving SIGTERM:

ERROR: Uncaught exception in sonic_ax_impl.main#012Traceback (most recent call last):#012  
File "/usr/local/lib/python3.6/dist-packages/sonic_ax_impl/main.py", line 72, in main#012
    event_loop.run_until_complete(agent.run_in_event_loop())#012  
File "/usr/lib/python3.6/asyncio/base_events.py", line 466, in run_until_complete#012
    return future.result()#012  
File "/usr/local/lib/python3.6/dist-packages/ax_interface/agent.py", line 49, in run_in_event_loop#012
    await asyncio.wait_for(background_task, BACKGROUND_WAIT_TIMEOUT, loop=self.loop)#012  
File "/usr/lib/python3.6/asyncio/tasks.py", line 352, in wait_for#012
    return fut.result()#012AttributeError: '_asyncio.Task' object has no attribute 'send'

This issue turned out to be very costly in warmboot shutdown path. The SIGTERM sent to subagent never shutsdown the agent as it crashed. So supervisord waits for 10s before sending SIGKILL.

In some instances even kill signal to agent fails as the subagent has crashed, this makes dockerd to wait for 10 extra seconds to send SIGKILL to the container itself.

The 20s wait in the shutdown process is not only costly in terms of time spent, but it also makes dockerd to lock resources causing other docker exec commands to timeout.

- How I did it

Add future instance to tasks list, instead of another event loop task.

- How to verify it

Tested fix on physical testbed:

Without fix:

May 11 18:44:40.924339 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:44:40,923 INFO waiting for snmp-subagent to stop
May 11 18:44:40.924921 str-msn2700-04 INFO snmp#snmp-subagent [sonic_ax_impl] INFO: Recieved 'SIGTERM' signal, shutting down... <<------------ SIGTERM sent
May 11 18:44:40.925837 str-msn2700-04 INFO snmp#snmp-subagent [ax_interface] INFO: AgentX socket connection closed.
May 11 18:44:40.926798 str-msn2700-04 INFO snmp#snmp-subagent [ax_interface] INFO: Run disabled. Connection loop stopping...
May 11 18:44:40.930051 str-msn2700-04 ERR snmp#snmp-subagent [sonic_ax_impl] ERROR: Uncaught exception in sonic_ax_impl.main#012Traceback (most recent call last):#012  File "/usr/local/lib/python3.6/dist-packages/sonic_ax_impl/main.py", line 72, in main#012    event_loop.run_until_complete(agent.run_in_event_loop())#012  File "/usr/lib/python3.6/asyncio/base_events.py", line 466, in run_until_complete#012    return future.result()#012  File "/usr/local/lib/python3.6/dist-packages/ax_interface/agent.py", line 49, in run_in_event_loop#012    await asyncio.wait_for(background_task, BACKGROUND_WAIT_TIMEOUT, loop=self.loop)#012  File "/usr/lib/python3.6/asyncio/tasks.py", line 352, in wait_for#012    return fut.result()#012AttributeError: '_asyncio.Task' object has no attribute 'send'
May 11 18:44:42.924531 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:44:42,923 INFO waiting for snmp-subagent to stop
May 11 18:44:44.924686 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:44:44,924 INFO waiting for snmp-subagent to stop
May 11 18:44:46.925024 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:44:46,924 INFO waiting for snmp-subagent to stop
May 11 18:44:48.924974 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:44:48,924 INFO waiting for snmp-subagent to stop
May 11 18:44:50.925759 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:44:50,923 WARN killing 'snmp-subagent' (60) with SIGKILL <<------------ SIGKILL sent as failed to kill within 10s of SIGTERM
May 11 18:44:50.925946 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:44:50,924 INFO waiting for snmp-subagent to stop
May 11 18:44:50.934331 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:44:50,933 INFO stopped: snmp-subagent (terminated by SIGKILL)

With fix:

May 11 18:47:14.080425 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:47:14,079 INFO waiting for snmp-subagent to stop
May 11 18:47:14.080700 str-msn2700-04 INFO snmp#snmp-subagent [sonic_ax_impl] INFO: Recieved 'SIGTERM' signal, shutting down...
May 11 18:47:14.081162 str-msn2700-04 INFO snmp#snmp-subagent [ax_interface] INFO: AgentX socket connection closed.
May 11 18:47:14.081683 str-msn2700-04 INFO snmp#snmp-subagent [ax_interface] INFO: Run disabled. Connection loop stopping...
May 11 18:47:16.080595 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:47:16,080 INFO waiting for snmp-subagent to stop
May 11 18:47:18.080922 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:47:18,080 INFO waiting for snmp-subagent to stop
May 11 18:47:18.824394 str-msn2700-04 INFO snmp#snmp-subagent [sonic_ax_impl] INFO: Goodbye!
May 11 18:47:18.932172 str-msn2700-04 INFO snmp#supervisord 2022-05-11 18:47:18,931 INFO stopped: snmp-subagent (exit status 0)

- Description for the changelog

@vaibhavhd vaibhavhd requested review from prsunny and yxieca May 11, 2022 19:03
@vaibhavhd
Copy link
Author

This issue is not seen on 202012 as this was already fixed by this PR: #130

@vaibhavhd vaibhavhd merged commit 4c6512b into sonic-net:201911 May 13, 2022
@vaibhavhd vaibhavhd deleted the fix-subagent-crash branch May 13, 2022 20:29
yxieca pushed a commit that referenced this pull request May 13, 2022
Fixed an error raised in snmp container stop path. The issue is seen when SIGTERM is send to SNMP supervisord, and it attepmts to stop snmp-subagent. There is a bug in subagent's shutdown path where below erros is seen upon receiving SIGTERM

This issue turned out to be very costly in warmboot shutdown path. The SIGTERM sent to subagent never shutsdown the agent as it crashed. So supervisord waits for 10s before sending SIGKILL.
In some instances even kill signal to agent fails as the subagent has crashed, this makes dockerd to wait for 10 extra seconds to send SIGKILL to the container itself.
The 20s wait in the shutdown process is not only costly in terms of time spent, but it also makes dockerd to lock resources causing other docker exec commands to timeout.
yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this pull request May 16, 2022
Why I did it
Advance sonic-snmpagent submodule to include:
[201911] Fix snmp subagent errors in shutdown path sonic-net/sonic-snmpagent#259
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants