Orchagent crashed when removing router intf #17204

Closed
ysmanman opened this issue Nov 16, 2023 · 6 comments
Labels: Arista, Triaged

Comments

@ysmanman
Contributor

Description

We observed the following orchagent crash during 202205 sonic-mgmt testing on a voq chassis.

Core was generated by `/usr/bin/orchagent -d /var/log/swss -b 1024 -s -m d4:af:f7:2e:c4:c6'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f2d53dd6ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7f2d5364d9c0 (LWP 140))]
(gdb) bt
#0  0x00007f2d53dd6ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f2d53dc0537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x0000564d10fd5fa0 in handleSaiFailure (abort_on_failure=<optimized out>) at saihelper.cpp:765
#3  0x0000564d111f18e7 in handleSaiRemoveStatus (api=api@entry=SAI_API_ROUTER_INTERFACE, status=1740812944, status@entry=-17, context=context@entry=0x0) at saihelper.cpp:694
#4  0x0000564d110e1103 in IntfsOrch::removeRouterIntfs (this=0x564d12fb6960, port=...) at intfsorch.cpp:1303
#5  0x0000564d110e1485 in IntfsOrch::removeIntf (this=0x564d12fb6960, alias="cmp227-4|asic1|Ethernet144", vrf_id=844424930132020, ip_prefix=<optimized out>) at intfsorch.cpp:624
#6  0x0000564d110e4b50 in IntfsOrch::doTask (this=0x564d12fb6960, consumer=...) at intfsorch.cpp:1093
#7  0x0000564d11077eae in Consumer::drain (this=0x564d130355a0) at orch.cpp:241
#8  Consumer::drain (this=0x564d130355a0) at orch.cpp:238
#9  Consumer::execute (this=0x564d130355a0) at orch.cpp:235
#10 0x0000564d11067f29 in OrchDaemon::start (this=this@entry=0x564d12fc60a0) at orchdaemon.cpp:771
#11 0x0000564d11000155 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:735

We also see the following in syslog, which suggests the crash may be caused by a broken refcnt on the router intf.

Oct 24 04:13:34.384355 cmp235-4 ERR swss1#orchagent: :- removeRouterIntfs: Failed to remove router interface for port cmp235-3|asic1|Ethernet144, rv:-17
Oct 24 04:13:34.387019 cmp235-4 ERR swss1#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_ROUTER_INTERFACE, status: SAI_STATUS_OBJECT_IN_USE

The crash was observed while running pc/test_lag_2.py.

@ysmanman
Contributor Author

Add @arlakshm @kenneth-arista for visibility.

arlakshm added the Triaged and Arista labels on Nov 22, 2023
@arlakshm
Contributor

Not able to reproduce this crash again so far. Still trying to reproduce.

@arlakshm
Contributor

The crash happens on a remote linecard when config reload is done on a linecard.

@ysmanman
Contributor Author

ysmanman commented Feb 2, 2024

It seems that in a voq chassis the refcnt of the router intf is broken. When adding a remote neighbor, addNextHop increases the refcnt of the inband port https://github.com/sonic-net/sonic-swss/blob/97aa546313671c767f252f43caaeb8fe67b93224/orchagent/neighorch.cpp#L226. But when removing the remote neighbor, removeNextHop decreases the refcnt of the actual router intf https://github.com/sonic-net/sonic-swss/blob/97aa546313671c767f252f43caaeb8fe67b93224/orchagent/neighorch.cpp#L532, not the inband port. So addNextHop and removeNextHop adjust the refcnt of two different intfs, which results in the router intf being removed prematurely.
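
To make the mismatch concrete, here is a minimal Python model of the bookkeeping described in this comment. It is only a sketch, not the orchagent code: the function names, the `symmetric` switch, the inband port alias `Ethernet-IB0`, and the starting count of 1 on the remote intf (standing for some other object that still uses it) are all assumptions made for illustration.

```python
from collections import Counter

INBAND_PORT = "Ethernet-IB0"                # hypothetical inband port alias
REMOTE_INTF = "cmp227-4|asic1|Ethernet144"  # alias seen in the backtrace


def add_next_hop(refcnt, intf):
    # As described in this comment: for a remote neighbor the reference
    # is taken on the inband port, not on `intf` itself.
    refcnt[INBAND_PORT] += 1


def remove_next_hop(refcnt, intf, symmetric):
    # symmetric=False models the reported behaviour: the reference is
    # dropped on the actual router intf.
    # symmetric=True drops it on the port where it was taken.
    target = INBAND_PORT if symmetric else intf
    refcnt[target] -= 1


def simulate(symmetric):
    # Assumption: one other object still references the remote router intf.
    refcnt = Counter({INBAND_PORT: 0, REMOTE_INTF: 1})
    add_next_hop(refcnt, REMOTE_INTF)
    remove_next_hop(refcnt, REMOTE_INTF, symmetric)
    return dict(refcnt)


print(simulate(symmetric=False))
print(simulate(symmetric=True))
```

In the mismatched case the inband port keeps a leaked count of 1 and the remote intf reads 0 even though the assumed other user still exists. A removal attempted at that point would hit an object that is still in use, which would be consistent with the SAI_STATUS_OBJECT_IN_USE in the syslog above. The symmetric variant is only one way to balance the counts and is not necessarily how the actual fix works.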

@gechiang
Collaborator

gechiang commented Feb 6, 2024

We think this PR fixes this issue: sonic-net/sonic-swss#3042
@ysmanman can you help verify with an image that includes the above PR?
Thanks!

@ysmanman
Contributor Author

ysmanman commented Apr 3, 2024

We no longer see the issue with sonic-net/sonic-swss#3042 applied.
