
Some interfaces take a long time to come up after reboot #135

Open
yue-fred-gao opened this issue Nov 8, 2024 · 8 comments

Comments

@yue-fred-gao
Contributor

After reboot, sometimes some interfaces take a long time (6-7 minutes) to come up, while the other interfaces come up quite quickly. This causes some sonic-mgmt tests to fail due to timeouts while waiting for the interfaces to come up.

@linqingxuan
Contributor

I've basically reproduced the issue, but since I've made a lot of changes to the code locally, I'm not sure if it's caused by my modifications. I'm recompiling the image now.

@linqingxuan
Contributor

It looks like this is more of a SONiC problem. syncd actually reports that the port is up, but SONiC needs to wait for the allPortsReady() signal from the kernel interface. Since portOrch is orch1 instead of orch2, the retry mechanism cannot take effect, which leads to this problem. It would be better to change this on the SONiC side, so I will study how to do that. If that doesn't work out, it may be necessary to slightly change some libsai mechanisms and wait at the lower layer.

[image attachment]

@yue-fred-gao
Contributor Author

Thanks for looking into this. I took some time to read the code and here is my understanding. Please correct me if there is any mistake.

  1. portsyncd reads the port table from config_db and adds the ports to appl_db. Once it is done, it creates the entry "port_config_done" in the appl_db/port table (https://github.com/sonic-net/sonic-swss/blob/master/portsyncd/portsyncd.cpp#L213).
  2. portsorch receives notifications from appl_db/port. Until it receives port_config_done, it puts the port notifications back into the task map; after it receives port_config_done, it starts processing the appl_db/port notifications (https://github.com/sonic-net/sonic-swss/blob/master/orchagent/portsorch.cpp#L3954). (See the sketch after this list.)
  3. portsorch calls initPort (https://github.com/sonic-net/sonic-swss/blob/master/orchagent/portsorch.cpp#L4014), which calls create_hostif in syncd through ASIC_DB (https://github.com/sonic-net/sonic-swss/blob/master/orchagent/portsorch.cpp#L6035). After that, it sets the port speed, autoneg, admin status, etc.
  4. syncd receives the request and calls SAI to create the hostif.
  5. vpp_create_hostif_tap_interface is called in the vpp SAI implementation. It calls configure_lcp_interface so that linux-cp creates the tap interface.
  6. portsyncd listens to netlink events. After it receives RTM_NEWLINK for all ports, it sends PortInitDone to portsorch: https://github.com/sonic-net/sonic-swss/blob/master/portsyncd/portsyncd.cpp#L123
  7. portsorch receives the above signal and should declare "allPortsReady". A port is removed from m_pendingPortSet after portsorch calls initPort. I feel this is not directly related to the interface oper state.
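
To make the gating in steps 1-2 concrete, here is a rough Python sketch of the pattern as I understand it (the real logic is C++ in portsorch.cpp; the function names below are illustrative, not the actual sonic-swss code):

```python
# Illustrative sketch only, not the sonic-swss implementation: port updates
# from APPL_DB are deferred (put back into the task queue) until the
# "port_config_done" marker written by portsyncd has been seen.
from collections import deque

def do_port_task(task_queue: deque, config_done_seen: bool) -> bool:
    deferred = deque()
    while task_queue:
        key, fields = task_queue.popleft()
        if key == "port_config_done":
            config_done_seen = True
            continue
        if not config_done_seen:
            # Not ready yet: push the notification back and retry next round.
            deferred.append((key, fields))
            continue
        init_port(key, fields)  # hypothetical stand-in for portsorch initPort()
    task_queue.extend(deferred)
    return config_done_seen

def init_port(key, fields):
    print(f"initPort {key}: {fields}")
```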

sonic-vpp sends interface oper state notifications in response to interface events from vpp (https://github.com/sonic-net/sonic-platform-vpp/blob/main/saivpp/src/SwitchStateBaseRif.cpp#L497). When we see an interface staying down, can we check its state in the vpp data plane, the SONiC ASIC DB, and STATE_DB? That might help find the root cause.
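
For the SONiC side, something along these lines could dump the per-port view from redis for comparison with what vpp itself reports (a rough sketch using plain redis-py; the DB ids APPL_DB=0 / ASIC_DB=1 / STATE_DB=6 and the key layouts are assumptions based on the usual SONiC conventions, so adjust as needed):

```python
import redis

# Rough sketch: dump per-port status from APPL_DB/STATE_DB and count the
# hostif objects in ASIC_DB, to compare against the vpp data plane view.
# Assumes the default redis instance on localhost:6379 and the usual SONiC
# DB ids (APPL_DB=0, ASIC_DB=1, STATE_DB=6); verify these on your image.

def dump_port_status():
    appl_db = redis.Redis(port=6379, db=0, decode_responses=True)
    asic_db = redis.Redis(port=6379, db=1, decode_responses=True)
    state_db = redis.Redis(port=6379, db=6, decode_responses=True)

    hostifs = list(asic_db.scan_iter("ASIC_STATE:SAI_OBJECT_TYPE_HOSTIF:*"))
    print(f"ASIC_DB hostif objects: {len(hostifs)}")

    for key in sorted(appl_db.scan_iter("PORT_TABLE:Ethernet*")):
        port = key.split(":", 1)[1]
        appl = appl_db.hgetall(key)
        state = state_db.hgetall(f"PORT_TABLE|{port}")
        print(port,
              "admin:", appl.get("admin_status", "-"),
              "oper:", appl.get("oper_status", "-"),
              "state_db:", state or "<no entry>")

if __name__ == "__main__":
    dump_port_status()
```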

@linqingxuan
Contributor

I totally agree with you

@yue-fred-gao
Contributor Author

yue-fred-gao commented Feb 19, 2025

What I have observed is that orchagent is slow to process the port-state-up notifications from syncd. From the experiment below, we can see port status up notifications for all 32 ports in sairedis.rec around 19:51:05-19:51:09. portsorch started to process them about 4 minutes later. It seems they are scheduled by some timer, because it processed one or more notifications every 22-23 seconds. During this time the orchagent process was mostly in the sleep state, so it doesn't look like the CPU was busy.

2025-02-19.19:51:05.045357|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000030","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2025-02-19.19:51:05.054938|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000048","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
…
2025-02-19.19:51:09.612067|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000046","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2025-02-19.19:51:09.612076|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000047","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2025-02-19.19:51:09.612086|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000049","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
 
2025 Feb 19 19:55:16.815651 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000030 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.715720 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000048 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.734098 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000003e status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.750368 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000003f status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.792168 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000041 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.836291 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000040 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.880429 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000033 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:56:20.011661 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000043 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:56:20.015587 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000042 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:56:20.019001 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000035 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:56:42.200716 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000034 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:57:05.025276 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000002a status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:57:27.638853 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000002b status:1 oper_error_status:0x54e963e9

For a reason that I can't explain, this seems related to the vpp thread configuration. Currently there is only one core dedicated to vpp. I added more cores and the issue above seems to be gone. I tried a few times; all interfaces came up in around 5-6 minutes.
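
As a side note, the gap above can be measured per port by pairing the two logs. Here is a rough Python sketch; the file paths, timestamp formats, and the assumption that the two logs use comparable clocks are taken from the snippets above, so treat them as assumptions:

```python
# Rough sketch: for each port OID, measure the delay between the
# port_state_change notification in sairedis.rec and the matching
# "Get port state change notification" line in syslog.
# Paths and timestamp formats are assumptions based on the snippets above;
# note that the two logs may also use different timezones.
import re
from datetime import datetime

SAIREDIS_RE = re.compile(r'^([0-9.:-]+)\|n\|port_state_change\|.*"port_id":"oid:0x([0-9a-f]+)"')
SYSLOG_RE = re.compile(r'^(\d{4} \w{3} +\d+ [\d:.]+) .* Get port state change notification id:([0-9a-f]+)')

def first_timestamps(path, regex, fmt):
    seen = {}
    with open(path) as f:
        for line in f:
            m = regex.match(line)
            if m and m.group(2) not in seen:
                seen[m.group(2)] = datetime.strptime(m.group(1), fmt)
    return seen

if __name__ == "__main__":
    sent = first_timestamps("/var/log/swss/sairedis.rec", SAIREDIS_RE, "%Y-%m-%d.%H:%M:%S.%f")
    done = first_timestamps("/var/log/syslog", SYSLOG_RE, "%Y %b %d %H:%M:%S.%f")
    for oid, t_sent in sorted(sent.items()):
        if oid in done:
            print(f"oid:0x{oid}: handled {(done[oid] - t_sent).total_seconds():.1f}s after notification")
```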

@linqingxuan
Contributor

This is different from what I observed. Could it be related to the performance of the physical machine? My current vpp machine has a 32-core CPU, and the vpp worker threads have 30 cores. Maybe that is why I didn't reproduce the problem when I wrote the script at the beginning? Hahaha

@linqingxuan
Contributor

This phenomenon can only mean that Orch is blocked, because Orch is single-threaded. If possible, can you enter the swss container and then use gdb -p <orchagent pid> to look at the backtrace?
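
Something like the following, run inside the swss container, should capture all thread backtraces non-interactively (this is only a sketch; it assumes gdb is available in the container, which may require installing it first):

```python
# Rough sketch: attach gdb to orchagent non-interactively and print all
# thread backtraces. Assumes it runs inside the swss container with gdb
# installed there.
import subprocess

def dump_orchagent_backtraces() -> str:
    pid = subprocess.check_output(["pidof", "orchagent"], text=True).split()[0]
    return subprocess.check_output(
        ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt"],
        text=True, stderr=subprocess.STDOUT)

if __name__ == "__main__":
    print(dump_orchagent_backtraces())
```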

@yue-fred-gao
Contributor Author

> This is different from what I observed. Could it be related to the performance of the physical machine? My current vpp machine has a 32-core CPU, and the vpp worker threads have 30 cores. Maybe that is why I didn't reproduce the problem when I wrote the script at the beginning? Hahaha

I see. We started the VM with only 4 cores; vpp only got 1.
