
Some interfaces take a long time to come up after reboot #135

Open
yue-fred-gao opened this issue Nov 8, 2024 · 8 comments

Comments

@yue-fred-gao
Contributor

After reboot, sometimes some interfaces take a long time (6-7 minutes) to come up, while the other interfaces come up quite quickly. This causes some sonic-mgmt tests to fail due to timeouts while waiting for the interfaces to come up.

@linqingxuan
Contributor

I've basically reproduced the issue, but since I've made a lot of changes to the code locally, I'm not sure if it's caused by my modifications. I'm recompiling the image now.

@linqingxuan
Contributor

It looks like this is more of a SONiC problem. syncd actually reports that the port is up, but SONiC needs to wait for the allPortsReady() signal from the kernel interface. Since portOrch is orch1 instead of orch2, the retry mechanism cannot take effect, which leads to this problem. It would be better to change this on the SONiC side, so I will study how to do that. If that doesn't work out, it may be necessary to slightly change some libsai mechanisms and wait at the lower layer.

[image attachment]

@yue-fred-gao
Contributor Author

Thanks for looking into this. I took some time to read the code and here is my understanding. Please correct me if there is any mistake.

  1. portsyncd reads the port table from config_db and adds the ports to appl_db. Once it is done, it creates the entry "port_config_done" in the appl_db/port table (https://github.com/sonic-net/sonic-swss/blob/master/portsyncd/portsyncd.cpp#L213).
  2. portsorch receives notifications from appl_db/port. Until it receives port_config_done, it puts the port notifications back into the task map; after it receives port_config_done, it starts processing the appl_db/port notifications (https://github.com/sonic-net/sonic-swss/blob/master/orchagent/portsorch.cpp#L3954). (See the sketch after this list.)
  3. portsorch calls initPort (https://github.com/sonic-net/sonic-swss/blob/master/orchagent/portsorch.cpp#L4014), which calls create_hostif in syncd through ASIC_DB (https://github.com/sonic-net/sonic-swss/blob/master/orchagent/portsorch.cpp#L6035). After that, it sets the port speed, autoneg, admin status, etc.
  4. syncd receives the request and calls SAI to create the hostif.
  5. vpp_create_hostif_tap_interface is called in the vpp SAI implementation. It calls configure_lcp_interface so that linux-cp creates the tap interface.
  6. portsyncd listens to netlink events. After it receives RTM_NEWLINK for all ports, it sends PortInitDone to portsorch: https://github.com/sonic-net/sonic-swss/blob/master/portsyncd/portsyncd.cpp#L123
  7. portsorch receives the above signal and should declare "allPortsReady". A port is removed from m_pendingPortSet after portsorch calls initPort. I feel this is not directly related to the interface oper state.
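
To make the gating in steps 1-2 concrete, here is a rough Python sketch of the pattern as I understand it (the real logic is C++ in portsorch.cpp; the function names below are illustrative, not the actual sonic-swss code):

```python
# Illustrative sketch only, not the sonic-swss implementation: port updates
# from APPL_DB are deferred (put back into the task queue) until the
# "port_config_done" marker written by portsyncd has been seen.
from collections import deque

def do_port_task(task_queue: deque, config_done_seen: bool) -> bool:
    deferred = deque()
    while task_queue:
        key, fields = task_queue.popleft()
        if key == "port_config_done":
            config_done_seen = True
            continue
        if not config_done_seen:
            # Not ready yet: push the notification back and retry next round.
            deferred.append((key, fields))
            continue
        init_port(key, fields)  # hypothetical stand-in for portsorch initPort()
    task_queue.extend(deferred)
    return config_done_seen

def init_port(key, fields):
    print(f"initPort {key}: {fields}")
```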

sonic-vpp sends interface oper state notifications in response to interface events from vpp (https://github.com/sonic-net/sonic-platform-vpp/blob/main/saivpp/src/SwitchStateBaseRif.cpp#L497). When we see an interface staying down, can we check its state in the vpp data plane, the SONiC ASIC DB, and STATE_DB? That might help find the root cause.
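
For the SONiC side, something along these lines could dump the per-port view from redis for comparison with what vpp itself reports (a rough sketch using plain redis-py; the DB ids APPL_DB=0 / ASIC_DB=1 / STATE_DB=6 and the key layouts are assumptions based on the usual SONiC conventions, so adjust as needed):

```python
import redis

# Rough sketch: dump per-port status from APPL_DB/STATE_DB and count the
# hostif objects in ASIC_DB, to compare against the vpp data plane view.
# Assumes the default redis instance on localhost:6379 and the usual SONiC
# DB ids (APPL_DB=0, ASIC_DB=1, STATE_DB=6); verify these on your image.

def dump_port_status():
    appl_db = redis.Redis(port=6379, db=0, decode_responses=True)
    asic_db = redis.Redis(port=6379, db=1, decode_responses=True)
    state_db = redis.Redis(port=6379, db=6, decode_responses=True)

    hostifs = list(asic_db.scan_iter("ASIC_STATE:SAI_OBJECT_TYPE_HOSTIF:*"))
    print(f"ASIC_DB hostif objects: {len(hostifs)}")

    for key in sorted(appl_db.scan_iter("PORT_TABLE:Ethernet*")):
        port = key.split(":", 1)[1]
        appl = appl_db.hgetall(key)
        state = state_db.hgetall(f"PORT_TABLE|{port}")
        print(port,
              "admin:", appl.get("admin_status", "-"),
              "oper:", appl.get("oper_status", "-"),
              "state_db:", state or "<no entry>")

if __name__ == "__main__":
    dump_port_status()
```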

@linqingxuan
Contributor

I totally agree with you

@yue-fred-gao
Contributor Author

yue-fred-gao commented Feb 19, 2025

What I have observed is that orchagent is slow to process the port-state-up notifications from syncd. From the experiment below, we can see port status up notifications for all 32 ports in sairedis.rec around 19:51:05-19:51:09. portsorch started to process them about 4 minutes later. It seems they are scheduled by some timer, because it processed one or more notifications every 22-23 seconds. During this time the orchagent process was mostly in the sleep state, so it doesn't look like the CPU was busy.

2025-02-19.19:51:05.045357|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000030","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2025-02-19.19:51:05.054938|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000048","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
…
2025-02-19.19:51:09.612067|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000046","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2025-02-19.19:51:09.612076|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000047","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
2025-02-19.19:51:09.612086|n|port_state_change|[{"port_error_status":"1424581609","port_id":"oid:0x1000000000049","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
 
2025 Feb 19 19:55:16.815651 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000030 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.715720 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000048 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.734098 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000003e status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.750368 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000003f status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.792168 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000041 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.836291 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000040 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:55:55.880429 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000033 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:56:20.011661 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000043 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:56:20.015587 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000042 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:56:20.019001 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000035 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:56:42.200716 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:1000000000034 status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:57:05.025276 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000002a status:1 oper_error_status:0x54e963e9
2025 Feb 19 19:57:27.638853 churchill-mono-01 NOTICE swss#orchagent: :- doTask: Get port state change notification id:100000000002b status:1 oper_error_status:0x54e963e9

For a reason that I can't explain, this seems related to the vpp thread configuration. Currently there is only one core dedicated to vpp. I added more cores and the issue above seems to be gone. I tried a few times; all interfaces came up in around 5-6 minutes.
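
As a side note, the gap above can be measured per port by pairing the two logs. Here is a rough Python sketch; the file paths, timestamp formats, and the assumption that the two logs use comparable clocks are taken from the snippets above, so treat them as assumptions:

```python
# Rough sketch: for each port OID, measure the delay between the
# port_state_change notification in sairedis.rec and the matching
# "Get port state change notification" line in syslog.
# Paths and timestamp formats are assumptions based on the snippets above;
# note that the two logs may also use different timezones.
import re
from datetime import datetime

SAIREDIS_RE = re.compile(r'^([0-9.:-]+)\|n\|port_state_change\|.*"port_id":"oid:0x([0-9a-f]+)"')
SYSLOG_RE = re.compile(r'^(\d{4} \w{3} +\d+ [\d:.]+) .* Get port state change notification id:([0-9a-f]+)')

def first_timestamps(path, regex, fmt):
    seen = {}
    with open(path) as f:
        for line in f:
            m = regex.match(line)
            if m and m.group(2) not in seen:
                seen[m.group(2)] = datetime.strptime(m.group(1), fmt)
    return seen

if __name__ == "__main__":
    sent = first_timestamps("/var/log/swss/sairedis.rec", SAIREDIS_RE, "%Y-%m-%d.%H:%M:%S.%f")
    done = first_timestamps("/var/log/syslog", SYSLOG_RE, "%Y %b %d %H:%M:%S.%f")
    for oid, t_sent in sorted(sent.items()):
        if oid in done:
            print(f"oid:0x{oid}: handled {(done[oid] - t_sent).total_seconds():.1f}s after notification")
```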

@linqingxuan
Contributor

This is different from what I observed. Could it be related to the performance of the physical machine? My current vpp machine has a 32-core CPU, and the vpp worker threads have 30 cores. Maybe that is why I didn't reproduce the problem when I wrote the script at the beginning? Hahaha

@linqingxuan
Contributor

This phenomenon can only mean that Orch is blocked, because Orch is single-threaded. If possible, can you enter the swss container and then use gdb -p <orchagent pid> to look at the backtrace?
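
Something like the following, run inside the swss container, should capture all thread backtraces non-interactively (this is only a sketch; it assumes gdb is available in the container, which may require installing it first):

```python
# Rough sketch: attach gdb to orchagent non-interactively and print all
# thread backtraces. Assumes it runs inside the swss container with gdb
# installed there.
import subprocess

def dump_orchagent_backtraces() -> str:
    pid = subprocess.check_output(["pidof", "orchagent"], text=True).split()[0]
    return subprocess.check_output(
        ["gdb", "-p", pid, "-batch", "-ex", "thread apply all bt"],
        text=True, stderr=subprocess.STDOUT)

if __name__ == "__main__":
    print(dump_orchagent_backtraces())
```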

@yue-fred-gao
Contributor Author

> This is different from what I observed. Could it be related to the performance of the physical machine? My current vpp machine has a 32-core CPU, and the vpp worker threads have 30 cores. Maybe that is why I didn't reproduce the problem when I wrote the script at the beginning? Hahaha

I see. We started the VM with only 4 cores; vpp only got 1.
