Some interfaces take a long time to come up after reboot #135
Comments
I've basically reproduced the issue, but since I've made a lot of changes to the code locally, I'm not sure if it's caused by my modifications. I'm recompiling the image now.
This looks more like a SONiC problem. syncd does report that the port is up, but SONiC has to wait for the allPortsReady() signal from the kernel interface. Since portOrch is orch1 instead of orch2, the retry mechanism cannot take effect, which leads to this problem. It would be better to fix this on the SONiC side, so I will look into how to do that. If that does not work out, it may be necessary to slightly change some libsai mechanisms and wait at the lower level.
Thanks for looking into this. I took some time to read the code and here is my understanding. Please correct me if there is any mistake.
sonic-vpp sends the interface oper state notification in response to interface events from vpp (https://github.com/sonic-net/sonic-platform-vpp/blob/main/saivpp/src/SwitchStateBaseRif.cpp#L497). When an interface stays down, can we check its state in the vpp data plane, the SONiC ASIC DB, and the STATE DB? That might help find the root cause.
I totally agree with you
What I have observed is that orchagent is slow to process the port-up notifications from syncd. In the experiment below, the port status up notifications for all 32 ports appear in sairedis.rec around 19:51:05-19:51:09, but portsorch only started to process them about 4 minutes later. It seems to be scheduled by some timer, because every 22-23 seconds it processed one or more notifications. During this time, the orchagent process was mostly in the sleep state, so it doesn't look like the CPU was busy.
For a reason that I can't explain, this seems related to the vpp thread configuration. Currently there is only one core dedicated to vpp. After I added more cores, the issue seemed to go away. I tried a few times. All interfaces came up in around 5-6 minutes.
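To check whether the processing really follows a fixed 22-23 second cadence, the gaps between consecutive log entries can be computed with a small script. This is only a minimal sketch: the `timestamp|message` line layout, the `TS_FORMAT` string, and the sample lines are all assumptions made up for illustration, not actual output from this experiment, so they would need adjusting to whatever log is being inspected.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "2024-01-01.19:55:10.000000".
# Adjust to match the log file actually being analyzed.
TS_FORMAT = "%Y-%m-%d.%H:%M:%S.%f"


def parse_ts(line):
    """Parse the timestamp found in the first '|'-separated field."""
    return datetime.strptime(line.split("|", 1)[0], TS_FORMAT)


def gaps_seconds(lines, marker):
    """Return seconds between consecutive lines that contain `marker`."""
    stamps = [parse_ts(line) for line in lines if marker in line]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Hypothetical sample data, invented for this example only.
sample = [
    "2024-01-01.19:55:10.000000|processed port up Ethernet0",
    "2024-01-01.19:55:20.000000|unrelated message",
    "2024-01-01.19:55:32.500000|processed port up Ethernet4",
    "2024-01-01.19:55:55.000000|processed port up Ethernet8",
]

gaps = gaps_seconds(sample, "processed port up")
print(gaps)
```

If the gaps cluster tightly around one value, that would point to a timer-driven wakeup rather than a busy or blocked thread.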
This is different from what I observed. Could it be related to the performance of the physical machine? My current vpp machine has a 32-core CPU, and the vpp thread has 30 cores. So this may be the reason why I didn't reproduce the problem when I wrote the script at the beginning? Hahaha
This phenomenon can only mean that Orch is blocked, because Orch is single-threaded. If possible, can you enter the swss container and attach to the orchagent process with gdb -p <orchagent pid> to look at the backtrace?
I see. We started the vm with only 4 cores. vpp only got 1.
After reboot, some interfaces sometimes take a long time (6-7 minutes) to come up, while the other interfaces come up quite quickly. This causes some sonic-mgmt tests to fail due to a timeout while waiting for the interfaces to come up.