Port status not reflected in SONiC #4646
Comments
In 201911 branch, issue is not seen. |
@ciju-juniper, it is still not quite clear what the issue is. Please state more clearly which behavior is unexpected, and clean up the formatting. Thanks. |
@rlhui When the cables are connected/removed in the switch ports, the link status is not updated in the kernel. The files '/sys/class/net/Ethernet*/carrier' are not updated whenever the OIR is done. Due to this problem, 'show interfaces status' is not reflecting the link status. |
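The carrier files mentioned above can be checked across all front-panel ports in one pass. The following is a minimal sketch, not part of SONiC: the function name `read_carrier_states` and its parameters are hypothetical, and it only assumes the standard Linux sysfs layout `/sys/class/net/<interface>/carrier`.

```python
from pathlib import Path


def read_carrier_states(sysfs_net="/sys/class/net", prefix="Ethernet"):
    """Return {interface name: carrier value} for interfaces starting with prefix.

    sysfs_net is a parameter so the function can be pointed at a test directory.
    """
    states = {}
    for iface in sorted(Path(sysfs_net).glob(prefix + "*")):
        try:
            # "1" means the kernel sees carrier, "0" means no carrier.
            states[iface.name] = (iface / "carrier").read_text().strip()
        except OSError:
            # The read fails when the interface is administratively down
            # or when the carrier file is missing.
            states[iface.name] = "unreadable"
    return states


if __name__ == "__main__":
    for name, carrier in read_carrier_states().items():
        print(name, carrier)
```

Running this before and after a cable move and comparing the two results shows whether the kernel-side state changed at all, independent of what 'show interfaces status' reports.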
@ciju-juniper, can you provide the dump file captured after Trial 2 on the bad image? I need to look at the sairedis log to see whether a port status notification is generated. Your current dump appears to be from Trial 1: as you can see, it only has an up notification. When you unplug Ethernet4, we should receive a port down notification from SAI, which I cannot find in the sairedis log.
|
@lguohan The dump was captured after Trial-2. As you can see '/sys/class/net/Ethernet4/carrier' still reports '1' even after the cable pull out. We can try recreating the problem and capture the dump again. Needs a day or two as we don't have physical access to the box these days. Could you tell me which driver in the kernel find out the link status change and update the '/sys/class/net/Ethernet4/carrier'? |
The mismatch here is that the broadcom.ps file says ce12 and ce13 are up. In Trial 2, I think ce3 should be up, according to your good trial log. So it seems that after you swapped the cable, ce3 is not up in the broadcom.ps file. That is why I am not sure you have provided the right capture after Trial 2.
Ethernet4 is handled by the Broadcom knet driver. When a link goes down, SAI sends a notification to orchagent, and orchagent calls another SAI API to set the carrier down. From your log, I do not see SAI sending the notification. |
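Since the diagnosis hinges on whether SAI emitted a port state notification, it can help to extract just those entries from sairedis.rec. The sketch below is illustrative only: the helper name `port_state_notifications` is hypothetical, and the line layout it matches (a `port_state_change` entry carrying `port_id` and `port_state` fields) is an assumption about the recording format that this thread does not confirm.

```python
import re

# Assumed shape of a notification entry in sairedis.rec (an assumption, not
# confirmed by this thread):
# <timestamp>|n|port_state_change|[{"port_id":"oid:0x...","port_state":"SAI_PORT_OPER_STATUS_UP"}]|
PORT_STATE_RE = re.compile(
    r'"port_id":"(oid:0x[0-9a-fA-F]+)","port_state":"(SAI_PORT_OPER_STATUS_[A-Z_]+)"'
)


def port_state_notifications(lines):
    """Return (timestamp, port oid, oper status) for each port state notification."""
    events = []
    for line in lines:
        if "port_state_change" not in line:
            continue
        # The timestamp is taken to be everything before the first '|'.
        timestamp = line.split("|", 1)[0]
        for oid, state in PORT_STATE_RE.findall(line):
            events.append((timestamp, oid, state))
    return events


if __name__ == "__main__":
    with open("/var/log/swss/sairedis.rec") as rec:
        for event in port_state_notifications(rec):
            print(*event)
```

If a cable pull produces no new entry for the affected port, the notification was never delivered by SAI, which is the condition described above.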
@lguohan Please see the attached test logs and the dump from the bad image. |
@lguohan It looks like there are no SAI notifications after Trial-2
|
So it looks like a Broadcom SAI issue. If you bring down the port on the other side, will SAI generate the notification? |
@lguohan We had tested by connecting DAC cables in loopback within the same box. Are you asking to do an 'ifconfig Ethernet0 down' in this loopback setup? Or connect the cables between two boxes and check? |
Basically, I am asking you to simulate the port oper status going down and check whether SAI sends a notification to the upper layer. |
We tried to simulate the port oper status going down by executing the command below to bring port 1 down. We are seeing the issue in the latest SONiC image. Attached is the jenkins290_logs file, which has the logs captured for the latest SONiC Jenkins 290 image on the box. In the attached log, the issue is seen after Trial 5: SAI stops sending notifications to the upper layer after a few iterations. We also loaded the "201911" release image on the box and repeated the same exercise. We don't see the issue with this image: SAI sends notifications to the upper layer, and we did many iterations of simulating link status up/down without observing the issue. |
@lguohan What could be the next steps to debug this issue further? |
@lguohan How do we take this issue forward? Could we talk to BRCM to see what is going wrong with SAI? |
@rlhui I see a mail from you regarding SAI 1.6.2 release. Would you know if this problem is fixed in that? |
@smaheshm I see your commit for SAI 1.6.1. Are you aware of this issue? |
No, not aware of this issue. Let me check the status of 1.6.2 release. Will open a case with BRCM if required. |
@smaheshm We verified this issue with SAI 1.6.1. Problem is still seen. Please open a case with BRCM. This is very easy to re-create and very basic. |
@ciju-juniper Stay tuned for updated BRCM-SAI debian package. We have to check if the issue persists in the updated debian package and then open a case. |
@ciju-juniper, would you please confirm/clarify if this issue is seen ONLY with cables connecting ports within the same device? Have you tried the same step, but with cables connecting ports of two separate boxes? Thanks. |
@rlhui We had tried with a different switch at the other end. Issue is seen when the remote is down or the cable is pulled out from the other end. Port status is not changed in SONiC and no SAI events detected. |
@ciju-juniper I have tried to reproduce this issue using the master.320 image and flapped the link over 30 times, but I was not able to hit the issue you reported. May I ask for your help with the following:
Please perform the above and collect the info that I requested so that I can see if I can reproduce it and work with BRCM SAI team to solve this issue. |
@gechiang Problem is seen on master.320 image. We had tried to recreate the issue with different boxes and different cables. Same behaviour. Our test is very simple. Initially 100G DAC cable was connected to Ethernet8 & Ethernet12.
After that cable was moved to Ethernet16. Now it's connected between Ethernet8 & Ethernet16.
There are no SAI event messages for port down and up events. 'show interface status' still shows ports Ethernet8 & Ethernet12 are up, even though there is no cable connections. We can see that ports (Ethernet8 & Ethernet16) are up in BCM shell
|
@gechiang Here is the output you asked for with the master.320 image. This is from a TH2 based platform
1.b). Issue the cmd "bsv" to collect the SAI version output.
1.c). Issue the cmd "show unit" to collect the ASIC version output.
1.d). Issue the cmd "ver" to collect the BRCM SDK version output.
2). Please try using another cable other than the one that you were able to reproduce the issue and see if issue still persists. |
@gechiang Here is the output you asked for with the 201911 based image. This is from a TH2 based platform
1.b). Issue the cmd "bsv" to collect the SAI version output.
1.c). Issue the cmd "show unit" to collect the ASIC version output.
1.d). Issue the cmd "ver" to collect the BRCM SDK version output.
2). Please try using another cable other than the one that you were able to reproduce the issue and see if issue still persists. |
@gechiang Here is the output you asked for with the master.320 image. This is from a TH1 based platform. The issue is seen on the TH2 platform as well.
1.b). Issue the cmd "bsv" to collect the SAI version output.
1.c). Issue the cmd "show unit" to collect the ASIC version output.
1.d). Issue the cmd "ver" to collect the BRCM SDK version output.
|
@ciju-juniper Thank you for providing the detail version information. I was able to reproduce the same issue with our lab DUT that uses the TH1 (BCM56960_B1 ) chip. I have gathered the necessary information and have contacted BRCM to investigate this issue. Will update this thread when I have more information. |
@BaluAlluru Here is my analysis of your TH1 test result, based on sonic_dump_sonic_20200707_183916.tar.gz:
Jul 7 18:24:44.235976 sonic INFO pmon#supervisord: xcvrd Process Process-1:
One of the xcvrd threads went defunct (crashed) right after you unplugged the cable from port 1 (Ethernet4) to move that connection to port 2 (Ethernet8).
root 4114 0.2 0.1 152960 21704 pts/0 Sl 18:22 0:02 /usr/bin/python /usr/bin/xcvrd
After this occurred, all subsequent link bounces showed the problem: no more port state notifications from SAI to the SONiC application. I am not sure whether this xcvrd/platform driver issue somehow contributed to SAI not behaving properly. Also, in the syslog I see lots of "Got sfp removed events". It seems that the platform driver may be glitching and generating many SFP removed events where there was perhaps just one unplug/plug, from port 1 to port 2. There is also another event in the syslog that shows some issue: |
@BaluAlluru Thanks for trying the shut/startup experiment on Ethernet4. According to SAI dump: But according to sairedis rec the cmd was given to SAI: But for some reason SAI did not handle this ADMIN UP request properly, which resulted in what you observed. I also took a look at the other trial, where you indicated it worked fine, and it seems that the admin shut/up requests were all handled by SAI correctly, so there was no issue:
cat sairedis.rec | grep -i oid:0x100000000000f
cat sairedis.rec | grep -i "oid:0x100000000000e"
I agree with you that the result is inconclusive because of the one failure you observed. But it does appear that things behave better when the cable is not moved manually than with cable unplug/plug handling. In any case, I have already opened a CSP with BRCM and provided the dumps for them to investigate. Let's see what they find. In case they want to try something out, and since I am not able to reproduce it locally, I may have to come back to you to get that for them. |
@gechiang Thanks! We will investigate the xcvrd issue. It looks like some recent changes went into xcvrd. After that we will experiment with the TH1 platform again. Please let us know if Broadcom requests any additional data from the TH2 platform. @lguohan @gechiang There is an orchagent crash in the latest images #4907. There is a bcm_knet driver crash and BCM ASIC initialization has failed. It looks like a problem with bcm/SAI. |
@ciju-juniper @BaluAlluru BRCM has requested another repro with "link+ " enabled to capture more debug information so that they can debug this issue. Can you please use the master.345 image (the latest of the master branch) and, once your setup boots up, go to the BCM shell and issue the following cmd: drivshell>debug link + Ctrl-c After this, repeat your steps of moving the cable to reproduce the issue. BTW, I really suspect that the platform driver that caused the xcvrd issue may have somehow degraded the SAI behavior, so I would appreciate it if you could work from that side to see whether any progress can be made. You do not have this issue when running the 20191130 based image. Is your platform driver different in 20191130 compared with the master branch based version? If so, could you use the platform driver from 20191130 while running the master based image, if this is not too difficult for you? This will help eliminate the platform driver as a potential root cause. |
@gechiang, Trial 1 from "QFX5200_Jenkins338_logs" is captured right after box is loaded with this image and 100G DAC cables are connected between port 0 and port 1. root@sonic:/home/admin# cat /var/log/swss/sairedis.rec | grep -i "SAI_PORT_OPER_STATUS_" As suggested by you, we will test on master.345 with broadcom ASIC debugs enabled and share the logs. |
@gechiang, We tested Jenkins 345 image on TH1 platform. OIR sequences performed on TH1 platform. There is no xcvrd crash with this image. |
@BaluAlluru Thanks for trying out the new image and setting the new BCM debug cmd to capture the show tech dump file. |
@gechiang, Today we tested Jenkins 345 image on TH2 platform. OIR sequences performed on TH2 platform. There is no xcvrd crash with this image. |
@BaluAlluru BRCM is requesting for a repro with SAI log level setting change. Unfortunately the swssloglevel setting for SAI components are currently broken in master branch. See the following issue I raised for more details: https://github.com/Azure/sonic-swss/issues/1348 We will wait for 1348 issue to be resolved and merged into master branch and then let's capture the info needed by BRCM to debug this issue further. |
@BaluAlluru @ciju-juniper |
@BaluAlluru @ciju-juniper The BRCM team is available to perform live debug on your failing DUT in the following two time slots next week: |
@gechiang We work out of Bangalore, India. Due to ongoing pandemic issues, we are working remotely. So we need to talk to our LAB admins and get a time-slot during our day time. We will get back to you on this. Where is the BRCM team located? |
@gechiang This is my mail id for scheduling meeting crajank[at]juniper[dot]net |
@ciju-juniper I understand that your lab was impacted by COVID19 a few weeks ago and that you were therefore not able to provide the information we requested in the offline email discussion. The request from BRCM and me was to load a master branch based gdb image and set the two GDB breakpoints we provided by email, so that you can see which of those breakpoints gets hit during your trigger. This information will help us narrow down the specific area where the issue may reside. |
@gechiang In the first instance, loaded Jenkins 404 image on TH1 platform, connected 100G DAC cable back to back in different ports on TH1 platform and tested for OIR. In the 2nd Instance,loaded Jenkins 404 image on TH1 platform and TH2 platform, connected 100G DAC cable between TH1 platform ports and TH2 platform ports. |
@BaluAlluru Thanks so much for getting back to working on this again. I have analyzed both logs that you captured. On the surface the end result of the link status seems correct, but some transitions are not being detected correctly based on what you captured. The TH1-OIR only test looks fine, but I have some questions about the TH1-TH2-OIR test result. There I noticed that when you moved the cable from one port to a different port on one platform (take the first transition, port 0 to port 1 on TH2), the output shows that TH1 did not experience a link flap while TH2 did. I don't think this is the correct behavior. Based on your captured output, it seems to me that link bounce detection is based on physical SFP unplug/plug instead of true link bounce (laser off/on) detection. But this may have to do with how you moved the cable. Can you describe in more detail how the cable move is performed? Is it manual (physically unplugged from one port and then inserted into a different port), or is it done through some automated patch panel where the signal is rerouted from one port to another? If it is manual, then we need to chase down why TH1, whose end of the cable is not removed, is not able to detect the link bounce that happens when the other end (TH2) is moved. If it is the automated patch panel case, then the patch panel's signal routing may have contributed to this behavior, and it would be out of our scope to debug this further. Can you please clarify? |
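The expectation in the comment above, that both ends of a cable should register a flap when either end is re-seated, can be checked mechanically once the oper-status sequence for each end is known. This is a hedged sketch with hypothetical helper names (`count_link_flaps`, `flaps_per_device`); it assumes the states are the `SAI_PORT_OPER_STATUS_*` strings seen in sairedis.rec, listed in time order for the one port of interest on each device.

```python
def count_link_flaps(states):
    """Count completed DOWN -> UP recoveries in a time-ordered list of oper states."""
    flaps = 0
    saw_down = False
    for state in states:
        if state.endswith("_DOWN"):
            saw_down = True
        elif state.endswith("_UP") and saw_down:
            # A DOWN followed later by an UP counts as one flap.
            flaps += 1
            saw_down = False
    return flaps


def flaps_per_device(states_by_device):
    """Map each device name to the number of flaps seen on its end of the cable."""
    return {dev: count_link_flaps(states) for dev, states in states_by_device.items()}


if __name__ == "__main__":
    UP = "SAI_PORT_OPER_STATUS_UP"
    DOWN = "SAI_PORT_OPER_STATUS_DOWN"
    # Illustrative input only, not taken from the attached logs.
    counts = flaps_per_device({"TH1": [UP], "TH2": [UP, DOWN, UP]})
    if len(set(counts.values())) > 1:
        print("flap counts differ between the two ends:", counts)
```

Differing counts for the two ends of the same cable correspond to the asymmetry questioned above.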
@gechiang OIR was done manually. We will try again that scenario and try to get the gdb attached. |
@ciju-juniper Thanks for confirming that it is "manually" unplug/plug of the cable being performed. |
@gechiang
Attached are the updated logs collected today. Please check Trial 4 in the attached logs. root@sonic:/home/admin# cat /var/log/swss/sairedis.rec | grep -i "SAI_PORT_OPER_STATUS_" |
@BaluAlluru Thanks a lot for trying out the suggested triage steps. Indeed, this looks like a problem in how BRCM SAI handles link de-bounce. Based on your new test result, there seem to be 2 issues for which we need answers from the BRCM SAI team:
Both of these issues appear to be on the TH1 device ONLY. TH2 behaved correctly in all 3 logs you captured. One more favor to ask: can you capture the SAI version and the ASIC version of your TH1 device in the BRCM shell? With this additional info I will be able to open a new case. BTW, I don't think the current issue should block your testing progress for now. When dealing with the TH1 device, you will have to perform the link test with the cable out for a longer time, and the end result should be correct. |
@gechiang root@sonic:/home/admin# bcmcmd bsv root@sonic:/home/admin# bcmcmd "show unit" root@sonic:/home/admin# bcmcmd "ver"
PHYs: BCM5400, BCM54182, BCM54185, BCM54180, drivshell> |
@BaluAlluru I have filed BRCM CSP case CS00011146863 |
@BaluAlluru BRCM has requested for the following information to help debug this issue. Please repeat the exact experiment you did with cable unplugged for longer time between TH1 and TH2 DUTs but include the following: BRCM also requested the following: |
@gechiang, We enabled Broadcom ASIC logs and repeated the same exercise. We couldn't recreate the issue after repeated tries. As part of our interoperability testing, we did the same OIR testing exercise between 2 TH1 boxes. We will update in this issue tracker, if we are able to recreate the issue. |
@BaluAlluru The two log files you sent look pretty good. No issues at all, as you also confirmed. "Is the TH1 box (QFX5200-32C-S) using an external phy or internal phy? If external, what is it?" Thanks! |
@gechiang The TH1 box (QFX5200-32C-S) has a PHY-less design. It uses the Tomahawk-1 BCM56960 chip, which has 32 Falcon cores (FCs). All the FCs are directly connected to the 32 zQSFP+ ports. |
@gechiang Tried multiple times, but couldn't hit the issue. Attached are the logs for reference. |
@ciju-juniper @BaluAlluru Thanks for supplying additional trials and clarifications. |
Description
This issue is observed in the latest SONiC images. When cables are connected to or removed from the switch ports, the interface status is not correctly reflected in 'show interfaces status'. The link status is updated correctly on the Broadcom side, but changes in link state are not propagated to the kernel: the files '/sys/class/net/Ethernet*/carrier' are not updated when OIR is done. After a system reboot, the link status is updated in the kernel and SONiC reports the status correctly.
We tried to narrow this problem down to a specific kernel version, but SONiC builds from a few weeks or months back are broken because certain package versions are missing.
The last commit on which we did not observe this issue is from Feb-26, and quite significant changes went into the kernel's 'drivers/net' directory in the 1.5 months since.
Please have a look at this issue and suggest how to debug it further.
Testing environment
Switch: QFX5200-32C-S
ASIC: TH1
Branch: master
Link is configured with 100G and the DAC cable is connected back to back with the ports in the same switch. No platform specific drivers are loaded apart from ASIC configuration files in 'device' directory.
Here are the logs from problematic image (May 20th Jenkins image)
Trial 1
100G DAC Cable connected to port 0 and port 1 after the box is rebooted.
Also "carrier" parameter is updated in "sys" directory for the ports connected with cables
Trial 2
100G DAC Cable connected to port 0 and connected to port 31.
BCMCMD also shows link up in bcm asic for these 2 ports
"carrier" parameter is not updated in "sys" directory for the ports connected with cables, still showing carrier up for port 1
Here is the dump from problematic image:
sonic_dump_sonic_20200526_071154.tar.gz
Last working commit from master branch: 1ef7403
Here are the logs from the working kernel image:
Trial 1
DAC cable is connected between physical port-0 & port-1 and system is rebooted.
Trial 2
100G DAC Cable connected to port 0 and connected to port 31.
Here is the dump from working image:
sonic_dump_sonic_20200522_143911.tar.gz