Update netlink messages handler #2233
@liorghub Thanks for the fix. Can you please add VS tests to cover the cases where the port is part of a bridge or LAG?
portsyncd/linksync.cpp
Outdated
if (master)
{
return;
LinkCache &linkCache = LinkCache::getInstance();
AFAIK, this would be handled by teamsyncd. Can you check?
@prsunny I checked. teamsyncd handles the messages sent for the port-channel interface itself; those messages are marked with type="team". The bug I fixed concerns the handling of messages for ports that belong to a port-channel. These messages are not marked with type="team".
OK. @judyjoseph, can you check this? This seems to be a basic change that was missed. @liorghub, what is the functional impact?
The functional impact is in LLDP: there we check that "netdev_oper_status" in state DB PORT_TABLE is up before sending LLDP commands. If "netdev_oper_status" is down, the LLDP command is not sent, causing wrong LLDP behavior.
See the following code in lldpmgrd:
https://github.com/Azure/sonic-buildimage/blob/cc30771f6b97234a6dd19d8f97d5dfd44551cf20/dockers/docker-lldp/lldpmgrd#L170
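To illustrate the dependency described above, here is a minimal sketch of such a gate. It is not the lldpmgrd code (lldpmgrd is a Python script); `PortEntry` and `lldpConfigAllowed` are hypothetical names, and the state DB entry is modeled as a plain in-memory map.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory stand-in for one PORT_TABLE entry in state DB.
using PortEntry = std::map<std::string, std::string>;

// Returns true only when the entry reports netdev_oper_status == "up".
// Treating a missing field as "not up" is an assumption of this sketch.
bool lldpConfigAllowed(const PortEntry &entry)
{
    auto it = entry.find("netdev_oper_status");
    return it != entry.end() && it->second == "up";
}
```

With a gate like this, a port whose "netdev_oper_status" is stuck at "down" never gets its LLDP command, which matches the impact described in this comment.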
ok. lgtm. As Xu suggested, please add VS tests to cover this.
> ok, @judyjoseph , can you check this? This seems to be basic change and missed. @liorghub, What is the functional impact?
@prsunny I did a quick check and noted down the events from syslog. I find that 'netdev_oper_status' is set much earlier for an interface, as long as the interface is connected and up. The teamd member addition happens earlier.
Apr 26 18:33:56.812132 str2---1 NOTICE swss0#orchagent: :- initializePort: Initializing port alias:Ethernet4 pid:1000000000006
Apr 26 18:33:56.817494 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:0 oper:0 addr:40:7c:7d:bb:26:0b ifindex:22 master:0
Apr 26 18:33:56.817741 str2---1 NOTICE swss0#portsyncd: :- onMsg: Publish Ethernet4(ok:down) to state db
Apr 26 18:33:56.818394 str2---1 NOTICE swss0#orchagent: :- addHostIntfs: Create host interface for port Ethernet4
Apr 26 18:33:56.833381 str2---1 NOTICE swss0#orchagent: :- setHostIntfsOperStatus: Set operation status DOWN to host interface Ethernet4
Apr 26 18:33:56.833450 str2---1 NOTICE swss0#orchagent: :- initPort: Initialized port Ethernet4
Apr 26 18:33:56.897841 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:0
Apr 26 18:33:56.898243 str2---1 NOTICE swss0#portsyncd: :- onMsg: Publish Ethernet4(ok:up) to state db
Apr 26 18:33:56.898260 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:2
Apr 26 18:33:56.898310 str2---1 NOTICE swss0#portsyncd: message repeated 2 times: [ :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:2]
Apr 26 18:33:56.900044 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:2
Apr 26 18:33:56.901037 str2---1 INFO kernel: [ 140.005295] PortChannel102: Port device Ethernet4 added
Apr 26 18:33:56.901375 str2---1 NOTICE teamd0#teammgrd: :- addLagMember: Add Ethernet4 to port channel PortChannel102
Apr 26 18:33:56.912638 str2---1 NOTICE swss0#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet4 admin:1 oper:1 addr:40:7c:7d:bb:26:0b ifindex:22 master:2
@liorghub could you share a bit more detail on when you observe this behavior? Is it always seen with LLDP? For all port-channel member interfaces, or only for interfaces that were initially oper down and became oper up after a while, once they were part of the port-channel?
@judyjoseph
Hi Judy,
The issue happens when the switch is booting.
Ethernet0 is part of a port-channel.
As you can see below, portsyncd gets several netlink messages for Ethernet0.
The last message that arrives without "master" (master:0) is at 07:19:15.359655, and it is oper down.
Later we get more messages for Ethernet0 with oper up, but we ignore them since they are marked with "master".
Interfaces that have a master can be either part of a VLAN bridge or part of a port-channel.
We want to ignore only the VLAN bridge case (confirmed with @zhenggen-xu).
Since the last message for Ethernet0 that we handle has oper down, state DB holds "netdev_oper_status" = "down", which causes the wrong LLDP behaviour.
The issue is persistent and occurs after each reboot.
See the logs below:
root@r-tigon-20:/home/admin# grep -e "nlmsg type" -e Publish /var/log/syslog | egrep "Ethernet0"
Apr 28 07:19:15.287582 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:0 oper:0 addr:1c:34:da:c9:60:68 ifindex:77 master:0 type:sx_netdev
Apr 28 07:19:15.287898 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Apr 28 07:19:15.291418 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:0 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:0 type:sx_netdev
Apr 28 07:19:15.291972 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Apr 28 07:19:15.359292 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:0 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:0 type:sx_netdev
Apr 28 07:19:15.359510 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Apr 28 07:19:15.359655 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:0 type:sx_netdev
Apr 28 07:19:15.359866 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: Publish Ethernet0(ok:down) to state db
Apr 28 07:19:15.360309 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:15.360352 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:15.365219 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:15.367925 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:0 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:27.880041 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:1 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
Apr 28 07:19:28.011930 r-tigon-20 NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet0 admin:1 oper:1 addr:1c:34:da:c9:60:00 ifindex:77 master:4 type:sx_netdev
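The effect of the sequence above can be reproduced with a small replay, sketched here under stated assumptions: `LinkEvent`, `replayOldFilter`, and `ethernet0BootSequence` are hypothetical names, the event list is a condensed subset of the log lines, and the filter models only the "ignore everything that has a master" check, not the rest of `LinkSync::onMsg`.

```cpp
#include <string>
#include <vector>

// One netlink link event, reduced to the fields visible in the syslog lines.
struct LinkEvent
{
    int nlmsgType;   // "nlmsg type" in the log; 16 is RTM_NEWLINK
    bool operUp;     // "oper:1" in the log
    unsigned master; // "master" in the log; 0 means no master
};

// Replays events through the pre-fix filter, which drops every event that
// has a master, and returns the last status that would have been published.
std::string replayOldFilter(const std::vector<LinkEvent> &events)
{
    std::string status = "unknown";
    for (const auto &ev : events)
    {
        if (ev.master)
            continue; // pre-fix behavior: ignore all messages with a master
        status = ev.operUp ? "up" : "down";
    }
    return status;
}

// Condensed Ethernet0 boot sequence taken from the log above.
std::vector<LinkEvent> ethernet0BootSequence()
{
    return {
        {16, false, 0}, // 07:19:15.287582: oper down, no master
        {16, false, 0}, // 07:19:15.359655: last message without a master
        {16, false, 4}, // 07:19:15.360309: now has master:4, still oper down
        {16, true, 4},  // 07:19:27.880041: oper up, but dropped by the filter
    };
}
```

Replaying `ethernet0BootSequence()` through `replayOldFilter` ends with "down", because the only oper-up events carry a master and are skipped.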
@judyjoseph I added a VS test as requested.
/azpw run Azure.sonic-swss |
/AzurePipelines run Azure.sonic-swss |
Azure Pipelines successfully started running 1 pipeline(s). |
Azure.sonic-swss (BuildArm arm64) is failing in download artifacts. |
The following test failed, trying to rerun: test_NeighborAddRemoveIpv6LinkLocal
@liorghub can you test the following scenario :- |
@prgeor
port part of port-channel:
port part of vlan:
For a port which is part of a VLAN, there is indeed an inconsistency between the databases.
@prgeor @zhenggen-xu
@liorghub OK
@liorghub thought about this a little more, I think the right fix should be changing:
to:
What we were really trying to avoid before was removing the port itself when the port was removed from the bridge. I think this should apply to LAG too (in case the port was removed from the LAG), hence the changes above. This should also fix the inconsistency of the link status across the tables that you mentioned above. My email: [email protected] , we can meet in Teams.
…is being removed from bridge
@zhenggen-xu your fix made it work, thanks!
@@ -215,7 +215,7 @@ void LinkSync::onMsg(int nlmsg_type, struct nl_object *obj)
     /* If netlink for this port has master, we ignore that for now
      * This could be the case where the port was removed from VLAN bridge
      */
-    if (master)
+    if (master && nlmsg_type == RTM_DELLINK)
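The narrowed condition can be isolated as a small predicate. This is a sketch, not the code in linksync.cpp: `shouldIgnore` and the two constants are hypothetical names, and the numeric values mirror RTM_NEWLINK and RTM_DELLINK from `<linux/rtnetlink.h>` so the snippet builds without kernel headers.

```cpp
// Local copies of the rtnetlink message type values (RTM_NEWLINK is 16,
// RTM_DELLINK is 17); hypothetical names used only for this sketch.
constexpr int kRtmNewLink = 16;
constexpr int kRtmDelLink = 17;

// Mirrors the condition in the diff: a message is ignored only when the
// port has a master AND the message is a DELLINK.
constexpr bool shouldIgnore(unsigned master, int nlmsgType)
{
    return master != 0 && nlmsgType == kRtmDelLink;
}
```

Under this predicate, a NEWLINK update for a LAG member (for example master:4 with oper:1) is processed again, while a DELLINK for a port that still has a master stays ignored.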
Let's change the comment section above to state that we ignore the DELLINK message if the port has a master; this applies to the case where the port was part of a VLAN or LAG, etc. You should rename the PR title too.
Done.
@zhenggen-xu, @prgeor - Now that all the comments have been addressed, can you please approve this PR?
What I did
Ignore netlink DELLINK messages if the port has a master. This applies to the case where the port was part of a VLAN bridge or LAG.
Why I did it
The netlink message handler in portsyncd was ignoring all messages that had a master.
Therefore we ignored messages on interfaces that belong to a LAG (not only on interfaces that belong to a bridge, as intended).
The result was "netdev_oper_status" down in PORT_TABLE in state DB for a port which is part of a LAG, although it is actually up.
How I verified it
Check "netdev_oper_status" in PORT_TABLE in state DB for a port which is part of a LAG.