[portsorch] process only updated APP_DB fields when port is already created #3025
orchagent/portsorch.cpp
@@ -4170,15 +4164,9 @@ void PortsOrch::doPortTask(Consumer &consumer)
}
}

-if (!serdes_attr.empty())
+if (!serdes_attr.empty() && p.m_preemphasis != serdes_attr)
@stepanblyschak this will prevent OA from setting the SAME serdes SI settings if the user updates any of the PORT table attributes in CONFIG_DB. It will also unintentionally prevent applying the SAME SI settings when the user plugs out the optics and inserts them back. To avoid this, we should clear/unset the p.m_preemphasis map when OA gets notified of optics removal here
@stepanblyschak another use case where you would want to re-apply the same SI settings is when the user shuts the interface and then does a no-shut. This should retrain the link with the same SI settings, so you should also clear the p.m_preemphasis map on admin down.
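A minimal sketch of the cache-invalidation idea from the two comments above, assuming a hypothetical, simplified PortSketch type in place of the real Port class and a counter in place of the SAI call. It shows how the `p.m_preemphasis != serdes_attr` check skips an unchanged SI push, and how clearing the cache on optics removal or admin down forces the same settings to be re-applied.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the per-port serdes cache (p.m_preemphasis).
using SerdesMap = std::map<std::string, std::string>;

struct PortSketch
{
    SerdesMap cachedSerdes;      // last serdes settings pushed down
    int serdesProgramCount = 0;  // counts calls to the (hypothetical) SAI set
};

// Mirrors the `!serdes_attr.empty() && p.m_preemphasis != serdes_attr` check.
void applySerdesIfChanged(PortSketch &p, const SerdesMap &serdes)
{
    if (!serdes.empty() && p.cachedSerdes != serdes)
    {
        ++p.serdesProgramCount;  // stands in for programming the ASIC
        p.cachedSerdes = serdes;
    }
}

// Proposed invalidation points: optics removal and admin down.
void onOpticsRemoved(PortSketch &p) { p.cachedSerdes.clear(); }
void onAdminDown(PortSketch &p) { p.cachedSerdes.clear(); }
```

Without the two invalidation hooks, a re-insert that pushes identical SI values would be skipped entirely.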
@prgeor Do you mean SONiC must set the same serdes SI attributes if the module is re-inserted? Is that per the SAI spec? What is libSAI supposed to return when the user queries serdes attributes after the module is re-inserted?
For example, there is no such requirement on the NVIDIA platform.
In case of shutdown/startup, why doesn't libSAI handle that? For example, we don't re-set the speed/fec/autoneg parameters when the interface goes admin down and then up.
@prgeor I think clearing m_preemphasis to handle the plug-out event could be risky, because there's no guarantee orchagent will first process STATE_TRANSCEIVER_INFO_TABLE_NAME and then APP_PORT_TABLE_NAME if both of their consumers have pending data.
I am working on a different solution that makes serdes config entirely dependent on xcvrd's request.
@stepanblyschak can you elaborate on the risk? Do you suspect OA may miss the insert/remove event of the transceiver?
@prgeor No, the risk is that there's no guarantee orchagent will first process STATE_TRANSCEIVER_INFO_TABLE_NAME and then APP_PORT_TABLE_NAME if both consumers have data.
Then, serdes config won't be applied.
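To make the ordering risk concrete, here is a hypothetical, deterministic simulation; the handler names are illustrative and do not correspond to real orchagent functions. The same SI values are cached before a plug-out/plug-in, both consumers already hold pending data, and the outcome depends only on which one is drained first.

```cpp
#include <cassert>
#include <map>
#include <string>

using SerdesMap = std::map<std::string, std::string>;

struct PortSketch
{
    SerdesMap cachedSerdes;
    bool serdesApplied = false;
};

// Hypothetical APP_DB PORT_TABLE handler: skip serdes matching the cache.
void handlePortTable(PortSketch &p, const SerdesMap &serdes)
{
    if (!serdes.empty() && p.cachedSerdes != serdes)
    {
        p.serdesApplied = true;
        p.cachedSerdes = serdes;
    }
}

// Hypothetical STATE_DB TRANSCEIVER_INFO removal handler: drop the cache.
void handleTransceiverRemoved(PortSketch &p) { p.cachedSerdes.clear(); }

// Returns whether serdes gets programmed for the chosen processing order.
bool replugAppliesSerdes(bool transceiverFirst)
{
    const SerdesMap si{{"preemphasis", "0x1234"}};
    PortSketch p;
    p.cachedSerdes = si;  // identical SI was applied before the plug-out

    if (transceiverFirst)
    {
        handleTransceiverRemoved(p);
        handlePortTable(p, si);       // cache empty: serdes applied
    }
    else
    {
        handlePortTable(p, si);       // matches stale cache: skipped
        handleTransceiverRemoved(p);  // cleared too late
    }
    return p.serdesApplied;
}
```

Only the transceiver-first order re-applies serdes, which is the failure mode described above.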
@prgeor Please check the new approach.
I made the change so that, while the port is not yet created, field-value pairs are aggregated in m_portConfigMap, but once the port is created, only the updated field-value pairs are processed.
This makes orchagent unconditionally set serdes attributes whenever xcvrd pushes them to APP_DB, and it also fixes the problem observed when mtu is configured: with the new approach, only the mtu field has is_set, so serdes isn't re-configured. Serdes configuration becomes fully controlled by xcvrd, with no special logic in orchagent for it.
Regarding admin status change, I do think it is out of scope for this PR. It is not required on the Nvidia platform. Even if it is required by the SAI spec, the original serdes support implementation (see 202205, for example) did not handle admin status changes, and since it did not aggregate field-value pairs, it didn't re-configure serdes on every port field update, including admin status changes.
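A simplified sketch of the aggregate-then-delta behavior described above. PortConfigSketch and onPortTableSet are hypothetical names, and a plain map stands in for m_portConfigMap; this is an illustration under those assumptions, not the actual PortsOrch code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using FieldValues = std::vector<std::pair<std::string, std::string>>;
using FieldMap = std::map<std::string, std::string>;

struct PortConfigSketch
{
    bool created = false;
    FieldMap aggregated;  // plays the role of m_portConfigMap before creation

    // Returns the field-value pairs to process for this APP_DB update.
    FieldMap onPortTableSet(const FieldValues &update)
    {
        FieldMap delta;
        for (const auto &fv : update)
        {
            delta[fv.first] = fv.second;
        }

        if (!created)
        {
            // Not created yet: keep merging so creation sees every field.
            for (const auto &fv : delta)
            {
                aggregated[fv.first] = fv.second;
            }
            return aggregated;
        }

        // Already created: only the fields carried by this update.
        return delta;
    }
};
```

With this split, an mtu-only update after creation yields a single field, so cached serdes values are never re-sent.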
1630af0 to bd5e554
@@ -4228,6 +4264,13 @@ void PortsOrch::doPortTask(Consumer &consumer)
/* create host_tx_ready field in state-db */
initHostTxReadyState(p);

// Restore admin status if the port was brought down
if (admin_status != p.m_admin_state_up)
@stepanblyschak why the special handling for admin_status? Isn't it already taken care of here:
sonic-swss/orchagent/port/porthlpr.cpp, line 736 in 9c995f0:
port.admin_status.is_set = true;
@prgeor Setting some attributes, like speed, requires bringing the port admin down. Hence, at the end of an update we need to bring the port admin up again. If we aggregated all fvs on every update, we wouldn't need any special handling, since admin_status would always be set as per APPL_DB.
However, because this aggregation is removed, we need to take special care to restore the port admin state.
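The restore step can be sketched as follows. This is a hypothetical model: AdminPortSketch, setAdmin and applySpeed are illustrative names, speed is the only attribute assumed to need admin down, and the admin state to restore is captured before the update starts.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct AdminPortSketch
{
    bool adminUp = true;
    std::vector<std::string> log;  // records the order of operations
};

void setAdmin(AdminPortSketch &p, bool up)
{
    p.adminUp = up;
    p.log.push_back(up ? "admin up" : "admin down");
}

// Applies a speed change, then restores the admin state captured up front,
// since a delta-only update may not carry admin_status at all.
void applySpeed(AdminPortSketch &p, const std::string &speed)
{
    const bool adminStatus = p.adminUp;  // state to restore afterwards
    if (p.adminUp)
    {
        setAdmin(p, false);  // bring the port down for the speed change
    }
    p.log.push_back("speed " + speed);
    if (adminStatus != p.adminUp)
    {
        setAdmin(p, adminStatus);  // restore admin status if brought down
    }
}
```

A port that was already admin down stays down and is never toggled.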
orchagent/portsorch.cpp
for (const auto &cit : kfvFieldsValues(keyOpFieldsValues))
{
    auto fieldName = fvField(cit);
    auto fieldValue = fvValue(cit);

    SWSS_LOG_INFO("FIELD: %s, VALUE: %s", fieldName.c_str(), fieldValue.c_str());

    fvMap[fieldName] = fieldValue;
}
@stepanblyschak can we move this into a function? It seems to be repeated at line 3539 as well.
@prgeor fixed
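One way the repeated loop could be factored out, sketched with local type aliases that mirror the swss-common tuple shapes so it compiles on its own. The helper name fieldValuesToMap is hypothetical, and the logging call is omitted.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Local stand-ins mirroring swss-common's FieldValueTuple and
// KeyOpFieldsValuesTuple, so the sketch builds without swss-common.
using FieldValueTuple = std::pair<std::string, std::string>;
using KeyOpFieldsValuesTuple =
    std::tuple<std::string, std::string, std::vector<FieldValueTuple>>;

// Hypothetical helper: collects an entry's field-value pairs into a map.
// A field that appears twice keeps its last value.
std::map<std::string, std::string> fieldValuesToMap(const KeyOpFieldsValuesTuple &kfv)
{
    std::map<std::string, std::string> fvMap;
    for (const auto &fv : std::get<2>(kfv))  // kfvFieldsValues() in swss-common
    {
        fvMap[fv.first] = fv.second;         // fvField() / fvValue()
    }
    return fvMap;
}
```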
@stepanblyschak can you test the sequence below? Step 1: Insert a cable that needs custom SI settings on the ASIC side on, say, port A. Ensure the link comes up in both of the above cases.
Signed-off-by: Stepan Blyschak <[email protected]>
@prgeor Replied over email
@stepanblyschak, @prgeor, in general orchagent code should be idempotent, so even if the same value is set again, it should pass the value down to the SAI implementation. From what I understand, it's causing flaps in some cases. The PR looks good to me; I just wanted to share the thought.
…reated (sonic-net#3025) * [portsorch] process only updated APP_DB fields when port is already created What I did Fixing an issue when setting some port attribute in APPL_DB triggers serdes parameters to be re-programmed with port toggling. Made portsorch to handle only those attributes that were pushed to APPL_DB, so that serdes programming happens only by xcvrd's request to do so.
Hi @mihirpat1, for the cherry-pick to 202305, could you test it with a 202305 image to avoid regressions?
Hi @StormLiangMS,
Cherry-pick PR to 202305: #3127
Cherry-pick PR to 202311: #3147
What I did
Fix an issue where setting a port attribute in APPL_DB triggers serdes parameters to be re-programmed, toggling the port. Made portsorch handle only the attributes that were pushed to APPL_DB, so that serdes programming happens only at xcvrd's request.
Why I did it
To fix an issue seen on warm-reboot. After orchagent reconciles, portmgrd sends another update to the APPL_DB PORT_TABLE to sync. That triggers port flapping, because portsorch handles not just the fields that were pushed but all fields it has cached in m_portConfigMap. This also causes serdes attributes to be unconditionally re-set on any port attribute change, e.g.:
How I verified it
I verified that changing the mtu of the interface causes no port toggling, and ran the warm-reboot test on a T0 topology.
Details if related
Request for 202311