Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CMIS]Improved 400G link bring up sequence #254

Merged
merged 11 commits into from
Jun 28, 2022

Conversation

prgeor
Copy link
Collaborator

@prgeor prgeor commented Apr 11, 2022

Description

Improved 400G link bring up sequence HLD

  1. Ensure CMIS compliant transceiver Datapath are initialized and activated IFF the host Tx signal is good and admin_status is UP
  2. get_all_namespace() API is not reentrant resulting in crash, so we cache the namespace at the xcvrd start
  3. Datapath is deactivated if either 'admin_status' becomes down or 'host_tx_ready' signal becomes OFF.
  4. An option "--skip_xcvrd_cmis_mgr" is added to skip launch of CMIS task manager for platforms that are in transition to using SFP-refactor code
  5. Datapath is NOT impacted if Xcvrd crash/restarts

Motivation and Context

Deterministically improve the link bring up of 400G link that are using CMIS compliant transceivers.

How Has This Been Tested?

Tested on

  1. Cisco 8800 chassis: Being validated by Cisco
  2. Dell Z9332F

Additional Information (Optional)

keys = port_tbl.getKeys()

if lport in keys:
cmd = 'sonic-db-cli STATE_DB hget "PORT_TABLE|{}" host_tx_ready'.format(lport)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pmon is global container, how this wok under multi NPU? I mean sonic-db-cli does not require to look namesapce db ?

Copy link

@shyam77git shyam77git Apr 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pmon is global container, how this wok under multi NPU? I mean sonic-db-cli does not require to look namesapce db ?

@prgeor updated that he would make this thing namespace aware i.e. xcvrd would subscribe/listen to all namespace's stateDB and then act accordingly. This piece yet to be added.
Prince may comment/update further

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sacnaik @shyam77git please check the updated code

# D.2.2 Software Deinitialization
api.set_datapath_deinit(host_lanes)
api.set_lpmode(True)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are changing generic code. Isn't this set_lpmode is required function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each Datapath is independent, this code will bring the whole module to low power 8 times if we have 8 data paths being initialized.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like this PR will perform CMIS application selection one by one for each datapath(i.e. logical port on the MAC) instead of a one-shot configuration for all the datapatches on a transceiver.

What's unclear to me is as follows:

  1. Is there a switch ASIC available to support asynchronous mixed port mode in a port macro? (i.e. Is it possible to have both 1x100G-NZR + 2x100G-PAM4 on a single QSFPDD port, from MAC perspective?)
  2. Is there a CMIS transceiver available to support asynchronous mixed port mode in a single module?

In the above scenario, each datapath will require a dedicated clock divisor while they all share the same reference clock source. As far as I know, this is NOT a supported port mode in Broadcom ASIC, but I'm not sure if this is also applicable to other vendors, could you please clarify if there is a such module available for testing? Thank you

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Xcvr is lisenting to various port change events related to a ports speed, lane mapping etc and these events across various logical ports in the case of breakout port (say one physical port mapping to 4 logical ports) will arrive asynchronously, so we need to handle these events independently, even though it belong to same physical port.

self.port_dict[lport]['cmis_state'] = self.CMIS_STATE_READY
continue

# Skip if it's not a paged memory device
if api.is_flat_memory():
self.log_notice("{}: skipping CMIS state machine for flat memory xcvr".format(lport))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be periodically logging this, then it is not right.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the state machine moves to CMIS_STATE_READY after this, and doesnot come again here unless the state is changed.

# NOTE: Some CMIS compliant modules may have 'auto-squelch' feature where
# the module won't take datapaths to Activated state if host tries to enable
# the datapaths while there is no good Tx signal from the host-side.
if not is_host_tx_ready(lport):
Copy link

@jaganbal-a jaganbal-a Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be implemented as a event handler?

  1. Subscribe to PORT_TABLE in state-db, then filter to host_tx_ready field.
  2. Based on the host_tx_ready status, set the appropriate CMIS module state will be set here in the event handler.
  3. Since the main CMIS task worker keep checking for CMIS state change for all ports, once the point (2) is set, the CMIS task worker will work on the appropriate state.
  4. we can introduce a new state between CMIS_STATE_AP_CONF and CMIS_STATE_DP_INIT, say 'CMIS_STATE_WAIT_FOR_HOST' . so during module insertion the CMIS task worker will set to this state once it completes CMIS_STATE_AP_CONF and this state will be changed to CMIS_STATE_DP_INIT only in the host_tx_ready handler. this way it will be synchornized.

Keeping the host_tx_ready processing as event handler will have following benefit.

  1. Keeping the host_tx_ready out of CMIS task worker to event handler will benefit other speed to follow the deterministic link bringup approach.
  2. It will avoid continuous polling of host-tx-ready using Redis-DB query, which will save time when scaled to fully populated 48/72 ports.

The host_tx_ready event handler can invoke the xcvr_API_factory API say 'tx_disable' which will set appropriate CMIS state in case of CMIS module and it will disable/enable laser/Tx appropriately in case of non-cmis module, which can be leveraged eventually.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaganbal-a I think event handling may NOT work when Xcvrd restarts. Consider this:-

  1. 400G Port link is up.
  2. Xcvrd crashes and in the meantime someone plugs in another 400G module
  3. Xcvrd won't get another 'host_tx_ready' event because 'host_tx_ready' doesn't depend upon whether module is present or not. So, Xcvrd will keep waiting for 'host_tx_ready' or unless user does admin shut followed by admin up.

Another similar case as above

  1. Xcvrd is running and got 'host_tx_ready' but module is absent
  2. Xcvrd crashes and comes up
  3. Module is plugged in but Xcvrd is not going to receive 'host_tx_ready' and link won't come up

Copy link

@jaganbal-a jaganbal-a Apr 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prince
It is captured as part of design that the host_tx_ready will be updated regardless of transceiver presence, and it is xcvrd job to check the host_tx_ready during module Init during insertion (a local host_tx_ready dictionary will help here but not polling), so the DP state will be moved to DP_INIT/DP_ACTIVATE state based on host_tx_ready. Above mentioned cases will be covered.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaganbal-a Xcvrd process don't have a persistent storage to capture the 'host_tx_ready' state, so it depends upon the Redis-DB i.e it needs to read the 'host_tx_ready' state from the DB. Let Xcvrd read the state when it requires rather than being notified. Because if we have to take the approach of notification, then we need to handle the above Xcvrd exit case separately by reading the DB which can be avoided if Xcvrd can read the DB when required.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor

Let Xcvrd read the state when it requires rather than being notified.

Can you clarify as to what is meant by 'when it requires'?
Does it mean continuous polling of host-tx-ready using Redis-DB query OR what?
When interface/port moves from config shut to no shut or vice-versa, won't it a notification of host_tx_ready from redisDB otherwise how would it xcvrd be notified of this? repeated/continuous polling indefinitely without any event happening in the system is waste of cpu cycles / system resources.
Also config interface shut/no shut are rare events along with optics removal/insertion
Also, per HLD we decided to go with notification based mechanism of host_tx_ready.
To handle xcvrd restart case, as Jagan mentioned above, better to check the host_tx_ready during module Init during insertion as source of truth.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shyam77git @jaganbal-a please take a look again as i updated the code.

@prgeor prgeor marked this pull request as ready for review April 19, 2022 23:22
@prgeor
Copy link
Collaborator Author

prgeor commented Apr 19, 2022

@ds952811 can you review

@prgeor prgeor changed the title Improved 400G link bring up sequence [CMIS]Improved 400G link bring up sequence Apr 19, 2022
@@ -870,9 +872,11 @@ def __init__(self, port_mapping):
self.task_process = None
self.port_dict = {}
self.port_mapping = copy.deepcopy(port_mapping)
self.xcvr_table_helper = XcvrTableHelper()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If XcvrTableHelper is implemented as a class member, it should be initialized in task_worker instead of __init__.
This is because CmisManagerTask is a subprocess wrapper, which means a subprocess will be forked once task_run is called.
If it is initialized in __init__, the flow is

  • parent creates the XcvrTableHelper along with all the low level objects, like the sockets based on which the tables are accessed
  • and then the child is forked, all resources opened by the parent are duplicated
  • now there are two copies of all the resources but the parent's one will not really be used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, initializing it in __init__ and then duplicating it from parent to child should also work. But it looks like taking a detour to me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephenxs what is the concern here? only the worker thread is going to use the socket, not the parent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xcvr_table_helper is initialized in the parent and then duplicated in the client. but it will never be used by the parent.
so the resource allocated in the parent is wasted.

  • if it is initialized in __init__:
    • xcvr_table_helper initialized in parent. resource allocated for it but not used.
    • duplicated in the child during fork
  • if it is initialized in task_worker
    • xcvr_table_helper is initialized in the child before it starts to work.
    • no resource was allocated in the parent.

as I mentioned, it also works in the current way. so I'm ok if you do not decide to change it.

return
# 'index' can be -1 if STATE_DB|PORT_TABLE
# TODO:Make sure doesnot breakt non-CMIS
#ptype = _wrapper_get_sfp_type(pport)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe get_sfp_type() returns the sfp_type present in the port and not the port cage type.
So can you rename the variable and uncomment the code?

Copy link
Contributor

@ds952811 ds952811 Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_sfp_type() returns only the port/cage type, and it's a constant initialized at chassis.init()
e.g.
https://github.com/Azure/sonic-buildimage/blob/master/platform/broadcom/sonic-platform-modules-dell/z9332f/sonic_platform/chassis.py#L132
And a QSFP28 could be attached to the QSFPDD port on the box, and it's better to skip the CMIS init sequence for it, hence please uncomment the code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ds952811 form-factor of the transceiver is not the right check to determine whether the module follows CMIS or not. Even QSFP form-factor may follow CMIS sonic-net/sonic-platform-common#246 that follows CMIS

@@ -1150,6 +1182,14 @@ def task_worker(self):
self.CMIS_STATE_REMOVED]:
continue

# Handle the case when Xcvrd was NOT running when 'host_tx_ready' or 'admin_status'
# was updated or this is the first run so reconcile the above two attributes
if 'host_tx_ready' not in self.port_dict[lport]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will not get hit if the port is CMIS_STATE_READY, so with config-interface shutdown the event will be returned in the above if condition and the modules will not be put in DP-DEACTIVATE.

If any runtime Tx fault handler being implemented and the host_tx_ready is set to false, then the event will be missed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMIS state reset happens if any port configuration is changed including admin or host_tx_ready status here :https://github.com/Azure/sonic-platform-daemons/blob/master/sonic-xcvrd/xcvrd/xcvrd.py#L919

def task_worker(self):
self.log_notice("Starting...")
self.log_notice("Starting CmisTaskMgr with pid:{}".format(self.task_process.pid))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you checked the syslog of pmon#xcvrd, the pid is available over there, on the other hand, we already have the 'CMIS' prefix in the CmisManagerTask.log_notice(), hence this is not necessary


# APPL_DB for CONFIG updates, and STATE_DB for insertion/removal
sel, asic_context = port_mapping.subscribe_port_update_event(['APPL_DB', 'STATE_DB'])
sel, asic_context = port_mapping.subscribe_port_update_event()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While APPL_DB is for monitoring config updates arrived at swss#orchagent, the STATE_DB is for the transceiver insertion/removal detection, and it's not necessary to have both listeners registered all the time.
May I have a clarification on why we need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ds952811 A new signal "host_tx_ready" is introduce for 400G QSFP-DD transceiver datapath activation as per this HLD where the OA will control this signal and Xcvrd will listen to any change of this signal in STATE_DB.

@@ -1284,6 +1327,13 @@ def task_worker(self):
try:
# CMIS state transitions
if state == self.CMIS_STATE_INSERTED:
if self.port_dict[lport]['host_tx_ready'] != 'True' or \
Copy link

@jaganbal-a jaganbal-a May 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change it 'true' and 'false' for host_tx_ready handling through out the file as we discussed.
Also change it in test_xcvrd.py

if port_change_event.port_dict is not None and 'host_tx_ready' in port_change_event.port_dict:
self.port_dict[lport]['host_tx_ready'] = port_change_event.port_dict['host_tx_ready']
if port_change_event.port_dict is not None and 'admin_status' in port_change_event.port_dict:
self.port_dict[lport]['admin_status'] = port_change_event.port_dict['admin_status']
self.port_dict[lport]['cmis_state'] = self.CMIS_STATE_INSERTED

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the CMIS state to inserted and invoke reset_cmis() only when there is a change in value for the following fields speed / lanes / host_tx_ready or admin_status.
With current change, the cmis state is set to INSERTED even for oper_state state change in value in the DB for a Port.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the port change event doesn't have a mechanism to filter out the event based upon the corresponding PORT_TABLE. So, until we have this mechanism to filter out, we need this code here otherwise Xcvrd will miss optics jack out/in

if port_change_event.port_dict is not None and 'host_tx_ready' in port_change_event.port_dict:
self.port_dict[lport]['host_tx_ready'] = port_change_event.port_dict['host_tx_ready']
if port_change_event.port_dict is not None and 'admin_status' in port_change_event.port_dict:
self.port_dict[lport]['admin_status'] = port_change_event.port_dict['admin_status']
self.port_dict[lport]['cmis_state'] = self.CMIS_STATE_INSERTED

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is redundant, reset_cmis() is doing this.

sonic-xcvrd/xcvrd/xcvrd.py Show resolved Hide resolved
@rlhui rlhui added the Chassis label Jun 10, 2022
if 'admin_status' in port_change_event.port_dict and 'oper_status' in port_change_event.dict:
# At times 'admin_status' is NOT the same in the PORT_TABLE of APPL_DB and STATE_DB
# We dont have better way to check if 'admin_status' is from APPL_DB or STATE_DB so this
# check is put temporarily to list only to APPL_DB's admin_status and ignore that of STATE_DB

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of 'list only' it should be 'listen only'

@@ -994,7 +994,10 @@ def on_port_update_event(self, port_change_event):
self.port_dict[lport]['lanes'] = port_change_event.port_dict['lanes']
if 'host_tx_ready' in port_change_event.port_dict:
self.port_dict[lport]['host_tx_ready'] = port_change_event.port_dict['host_tx_ready']
if 'admin_status' in port_change_event.port_dict:
if 'admin_status' in port_change_event.port_dict and 'oper_status' in port_change_event.dict:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in port_change_event.dict: -> in port_change_event.port_dict:

Copy link

@shyam77git shyam77git left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jaganbal-a
Copy link

if port_change_event.event_type == port_change_event.PORT_SET:
if pport >= 0:
self.port_dict[lport]['index'] = pport
if port_change_event.port_dict is not None and 'speed' in port_change_event.port_dict:
if 'speed' in port_change_event.port_dict and port_change_event.port_dict['speed'] != 'N/A':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a trace for each event that is processed.

@prgeor prgeor merged commit 7c0be24 into sonic-net:master Jun 28, 2022
yxieca pushed a commit that referenced this pull request Jul 28, 2022
* Improved 400G link bring up sequence

* Event based handling

* Remove unused functions

* Force DP to remain in DpInitialized State state on admin shutdown

* Added test case

* Skip CMIS task manager based upon flag

* Addressed review comments

* Fix xcvrd crash

* Fix test failure

* Listen only to 'APPL_DB's admin_status

* Fix typo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants