-
Notifications
You must be signed in to change notification settings - Fork 558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fabricportsorch] Add fabric support #1459
Conversation
This pull request introduces 1 alert and fixes 1 when merging 6162bbe into e4dfb37 - view on LGTM.com new alerts:
fixed alerts:
|
|
||
if (attr.value.booldata) | ||
{ | ||
attr.id = SAI_PORT_ATTR_FABRIC_ATTACHED_SWITCH_ID; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the crc error collection here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we detect if a link flapped between the polling period? Is that a count that can be maintained in STATE_DB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add counter list, one counter for one port, that counts the number of times the link flaps. I also maintain the last time the link flaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CRC error collection will be added in another PR. It's because sonic-sairedis now doesn't understand attribute type which is used for getting fabric port error list. So we will need to modify sonic-sairedis, and then add CRC in sonic-swss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Sounds good to me.
@minionatwork (Suresh) - please add as reviewer. |
wonder why this pr does not have swss vs test cases added? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add swss vs tests.
Sure, I will look into that. |
This pull request introduces 1 alert and fixes 1 when merging 4770917 into b9084a7 - view on LGTM.com new alerts:
fixed alerts:
|
This pull request introduces 1 alert and fixes 1 when merging 91df50a into 189a964 - view on LGTM.com new alerts:
fixed alerts:
|
Hello @lguohan: vstest for fabric ports has been updated now. There are PRs in The above PRs merged, but their code doesn't get invoked until sonic-net/sonic-buildimage#6185 merges. That is to set up fabric lane mapping, and so to create fabric ports. Because if we merge it, the following tests will fail:
Why is that? Because now
In this PR's So one possible solution is that in this PR, I won't update the expected number with fabric ports (keep 32 only for front panel ones). So the tests won't fail. And we will merge it up first. Once it merges, we disable the tests, merge sonic-net/sonic-buildimage#6185, and then make another PR to reenable the tests. sonic-net/sonic-buildimage#6185 is fairly simple, so it won't take much down time for the tests. Please let me know what you think? |
can you add design document link to this PR in description |
}; | ||
|
||
static const vector<sai_queue_stat_t> queue_stat_ids = | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these valid for Fabric ports ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, those are valid for fabric ports. But currently we are not able to query those stats yet until jupiter-sai completely supports them. I keep them here though.
Those stats will be setup in generateQueueStats(), which is currently empty. For reference purpose, generateQueueStats() will do the below thing.
For each port:
// Maintain queue map and install flex counters for queue stats
vector<FieldValueTuple> laneQueueMap;
for (size_t q = 0; q < m_queue_ids.size(); q ++)
{
std::ostringstream name;
name << "FabricPort" << lane << ":" << q;
laneQueueMap.emplace_back(name.str(), sai_serialize_object_id(m_queue_ids[q]));
std::unordered_set<string> counter_stats;
for (const auto& it: queue_stat_ids)
{
counter_stats.emplace(sai_serialize_queue_stat(it));
}
queue_stat_manager.setCounterIdList(m_queue_ids[q], CounterType::QUEUE, counter_stats);
}
m_laneQueueCounterTable->set("", laneQueueMap);
I'll update generateQueueStats() to use queue_stat_ids
once we have fully support from jupiter-sai.
Orch(appl_db, tableNames), | ||
port_stat_manager(FABRIC_PORT_STAT_COUNTER_FLEX_COUNTER_GROUP, StatsMode::READ, | ||
FABRIC_PORT_STAT_FLEX_COUNTER_POLLING_INTERVAL_MS, true), | ||
queue_stat_manager(FABRIC_QUEUE_STAT_COUNTER_FLEX_COUNTER_GROUP, StatsMode::READ, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i assume this is for queue_stat_ids ? So again is this needed for fabric port.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, once jupiter-sai supports queue stats query, queue_stat_manager
is used to set up queue_stat_ids
in FLEX_COUNTER_DB
. syncd
periodically does its job by reading those counters and updating COUNTERS_DB
.
m_stateTable = unique_ptr<Table>(new Table(m_state_db.get(), FABRIC_PORT_TABLE)); | ||
|
||
m_counter_db = shared_ptr<DBConnector>(new DBConnector("COUNTERS_DB", 0)); | ||
m_laneQueueCounterTable = unique_ptr<Table>(new Table(m_counter_db.get(), COUNTERS_QUEUE_NAME_MAP)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does COUNTERS_QUEUE_NAME_MAP and COUNTERS_QUEUE_PORT_MAP looks like for fabric port ?
Can you add comment in the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once jupiter-sai supports port and queue stats completely for fabric ports on both fabric and voq asic, this will look like below in COUNTERS_DB
:
127.0.0.1:6379[2]> select 2
OK
127.0.0.1:6379[2]> HGETALL "COUNTERS_QUEUE_PORT_MAP"
1) "FabricPort0"
2) "oid:0x401000000000006"
127.0.0.1:6379[2]> HGETALL "COUNTERS:oid:0x401000000000006"
1) "SAI_PORT_STAT_IF_OUT_FABRIC_DATA_UNITS"
2) "0"
3) "SAI_PORT_STAT_IF_IN_ERRORS"
4) "0"
5) "SAI_PORT_STAT_IF_IN_OCTETS"
6) "0"
7) "SAI_PORT_STAT_IF_IN_FABRIC_DATA_UNITS"
8) "0"
9) "SAI_PORT_STAT_IF_IN_FEC_SYMBOL_ERRORS"
10) "0"
11) "SAI_PORT_STAT_IF_OUT_OCTETS"
12) "0"
13) "SAI_PORT_STAT_IF_IN_FEC_CORRECTABLE_FRAMES"
14) "0"
15) "SAI_PORT_STAT_IF_IN_FEC_NOT_CORRECTABLE_FRAMES"
16) "0"
127.0.0.1:6379[2]> HGETALL "COUNTERS_QUEUE_NAME_MAP"
1) "FabricPort0:0"
2) "oid:0x4150000000000c6"
127.0.0.1:6379[2]> HGETALL "COUNTERS:oid:0x4150000000000c6"
1) "SAI_QUEUE_STAT_CURR_OCCUPANCY_LEVEL"
2) "0"
3) "SAI_QUEUE_STAT_WATERMARK_LEVEL"
4) "24"
5) "SAI_QUEUE_STAT_CURR_OCCUPANCY_BYTES"
6) "0"
@@ -49,6 +50,7 @@ OrchDaemon::OrchDaemon(DBConnector *applDb, DBConnector *configDb, DBConnector * | |||
m_chassisAppDb(chassisAppDb) | |||
{ | |||
SWSS_LOG_ENTER(); | |||
m_select = new Select(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are we doing here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are we doing here ?
ok got it. we have moved to base class for select object initialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I had to move it to base class since it's used for both npu and fabric.
@@ -68,6 +70,7 @@ OrchDaemon::~OrchDaemon() | |||
for(; it != m_orchList.rend(); ++it) { | |||
delete(*it); | |||
} | |||
delete m_select; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as previous what is this change for ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok got it. we have moved to base class for select object initialization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I had to move it to base class since it's used for npu and fabric too.
|
||
if (m_getFabricPortListDone) | ||
{ | ||
updateFabricPortState(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add link to the PR description where these State DB fields are define. Also looks like current VS test case added as part of this do not cover state db changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I updated the description with the link.
The VS test already verifies the number of fabric ports in STATE_DB
which is 16 fabric ports created for voq asic. Please look at the test part:
# Verify that fabric ports are monitored in STATE_DB
if metadata.get('switch_type', 'npu') in ['voq', 'fabric']:
self.get_state_db()
self.state_db.wait_for_n_keys("FABRIC_PORT_TABLE", 16)
Below is what we see from a container of a linecard created by a vs test.
root@a5959e6e0a4a:/# sonic-db-cli STATE_DB HGETALL 'FABRIC_PORT_TABLE|PORT10'
{'STATUS': 'down'}
In addition to the change in the test that verifies STATE_DB
, sonic-net/sonic-sairedis#737 merged to vslib to create fabric ports for voq asics.
Note that we don't test fabric asics because it will require significant changes in the tests - all tests now assume testing asic is NPU with front panel ports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. Have some comments for clarification.
Hello @abdosi, could you please check my responses, and see if you can approve? One question is actually here that I will need to comment out 5 failing tests. Are you OK with that? The reason is that there are PRs in sonic-sairedis which merged to add support fabric ports in vslib - sonic-net/sonic-sairedis#737 and sonic-net/sonic-sairedis#769. The above PRs merged, but their code doesn't get invoked until sonic-net/sonic-buildimage#6185 merges. That is to set up fabric lane mapping, and so to create fabric ports. Because if we merge it, the following tests will fail: test_virtual_chassis.TestVirtualChassis.test_connectivity Why is that? Because now vslib is fed with fabric lane mapping, it will create 16 fabric ports for VOQ asic. Together with 32 front panel ports, VOQ asic will have 32 + 16 = 48 ports (type SAI_OBJECT_TYPE_PORT). But the tests expect only 32 ports only. So they fail. conftest.py:490: in check_swss_ready In this PR's swss tests, I updated the expected number of ports with fabric ports. But since sonic-net/sonic-buildimage#6185 can't merge, this PR also has the above test failures (in the opposite way: it expects 48 ports, but only 32 ports created - no fabric). So the solution is I will disable the above tests, merge this PR, and then merge sonic-net/sonic-buildimage#6185, and then reenable the tests. Does it sound good to you? If yes, I will add change to disable the tests in this PR. |
retest vs please |
This pull request introduces 1 alert and fixes 1 when merging e368411 into cba6576 - view on LGTM.com new alerts:
fixed alerts:
|
OK, so with the latest commit, I just disabled checking fabric port in vs test, instead of disabling all failing tests. Voq code is still under testing. When sonic-net/sonic-buildimage#6185 merges, I will just make another PR that changes |
This pull request introduces 1 alert and fixes 1 when merging cbff2cb into ee7a735 - view on LGTM.com new alerts:
fixed alerts:
|
/azpw run |
Signed-off-by: ngocdo <[email protected]>
This pull request introduces 1 alert and fixes 1 when merging 03d9e26 into ee7a735 - view on LGTM.com new alerts:
fixed alerts:
|
Hi @lguohan - could you please review again, comments taken care off. Thanks |
This pull request introduces 1 alert and fixes 1 when merging f054f8b into e29d566 - view on LGTM.com new alerts:
fixed alerts:
|
retest vs please |
Hi @abdosi - could you please initiate azp run. |
This pull request introduces 1 alert and fixes 1 when merging 38ddc7a into 73ffd5f - view on LGTM.com new alerts:
fixed alerts:
|
@lguohan I just resolved conflict one more time, and the tests got run again. All passed. Could you please approach it again? I think it's great if we can merge this PR asap. This PR recently keeps getting conflicts (twice last week). |
have you done this or will do ? |
I will do this. Thanks for merging it up. |
…7629) This PR is actually #6185 that was merged and then reverted to avoid test failure in swss. Now sonic-net/sonic-swss#1459 merged, and so this PR could merge to enable fabric test in swss. Signed-off-by: ngocdo <[email protected]>
…onic-net#7629) This PR is actually sonic-net#6185 that was merged and then reverted to avoid test failure in swss. Now sonic-net/sonic-swss#1459 merged, and so this PR could merge to enable fabric test in swss. Signed-off-by: ngocdo <[email protected]>
This code is to add support for fabric asics and NPU with fabric ports enabled. What I did Create FabricOrchDaemon for fabric asics, instead of using OrchDaemon which is used for NPU. Create FabricPortsOrch to manage fabric ports. It collects information about port state, peer switch id and peer lane (stored in STATE_DB), sets up port stats and queue states. In future, it will also be used to enable/disable erroneous fabric ports. Fabric port and queue stats are setup to be collected via FlexCounters.
This code is to add support for fabric asics and NPU with fabric ports enabled.
What I did
Create FabricOrchDaemon for fabric asics, instead of using OrchDaemon which is used for NPU.
Create FabricPortsOrch to manage fabric ports. It collects information about port state, peer switch id and peer lane (stored in STATE_DB), sets up port stats and queue states. In future, it will also be used to enable/disable erroneous fabric ports.
Fabric port and queue stats are setup to be collected via FlexCounters.
One thing to note here: The code is waiting for #1431 to start fabric flow for fabric asic/NPU with enabled fabric ports. So right now, there is no code that creates objects for FabricOrchDaemon/FabricPortsOrch yet. We need to know asic type (and if asic needs fabric ports to be enabled) to do that. These information should come from orchagent/main.cpp, and will be available after #1431 merges.
Why I did it
Initialize fabric asics and support traffic across multiple NPUs.
How I verified it
Tested in systems with fabric asics (BCM 88790) and NPU with fabric ports (BCM 88690).
Details if related
This code will require to update a few other things:
(1) syncd: to ask FlexCounter to collect fabric counters (sonic-net/sonic-sairedis#669)
(2) docker-orchagent: disable orchagent-portsyncd dependency. In this pull request, after fabric asic is initialized (by SAI), all fabric ports will be enabled by default. So portsyncd is not used for fabric asic at all, and orchagent should not wait for portsyncd (sonic-net/sonic-buildimage#5569)
This pull request doesn't include:
The implementation is based on the design doc proposed in sonic-net/SONiC#668.