
[Dynamic buffer calculation][Mellanox] Enhance the logic to identify buffer pools and profiles #2498

Merged — 2 commits merged into sonic-net:master from handle-irregular-pool-name on Nov 2, 2022

Conversation

@stephenxs (Collaborator) commented on Oct 27, 2022

What I did

Originally, it was assumed that the names of all buffer pools follow the community buffer pool naming convention ({ingress|egress}_{lossless|lossy}_pool), and the heuristic algorithm that identifies buffer pools and profiles was designed on that assumption.
However, some users define buffer pool names differently, which breaks the logic in the Lua plugin and introduces regressions:

  • the pool sizes of those pools cannot be generated
  • the additional reserved memory for lossy PGs cannot be calculated

In this PR, the logic is improved to tolerate such cases.
Signed-off-by: Stephen Sun [email protected]
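
For context, the convention covers four pool names: ingress_lossless_pool, ingress_lossy_pool, egress_lossless_pool, and egress_lossy_pool. Below is a minimal sketch of a BUFFER_POOL table that breaks the name-based heuristic, shown as Python dicts for illustration; the custom pool name is hypothetical and the entries are abbreviated.

    # "my_custom_pool" is valid configuration but does not match the
    # {ingress|egress}_{lossless|lossy}_pool pattern, so name-based matching
    # misses it; its "type" field still identifies the direction reliably.
    BUFFER_POOL = {
        "ingress_lossless_pool": {"type": "ingress", "mode": "dynamic"},
        "egress_lossy_pool": {"type": "egress", "mode": "dynamic"},
        "my_custom_pool": {"type": "ingress", "mode": "dynamic"},
    }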

Why I did it

How I verified it

Verified manually and by regression tests.
The change is covered by the existing regression and vs tests; no new test case is required.

Details if related

  1. Separate the buffer pools into two tables according to their direction (ingress/egress).
  2. Iterate over all profiles, deriving and recording the type (lossless/lossy) of each ingress profile, identified by checking the pool it references.
  3. Identify buffer profiles for lossy PGs by checking the type derived in step 2 instead of the hardcoded name ingress_lossy_profile. (A sketch of this flow follows the list.)
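
The real logic lives in the Lua buffer plugin; the following is a minimal Python sketch of the three steps, reusing the BUFFER_POOL example above. The field names and the lossless test (presence of an xoff field) are simplifying assumptions, not the plugin's exact code.

    def classify_lossy_pg_profiles(buffer_pools, buffer_profiles):
        # Step 1: split pools by direction using the "type" field instead of
        # matching the pool names.
        ingress_pools = {name for name, pool in buffer_pools.items()
                         if pool["type"] == "ingress"}

        # Step 2: derive each ingress profile's type (lossless/lossy) from the
        # pool it references, never from the profile's own name.
        profile_type = {}
        for name, profile in buffer_profiles.items():
            if profile["pool"] in ingress_pools:
                profile_type[name] = "lossless" if "xoff" in profile else "lossy"

        # Step 3: pick lossy-PG profiles by the derived type rather than by
        # the hardcoded name "ingress_lossy_profile".
        return {name for name, kind in profile_type.items() if kind == "lossy"}

    # A profile referencing the custom-named ingress pool is still classified.
    profiles = {
        "pg_lossless_100000_5m_profile": {"pool": "ingress_lossless_pool", "xoff": "19456"},
        "my_custom_lossy_profile": {"pool": "my_custom_pool"},
    }
    assert classify_lossy_pg_profiles(BUFFER_POOL, profiles) == {"my_custom_lossy_profile"}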

@stephenxs (Collaborator, Author)

/azpw run

@mssonicbld (Collaborator)

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs stephenxs changed the title [Dynamic buffer calculation][Mellanox] Enhance the heuristic algorithm to identify lossy profiles [Dynamic buffer calculation][Mellanox] Enhance the logic to identify lossy profiles Oct 28, 2022
@stephenxs stephenxs changed the title [Dynamic buffer calculation][Mellanox] Enhance the logic to identify lossy profiles [Dynamic buffer calculation][Mellanox] Enhance the logic to identify buffer pools and profiles if their names do not follow the convention {ingress|egress}_{lossless|lossy}_pool Oct 28, 2022
@stephenxs (Collaborator, Author)

/azpw run

@mssonicbld (Collaborator)

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs stephenxs marked this pull request as ready for review October 28, 2022 08:05
@stephenxs stephenxs requested a review from neethajohn as a code owner October 28, 2022 08:05
@stephenxs (Collaborator, Author)

/azpw run

@mssonicbld (Collaborator)

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs (Collaborator, Author)

The build failed on the following issue, which is not related to this PR.
I won't retrigger it for now, until the issue has been fixed.

2022-10-30T13:18:11.5274265Z =================================== FAILURES ===================================
2022-10-30T13:18:11.5279148Z _______________________ TestVlan.test_VlanMemberLinkDown _______________________
2022-10-30T13:18:11.5279607Z 
2022-10-30T13:18:11.5280393Z self = <test_vlan.TestVlan object at 0x7fec66a8ed00>
2022-10-30T13:18:11.5281267Z dvs = <conftest.DockerVirtualSwitch object at 0x7fec66cdcd30>
2022-10-30T13:18:11.5281638Z 
2022-10-30T13:18:11.5282799Z     def test_VlanMemberLinkDown(self, dvs):
2022-10-30T13:18:11.5283336Z     
2022-10-30T13:18:11.5284227Z         # TODO: add_ip_address has a dependency on cdb within dvs,
2022-10-30T13:18:11.5285660Z         # so we still need to setup the db. This should be refactored.
2022-10-30T13:18:11.5286232Z         dvs.setup_db()
2022-10-30T13:18:11.5286606Z     
2022-10-30T13:18:11.5286987Z         vlan = "1000"
2022-10-30T13:18:11.5287419Z         vlan_ip = "192.168.0.1/21"
2022-10-30T13:18:11.5287869Z         interface = "Ethernet0"
2022-10-30T13:18:11.5288337Z         vlan_interface = "Vlan%s" % vlan
2022-10-30T13:18:11.5288802Z         server_ip = "192.168.0.100"
2022-10-30T13:18:11.5296684Z         vlan_intf_sysctl_param_path = "/proc/sys/net/ipv4/conf/%s/arp_evict_nocarrier" % vlan_interface
2022-10-30T13:18:11.5297546Z     
2022-10-30T13:18:11.5297994Z         self.dvs_vlan.create_vlan(vlan)
2022-10-30T13:18:11.5298550Z         vlan_oid = self.dvs_vlan.get_and_verify_vlan_ids(1)[0]
2022-10-30T13:18:11.5299325Z >       self.dvs_vlan.verify_vlan(vlan_oid, vlan)
2022-10-30T13:18:11.5299627Z 
2022-10-30T13:18:11.5300047Z test_vlan.py:508: 
2022-10-30T13:18:11.5300567Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2022-10-30T13:18:11.5301057Z 
2022-10-30T13:18:11.5301525Z self = <dvslib.dvs_vlan.DVSVlan object at 0x7fec66d65700>
2022-10-30T13:18:11.5302924Z vlan_oid = 'oid:0x2600000000064e', vlan_id = '1000'
2022-10-30T13:18:11.5304960Z 
2022-10-30T13:18:11.5305439Z     def verify_vlan(self, vlan_oid, vlan_id):
2022-10-30T13:18:11.5306033Z         vlan = self.asic_db.wait_for_entry("ASIC_STATE:SAI_OBJECT_TYPE_VLAN", vlan_oid)
2022-10-30T13:18:11.5313076Z >       assert vlan.get("SAI_VLAN_ATTR_VLAN_ID") == vlan_id
2022-10-30T13:18:11.5313635Z E       AssertionError
2022-10-30T13:18:11.5313748Z 
2022-10-30T13:18:11.5314149Z dvslib/dvs_vlan.py:64: AssertionError
2022-10-30T13:18:11.5315329Z ---- generated xml file: /agent/_work/1/s/tests/test_port_dpb_system_tr.xml ----
2022-10-30T13:18:11.5315901Z ===Flaky Test Report===

@stephenxs stephenxs changed the title [Dynamic buffer calculation][Mellanox] Enhance the logic to identify buffer pools and profiles if their name do not follow the convention {ingress|egress}_{lossless|lossy}_pool [Dynamic buffer calculation][Mellanox] Enhance the logic to identify buffer pools and profiles Oct 31, 2022
@stephenxs (Collaborator, Author)

I see the following error message from the vs switch:

Oct 31 13:51:36.823477 c8cdbe86a896 NOTICE #syncd: :- asyncOnLinkMsg: received RTM_NEWLINK ifname: Vlan1000, ifflags: 0x1003, ifindex: 91
Oct 31 13:51:36.523757 c8cdbe86a896 ERR #orchagent: message repeated 6 times: [ :- removeVlan: Failed to remove ref count 1 VLAN Vlan2]
Oct 31 13:51:36.830128 c8cdbe86a896 NOTICE #orchagent: :- addVlan: Create an empty VLAN Vlan1000 vid:1000 vlan_oid:26000000000650
Oct 31 13:51:36.830167 c8cdbe86a896 ERR #orchagent: :- removeVlan: Failed to remove ref count 1 VLAN Vlan2

In the test case, there is no logic to remove the VLAN interface, so it looks like it should fail consistently. I'm just curious how the test passed previously...

    def test_VlanProxyArp(self, dvs):

        def proxy_arp_enabled():
            rc, proxy_arp_res = dvs.runcmd("cat /proc/sys/net/ipv4/conf/Vlan{}/proxy_arp".format(vlan))
            rc, pvlan_res = dvs.runcmd("cat /proc/sys/net/ipv4/conf/Vlan{}/proxy_arp_pvlan".format(vlan))

            return (proxy_arp_res.strip("\n") == "1" and pvlan_res.strip("\n") == "1", (proxy_arp_res, pvlan_res))

        def proxy_arp_disabled():
            rc, proxy_arp_res = dvs.runcmd("cat /proc/sys/net/ipv4/conf/Vlan{}/proxy_arp".format(vlan))
            rc, pvlan_res = dvs.runcmd("cat /proc/sys/net/ipv4/conf/Vlan{}/proxy_arp_pvlan".format(vlan))

            return (proxy_arp_res.strip("\n") == "0" and pvlan_res.strip("\n") == "0", (proxy_arp_res, pvlan_res))

        vlan = "2"
        self.dvs_vlan.create_vlan(vlan)
        self.dvs_vlan.create_vlan_interface(vlan)
        self.dvs_vlan.set_vlan_intf_property(vlan, "proxy_arp", "enabled")

        wait_for_result(proxy_arp_enabled, PollingConfig(), 'IPv4 proxy_arp or proxy_arp_pvlan not enabled')

        self.dvs_vlan.set_vlan_intf_property(vlan, "proxy_arp", "disabled")

        wait_for_result(proxy_arp_disabled, PollingConfig(), 'IPv4 proxy_arp or proxy_arp_pvlan not disabled')

        self.dvs_vlan.remove_vlan(vlan)
        self.dvs_vlan.get_and_verify_vlan_ids(0)
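
For reference, a sketch of the cleanup the test appears to be missing, assuming DVSVlan provides a remove_vlan_interface counterpart to the create_vlan_interface call above (the helper name is an assumption):

        # Hypothetical fix: remove the VLAN interface before the VLAN itself,
        # so the VLAN's reference count can drop to zero and removeVlan succeeds.
        self.dvs_vlan.remove_vlan_interface(vlan)
        self.dvs_vlan.remove_vlan(vlan)
        self.dvs_vlan.get_and_verify_vlan_ids(0)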

@stephenxs stephenxs requested a review from prsunny as a code owner October 31, 2022 14:41
@stephenxs stephenxs force-pushed the handle-irregular-pool-name branch from b5011e3 to e0ab764 Compare November 1, 2022 01:54
@stephenxs stephenxs requested a review from neethajohn November 1, 2022 01:56
@stephenxs (Collaborator, Author)

Another failure, this one related to warm restart (WR):

2022-11-01T04:48:28.4395128Z =================================== FAILURES ===================================
2022-11-01T04:48:28.4401258Z _____________________ TestPortAutoNeg.test_PortAutoNegWarm _____________________
2022-11-01T04:48:28.4403412Z 
2022-11-01T04:48:28.4406605Z self = <test_port_an.TestPortAutoNeg object at 0x7fc1cb3455b0>
2022-11-01T04:48:28.4412118Z dvs = <conftest.DockerVirtualSwitch object at 0x7fc1cae48cd0>
2022-11-01T04:48:28.4412870Z testlog = <function testlog at 0x7fc1cb5c7790>
2022-11-01T04:48:28.4413235Z 
2022-11-01T04:48:28.4413820Z     def test_PortAutoNegWarm(self, dvs, testlog):
2022-11-01T04:48:28.4414315Z     
2022-11-01T04:48:28.4415168Z         db = swsscommon.DBConnector(0, dvs.redis_sock, 0)
2022-11-01T04:48:28.4417473Z         cdb = swsscommon.DBConnector(4, dvs.redis_sock, 0)
2022-11-01T04:48:28.4418126Z         sdb = swsscommon.DBConnector(6, dvs.redis_sock, 0)
2022-11-01T04:48:28.4418476Z     
2022-11-01T04:48:28.4420376Z         tbl = swsscommon.ProducerStateTable(db, "PORT_TABLE")
2022-11-01T04:48:28.4421195Z         ctbl = swsscommon.Table(cdb, "PORT")
2022-11-01T04:48:28.4423112Z         stbl = swsscommon.Table(sdb, "PORT_TABLE")
2022-11-01T04:48:28.4423428Z     
2022-11-01T04:48:28.4425755Z         # set autoneg = true and speed = 1000
2022-11-01T04:48:28.4427725Z         fvs = swsscommon.FieldValuePairs([("autoneg","on"),
2022-11-01T04:48:28.4429855Z                                           ("adv_speeds", "100,1000"),
2022-11-01T04:48:28.4431956Z                                           ("adv_interface_types", "CR2,CR4")])
2022-11-01T04:48:28.4434307Z         ctbl.set("Ethernet0", fvs)
2022-11-01T04:48:28.4434588Z     
2022-11-01T04:48:28.4434955Z         time.sleep(1)
2022-11-01T04:48:28.4435296Z     
2022-11-01T04:48:28.4437567Z         adb = swsscommon.DBConnector(1, dvs.redis_sock, 0)
2022-11-01T04:48:28.4437905Z     
2022-11-01T04:48:28.4439658Z         atbl = swsscommon.Table(adb, "ASIC_STATE:SAI_OBJECT_TYPE_PORT")
2022-11-01T04:48:28.4441600Z         (status, fvs) = atbl.get(dvs.asicdb.portnamemap["Ethernet0"])
2022-11-01T04:48:28.4442123Z         assert status == True
2022-11-01T04:48:28.4443764Z     
2022-11-01T04:48:28.4445469Z         assert "SAI_PORT_ATTR_AUTO_NEG_MODE" in [fv[0] for fv in fvs]
2022-11-01T04:48:28.4447314Z         assert "SAI_PORT_ATTR_ADVERTISED_SPEED" in [fv[0] for fv in fvs]
2022-11-01T04:48:28.4449778Z         assert "SAI_PORT_ATTR_ADVERTISED_INTERFACE_TYPE" in [fv[0] for fv in fvs]
2022-11-01T04:48:28.4450390Z         for fv in fvs:
2022-11-01T04:48:28.4451821Z             if fv[0] == "SAI_PORT_ATTR_AUTO_NEG_MODE":
2022-11-01T04:48:28.4453678Z                 assert fv[1] == "true"
2022-11-01T04:48:28.4457258Z             elif fv[0] == "SAI_PORT_ATTR_ADVERTISED_SPEED":
2022-11-01T04:48:28.4457698Z                 assert fv[1] == "2:100,1000"
2022-11-01T04:48:28.4459932Z             elif fv[0] == "SAI_PORT_ATTR_ADVERTISED_INTERFACE_TYPE":
2022-11-01T04:48:28.4462172Z                 assert fv[1] == "2:SAI_PORT_INTERFACE_TYPE_CR2,SAI_PORT_INTERFACE_TYPE_CR4"
2022-11-01T04:48:28.4462705Z     
2022-11-01T04:48:28.4462990Z         # set admin up
2022-11-01T04:48:28.4465041Z         cfvs = swsscommon.FieldValuePairs([("admin_status", "up")])
2022-11-01T04:48:28.4467367Z         ctbl.set("Ethernet0", cfvs)
2022-11-01T04:48:28.4467648Z     
2022-11-01T04:48:28.4467976Z     
2022-11-01T04:48:28.4468344Z         dvs.warm_restart_swss("true")
2022-11-01T04:48:28.4468713Z     
2022-11-01T04:48:28.4470891Z         # freeze orchagent for warm restart
2022-11-01T04:48:28.4473306Z         (exitcode, result) = dvs.runcmd("/usr/bin/orchagent_restart_check")
2022-11-01T04:48:28.4475602Z >       assert result == "RESTARTCHECK succeeded\n"
2022-11-01T04:48:28.4478953Z E       AssertionError: assert 'RESTARTCHECK...unction 984\n' == 'RESTARTCHECK succeeded\n'
2022-11-01T04:48:28.4481323Z E         + RESTARTCHECK succeeded
2022-11-01T04:48:28.4483705Z E         - RESTARTCHECK failed
2022-11-01T04:48:28.4486535Z E         - profiling:/__w/1/s/common/.libs/libswsscommon_la-events.gcda:Merge mismatch for function 984
2022-11-01T04:48:28.4486900Z 
2022-11-01T04:48:28.4590728Z test_port_an.py:262: AssertionError
2022-11-01T04:48:28.4591777Z ----------------------------- Captured stdout call -----------------------------

@stephenxs (Collaborator, Author)

/azpw run

@mssonicbld (Collaborator)

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@stephenxs (Collaborator, Author)

Test issue fixed by #2504

@stephenxs (Collaborator, Author) commented on Nov 2, 2022

test_PortAutoNegWarm failed again; retriggering. The failure is not reproduced locally.

@stephenxs (Collaborator, Author)

/azpw run

@mssonicbld (Collaborator)

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@neethajohn neethajohn merged commit 84642f3 into sonic-net:master Nov 2, 2022
@stephenxs stephenxs deleted the handle-irregular-pool-name branch November 2, 2022 23:55
yxieca pushed a commit that referenced this pull request Nov 3, 2022
…buffer pools and profiles (#2498)

Signed-off-by: Stephen Sun <[email protected]>
