EVPN L2VNI/L3VNI Optimize inline Global walk for remote route installations #17526

Draft · wants to merge 6 commits into base: master
Conversation

@raja-rajasekar (Contributor) commented Nov 26, 2024

The following are cases where the situation is the reverse, i.e. BGP is very busy processing the first ZAPI message from zebra, due to which the buffer grows huge in zebra and memory spikes up. Below are a few example triggers:

Interface Up/Down events

  • a bulk of L2VNIs are flapped, i.e. a flap of the br_default interface, or
  • a bulk of L3VNIs are flapped, i.e. a flap of the br_l3vni interface

Anytime BGP gets an L2 VNI ADD or an L3 VNI ADD/DEL from zebra,

  • Walking the entire global routing table per L2VNI is very expensive.
  • The next read (say of another VNI ADD) from the socket does not proceed unless this walk is complete.

So for triggers where a bulk of L2VNIs/L3VNIs are flapped, this results in huge output buffer FIFO growth, spiking up the memory in zebra, since bgp is slow/busy processing the first message.

To avoid this, the idea is to:

  • hook the VPN off the bgp_master struct and maintain a VPN FIFO list which is processed later on, where we walk a chunk of VPNs and do the remote route install.
  • hook the BGP-VRF off the struct bgp_master and maintain a struct bgp FIFO list which is processed later on, where we walk a chunk of BGP-VRFs and do the remote route install/uninstall.

Note: So far in the L3 backpressure cases (#15524), we have considered the fact that zebra is slow, and the buffer grows in BGP.

However, this is the reverse, i.e. BGP is very busy processing the first ZAPI message from zebra, due to which the buffer grows huge in zebra and memory spikes up.

raja-rajasekar and others added 3 commits November 26, 2024 14:07
Currently, zebra-bgp communication via zapi for L2/L3 VNI allocates
more memory than required.

So, on a system where BGP is slow/busy and zebra is quick, triggers
such as ADD L2/L3 VNI can result in huge buffer growth in zebra
thereby spiking up the memory because a VNI ADD/DEL operation includes
 - Expensive Walk of the entire global routing table per L2/L3 VNI.
 - The next read (say of another VNI ADD/DEL) from the socket does
   not proceed unless the current walk is complete.

This bigger stream allocation accounts for a portion of that memory spike.

Fix is to reduce the stream allocation size to a reasonable value when
zebra informs BGP about local EVPN L2/L3 VNI Addition or Deletion.

Note:
- Future commits will optimize the inline global routing table walk for
  triggers where bigger set of VNIs flap (Ex: br_default/br_vni flap).
- Currently, focus is only on communication between zebra and bgp
  for L2/L3 VNI add/del. Need to evaluate this for other zapi msgs.

Ticket: #3864372

Signed-off-by: Donald Sharp [email protected]

Signed-off-by: Rajasekar Raja <[email protected]>
Adds a msg list for getting strings mapping to enum bgp_evpn_route_type

Ticket: #3318830

Signed-off-by: Trey Aspelund <[email protected]>
- For L2vni, struct bgp_master holds a type safe list of all the
  VNIs(struct bgpevpn) that needs to be processed.
- For L3vni, struct bgp_master holds a type safe list of all the
  BGP_VRFs(struct bgp) that needs to be processed.

Future commits will use this.

Ticket: #3864372

Signed-off-by: Rajasekar Raja <[email protected]>
@raja-rajasekar (Contributor, Author) commented:

For "bgpd: Suppress redundant L3VNI delete processing"

Instrumented logs without fix: ifdown br_l3vni

    74 2024/11/26 22:28:00.324443 ZEBRA: [Z9WYD-4ERFV] RAJA DOWN zebra_vxlan_svi_down
    75 2024/11/26 22:28:00.324450 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4004 VRF vrf4 to bgp

 17507 2024/11/26 22:28:00.876116 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for vxlan99(147)
 17508 2024/11/26 22:28:00.876197 ZEBRA: [WJRZ7-WE7M5] RTM_NEWLINK update for vxlan99(147) sl_type 0 master 0
 17509 2024/11/26 22:28:00.876203 ZEBRA: [PPSYY-6KJJP] Intf vxlan99(147) PTM up, notifying clients
 17510 2024/11/26 22:28:00.876314 ZEBRA: [W7XYW-5FTP2] RAJA999 in if_up
….
 17986 2024/11/26 22:28:00.886309 ZEBRA: [Y4TDE-84YR0] Update L3-VNI 4004 intf vxlan99(147) VLAN 2668 local IP 2.1.1.6 master 0 chg 0x2
 17987 2024/11/26 22:28:00.886311 ZEBRA: [XEW89-KXF0P] RAJA-DOWN 1 zebra_vxlan_if_update_vni for vni 4004
 17988 2024/11/26 22:28:00.886312 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4004 VRF vrf4 to bgp

Instrumented logs without fix: ifup br_l3vni

 359 2024/11/26 22:29:25.427495 ZEBRA: [M37B1-HQHSP] RTM_NEWVLAN for ifindex 147 NS 0, enqueuing for zebra main

 362 2024/11/26 22:29:25.427423 ZEBRA: [K8FXY-V65ZJ] Intf dplane ctx 0x7fc4d0027e10, op INTF_INSTALL, ifindex (147), result QUEUED
 363 2024/11/26 22:29:25.427428 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for vxlan99(147)
 364 2024/11/26 22:29:25.427465 ZEBRA: [TQR2A-H2RFY] Vlan-Vni(671:671-4005:4005) update for VxLAN IF vxlan99(147)
 365 2024/11/26 22:29:25.427483 ZEBRA: [QZ4F6-8EX79] zebra_vxlan_if_add_update_vni vxlan vxlan99 vni (4005, 671) not present in bridge table
 366 2024/11/26 22:29:25.427486 ZEBRA: [PWSYZ-A537X] zebra_evpn_acc_vl_new access vlan 671 bridge br_l3vni add
 367 2024/11/26 22:29:25.427569 ZEBRA: [Y4TDE-84YR0] Update L3-VNI 4005 intf vxlan99(147) VLAN 671 local IP 2.1.1.6 master 670 chg 0x4
 368 2024/11/26 22:29:25.427573 ZEBRA: [Z0ADA-V8CT4] RAJA-DOWN 3 zebra_vxlan_if_update_vni
 369 2024/11/26 22:29:25.427580 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4005 VRF vrf5 to bgp
….
 386 2024/11/26 22:29:25.428382 ZEBRA: [WVRMN-YEC5Q] Del L3-VNI 4001 intf vxlan99(147)
 387 2024/11/26 22:29:25.428384 ZEBRA: [WDB17-CBPCZ] RAJA DOWNzebra_vxlan_if_del_vni
 388 2024/11/26 22:29:25.428387 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4001 VRF vrf1 to bgp

With Fix: ifdown br_l3vni

   668 2024/11/26 19:18:26.344063 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf5 VNI 4005
   669 2024/11/26 19:18:26.344069 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4005 is already deleted
   670 2024/11/26 19:18:26.344092 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
   671 2024/11/26 19:18:26.344093 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4001 is already deleted
   672 2024/11/26 19:18:26.344114 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf4 VNI 4004
   673 2024/11/26 19:18:26.344115 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4004 is already deleted
   674 2024/11/26 19:18:26.344135 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf3 VNI 4003
   675 2024/11/26 19:18:26.344136 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4003 is already deleted
   676 2024/11/26 19:18:26.344157 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf2 VNI 4002
   677 2024/11/26 19:18:26.344158 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4002 is already deleted
….
   688 2024/11/26 19:18:26.344517 BGP: [XXJ7P-NWW2X] Rx L3VNI ADD VRF vrf3 VNI 4003 Originator-IP 2.1.1.6 RMAC svi-mac 1c:34:da:23:4f:fd vrr-mac 1c:34:da:23:4f:fd filter none svi-if 5517

With Fix: ifup br_l3vni

 8546 2024/11/26 19:26:23.400423 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
  8547 2024/11/26 19:26:23.400435 BGP: [T0MP2-YRTMX] Scheduling L3VNI DEL to be processed later for VRF vrf1 VNI 4001
  8548 2024/11/26 19:26:23.402722 BGP: [GFHWV-99P7C] Rx Intf down VRF vrf1 IF vlan2501_l3
  8549 2024/11/26 19:26:23.404025 BGP: [G49HN-S8M77] Rx Intf address del VRF vrf1 IF vlan2501_l3 addr fe80::1e34:daff:fe23:4ffd/64
  8550 2024/11/26 19:26:23.404397 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf2 VNI 4002
  8551 2024/11/26 19:26:23.404401 BGP: [T0MP2-YRTMX] Scheduling L3VNI DEL to be processed later for VRF vrf2 VNI 4002

140642 2024/11/26 19:26:26.165399 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
140643 2024/11/26 19:26:26.165410 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4001 is already deleted
…..
145672 2024/11/26 19:26:26.236828 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf4 VNI 4004
145673 2024/11/26 19:26:26.236836 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4004 is already deleted

@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch 2 times, most recently from b186ca3 to 6b68753 on November 27, 2024 06:53
@ton31337 (Member) left a comment:

TBH, I don't understand the goal of 58e8563. Why not bundle together with the real changes?

if (!(is_evpn_prefix_ipaddr_v4(evp)
|| is_evpn_prefix_ipaddr_v6(evp)))
/* Proceed only for MAC_IP/IP-Pfx routes */
switch (evp->prefix.route_type) {
Member:

If we move to switch, then make sure we cover all enum values (including BGP_EVPN_AD_ROUTE, etc.).

bgp_evpn_local_l3vni_del_post_processing(bgp_to_proc);

UNSET_FLAG(bgp_to_proc->flags, BGP_FLAG_L3VNI_SCHEDULE_FOR_INSTALL);
UNSET_FLAG(bgp_to_proc->flags, BGP_FLAG_L3VNI_SCHEDULE_FOR_DELETE);
Member:

Shouldn't we do this inside bgp_evpn_local_l3vni_del_post_processing()?

evp->prefix.route_type != BGP_EVPN_MAC_IP_ROUTE)
/* Proceed only for IMET/AD/MAC_IP routes */
switch (evp->prefix.route_type) {
case BGP_EVPN_IMET_ROUTE:
Member:

Same as with L3VNIs, we must cover all enum values.

@@ -26,6 +26,7 @@ extern "C" {

/* EVPN route types. */
typedef enum {
BGP_EVPN_UNKN_ROUTE = 0,
Member:

0 is defined as Reserved in RFC 7432, I think we should use something like -1? And in what case it can be unknown?

@@ -47,6 +47,11 @@ typedef uint16_t zebra_size_t;
#define ZEBRA_MAX_PACKET_SIZ 16384U
#define ZEBRA_SMALL_PACKET_SIZE 200U

/* Only for L2/L3 VNI Add/Del */
#define ZEBRA_VNI_MAX_PACKET_SIZE 80U
Member:

How are these values derived?

Anytime BGP gets an L2 VNI ADD from zebra,
 - Walking the entire global routing table per L2VNI is very expensive.
 - The next read (say of another VNI ADD) from the socket does
   not proceed unless this walk is complete.

So for triggers where a bulk of L2VNIs are flapped, this results in
huge output buffer FIFO growth, spiking up the memory in zebra, since
bgp is slow/busy processing the first message.

To avoid this, the idea is to hook the VPN off the bgp_master struct
and maintain a VPN FIFO list which is processed later on, where we
walk a chunk of VPNs and do the remote route install.

Note: So far in the L3 backpressure cases (FRRouting#15524), we have
considered the fact that zebra is slow, and the buffer grows in BGP.

However, this is the reverse, i.e. BGP is very busy processing the
first ZAPI message from zebra, due to which the buffer grows huge in
zebra and memory spikes up.

Ticket: #3864372

Signed-off-by: Rajasekar Raja <[email protected]>
Anytime BGP gets an L3 VNI ADD/DEL from zebra,
 - Walking the entire global routing table per L3VNI is very expensive.
 - The next read (say of another VNI ADD/DEL) from the socket does
   not proceed unless this walk is complete.

So for triggers where a bulk of L3VNIs are flapped, this results in
huge output buffer FIFO growth, spiking up the memory in zebra, since
bgp is slow/busy processing the first message.

To avoid this, the idea is to hook the BGP-VRF off the struct
bgp_master and maintain a struct bgp FIFO list which is processed
later on, where we walk a chunk of BGP-VRFs and do the remote route
install/uninstall.

Ticket: #3864372

Signed-off-by: Rajasekar Raja <[email protected]>
Consider a master bridge interface (br_l3vni) having a slave vxlan99
mapped to vlans used by 3 L3VNIs.

During ifdown of the br_l3vni interface, the function
zebra_vxlan_process_l3vni_oper_down(), where zebra sends a ZAPI
message to bgp to delete the L3VNI, is invoked twice:
 1) if_down -> zebra_vxlan_svi_down()
 2) VXLAN is unlinked from the bridge, i.e. vxlan99:
    zebra_if_dplane_ifp_handling() --> zebra_vxlan_if_update_vni()
    (since the ZEBRA_VXLIF_MASTER_CHANGE flag is set)

During ifup of the br_l3vni interface, the function
zebra_vxlan_process_l3vni_oper_down() is invoked because of the
access-vlan change: process oper down, associate with the new svi_if,
and then process oper up again.

The problem here is that the redundant ZAPI L3VNI delete message
results in BGP doing an inline global table walk for remote route
installation when the L3VNI is already removed/deleted. The bigger
the scale, the more CPU utilization is wasted.

Given that a bridge flap is not a common trigger, the idea is to
simply return from BGP if the L3VNI is already set to 0, i.e.
if the L3VNI is already deleted, do nothing and return.

NOTE/TBD: An ideal fix would be to make zebra not send the second
L3VNI delete ZAPI message. However, that is a much more involved
change in day-1 code, with corner cases to handle.

Ticket: #3864372

Signed-off-by: Rajasekar Raja <[email protected]>
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch from 6b68753 to 2cfe7bd on November 27, 2024 07:06
@raja-rajasekar raja-rajasekar marked this pull request as draft November 27, 2024 07:11