EVPN L2VNI/L3VNI Optimize inline Global walk for remote route installations #17526

Draft · wants to merge 6 commits into base: master
Conversation

@raja-rajasekar (Contributor) commented Nov 26, 2024

The following are cases where the situation is the reverse, i.e. BGP is very busy processing the first ZAPI message from zebra, due to which the buffer grows huge in zebra and memory spikes up. Below are a few example triggers:

Interface Up/Down events

  • a bulk of L2VNIs are flapped, i.e. a flap of the br_default interface, or
  • a bulk of L3VNIs are flapped, i.e. a flap of the br_l3vni interface

Anytime BGP gets an L2 VNI ADD or an L3 VNI ADD/DEL from zebra,

  • Walking the entire global routing table per L2VNI is very expensive.
  • The next read (say of another VNI ADD) from the socket does not proceed unless this walk is complete.

So for triggers where a bulk of L2VNIs/L3VNIs are flapped, this results in huge output buffer FIFO growth, spiking up the memory in zebra, since bgp is slow/busy processing the first message.

To avoid this, the idea is to:

  • hook the VPN off the bgp_master struct and maintain a VPN FIFO list which is processed later on, where we walk a chunk of VPNs and do the remote route install.
  • hook the BGP-VRF off the struct bgp_master and maintain a struct bgp FIFO list which is processed later on, where we walk a chunk of BGP-VRFs and do the remote route install/uninstall.

Note: So far in the L3 backpressure cases (#15524), we have considered the fact that zebra is slow, and the buffer grows in BGP.

However, this is the reverse, i.e. BGP is very busy processing the first ZAPI message from zebra, due to which the buffer grows huge in zebra and memory spikes up.

raja-rajasekar and others added 3 commits November 26, 2024 14:07
Currently, zebra-bgp communication via zapi for L2/L3 VNI allocates
more memory than required.

So, on a system where BGP is slow/busy and zebra is quick, triggers
such as ADD L2/L3 VNI can result in huge buffer growth in zebra
thereby spiking up the memory because a VNI ADD/DEL operation includes
 - Expensive Walk of the entire global routing table per L2/L3 VNI.
 - The next read (say of another VNI ADD/DEL) from the socket does
   not proceed unless the current walk is complete.

This bigger stream allocation accounts for a portion of that memory spike.

Fix is to reduce the stream allocation size to a reasonable value when
zebra informs BGP about local EVPN L2/L3 VNI Addition or Deletion.

Note:
- Future commits will optimize the inline global routing table walk for
  triggers where bigger set of VNIs flap (Ex: br_default/br_vni flap).
- Currently, focus is only on communication between zebra and bgp
  for L2/L3 VNI add/del. Need to evaluate this for other zapi msgs.

Ticket: #3864372

Signed-off-by: Donald Sharp [email protected]

Signed-off-by: Rajasekar Raja <[email protected]>
Adds a msg list for getting strings mapping to enum bgp_evpn_route_type

Ticket: #3318830

Signed-off-by: Trey Aspelund <[email protected]>
- For L2vni, struct bgp_master holds a type safe list of all the
  VNIs(struct bgpevpn) that needs to be processed.
- For L3vni, struct bgp_master holds a type safe list of all the
  BGP_VRFs(struct bgp) that needs to be processed.

Future commits will use this.

Ticket: #3864372

Signed-off-by: Rajasekar Raja <[email protected]>
@raja-rajasekar (Contributor, Author) commented:

For "bgpd: Suppress redundant L3VNI delete processing"

Instrumented logs without fix: ifdown br_l3vni

    74 2024/11/26 22:28:00.324443 ZEBRA: [Z9WYD-4ERFV] RAJA DOWN zebra_vxlan_svi_down
    75 2024/11/26 22:28:00.324450 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4004 VRF vrf4 to bgp

 17507 2024/11/26 22:28:00.876116 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for vxlan99(147)
 17508 2024/11/26 22:28:00.876197 ZEBRA: [WJRZ7-WE7M5] RTM_NEWLINK update for vxlan99(147) sl_type 0 master 0
 17509 2024/11/26 22:28:00.876203 ZEBRA: [PPSYY-6KJJP] Intf vxlan99(147) PTM up, notifying clients
 17510 2024/11/26 22:28:00.876314 ZEBRA: [W7XYW-5FTP2] RAJA999 in if_up
….
 17986 2024/11/26 22:28:00.886309 ZEBRA: [Y4TDE-84YR0] Update L3-VNI 4004 intf vxlan99(147) VLAN 2668 local IP 2.1.1.6 master 0 chg 0x2
 17987 2024/11/26 22:28:00.886311 ZEBRA: [XEW89-KXF0P] RAJA-DOWN 1 zebra_vxlan_if_update_vni for vni 4004
 17988 2024/11/26 22:28:00.886312 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4004 VRF vrf4 to bgp

Instrumented logs without fix: ifup br_l3vni

 359 2024/11/26 22:29:25.427495 ZEBRA: [M37B1-HQHSP] RTM_NEWVLAN for ifindex 147 NS 0, enqueuing for zebra main

 362 2024/11/26 22:29:25.427423 ZEBRA: [K8FXY-V65ZJ] Intf dplane ctx 0x7fc4d0027e10, op INTF_INSTALL, ifindex (147), result QUEUED
 363 2024/11/26 22:29:25.427428 ZEBRA: [NVFT0-HS1EX] INTF_INSTALL for vxlan99(147)
 364 2024/11/26 22:29:25.427465 ZEBRA: [TQR2A-H2RFY] Vlan-Vni(671:671-4005:4005) update for VxLAN IF vxlan99(147)
 365 2024/11/26 22:29:25.427483 ZEBRA: [QZ4F6-8EX79] zebra_vxlan_if_add_update_vni vxlan vxlan99 vni (4005, 671) not present in bridge table
 366 2024/11/26 22:29:25.427486 ZEBRA: [PWSYZ-A537X] zebra_evpn_acc_vl_new access vlan 671 bridge br_l3vni add
 367 2024/11/26 22:29:25.427569 ZEBRA: [Y4TDE-84YR0] Update L3-VNI 4005 intf vxlan99(147) VLAN 671 local IP 2.1.1.6 master 670 chg 0x4
 368 2024/11/26 22:29:25.427573 ZEBRA: [Z0ADA-V8CT4] RAJA-DOWN 3 zebra_vxlan_if_update_vni
 369 2024/11/26 22:29:25.427580 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4005 VRF vrf5 to bgp
….
 386 2024/11/26 22:29:25.428382 ZEBRA: [WVRMN-YEC5Q] Del L3-VNI 4001 intf vxlan99(147)
 387 2024/11/26 22:29:25.428384 ZEBRA: [WDB17-CBPCZ] RAJA DOWNzebra_vxlan_if_del_vni
 388 2024/11/26 22:29:25.428387 ZEBRA: [R43YF-2MKZ3] Send L3VNI DEL 4001 VRF vrf1 to bgp

With Fix: ifdown br_l3vni

   668 2024/11/26 19:18:26.344063 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf5 VNI 4005
   669 2024/11/26 19:18:26.344069 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4005 is already deleted
   670 2024/11/26 19:18:26.344092 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
   671 2024/11/26 19:18:26.344093 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4001 is already deleted
   672 2024/11/26 19:18:26.344114 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf4 VNI 4004
   673 2024/11/26 19:18:26.344115 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4004 is already deleted
   674 2024/11/26 19:18:26.344135 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf3 VNI 4003
   675 2024/11/26 19:18:26.344136 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4003 is already deleted
   676 2024/11/26 19:18:26.344157 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf2 VNI 4002
   677 2024/11/26 19:18:26.344158 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4002 is already deleted
….
   688 2024/11/26 19:18:26.344517 BGP: [XXJ7P-NWW2X] Rx L3VNI ADD VRF vrf3 VNI 4003 Originator-IP 2.1.1.6 RMAC svi-mac 1c:34:da:23:4f:fd vrr-mac 1c:34:da:23:4f:fd filter none svi-if 5517

With Fix: ifup br_l3vni

 8546 2024/11/26 19:26:23.400423 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
  8547 2024/11/26 19:26:23.400435 BGP: [T0MP2-YRTMX] Scheduling L3VNI DEL to be processed later for VRF vrf1 VNI 4001
  8548 2024/11/26 19:26:23.402722 BGP: [GFHWV-99P7C] Rx Intf down VRF vrf1 IF vlan2501_l3
  8549 2024/11/26 19:26:23.404025 BGP: [G49HN-S8M77] Rx Intf address del VRF vrf1 IF vlan2501_l3 addr fe80::1e34:daff:fe23:4ffd/64
  8550 2024/11/26 19:26:23.404397 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf2 VNI 4002
  8551 2024/11/26 19:26:23.404401 BGP: [T0MP2-YRTMX] Scheduling L3VNI DEL to be processed later for VRF vrf2 VNI 4002

140642 2024/11/26 19:26:26.165399 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf1 VNI 4001
140643 2024/11/26 19:26:26.165410 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4001 is already deleted
…..
145672 2024/11/26 19:26:26.236828 BGP: [KHJBD-5KFZX] Rx L3VNI DEL VRF vrf4 VNI 4004
145673 2024/11/26 19:26:26.236836 BGP: [NC0CR-BC1N3] Returning from bgp_evpn_local_l3vni_del since VNI 4004 is already deleted

@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch 2 times, most recently from b186ca3 to 6b68753 on November 27, 2024 06:53
@ton31337 (Member) left a comment:

TBH, I don't understand the goal of 58e8563. Why not bundle together with the real changes?

if (!(is_evpn_prefix_ipaddr_v4(evp)
|| is_evpn_prefix_ipaddr_v6(evp)))
/* Proceed only for MAC_IP/IP-Pfx routes */
switch (evp->prefix.route_type) {
Member:

If we move to switch, then make sure we cover all enum values (including BGP_EVPN_AD_ROUTE, etc.).

bgp_evpn_local_l3vni_del_post_processing(bgp_to_proc);

UNSET_FLAG(bgp_to_proc->flags, BGP_FLAG_L3VNI_SCHEDULE_FOR_INSTALL);
UNSET_FLAG(bgp_to_proc->flags, BGP_FLAG_L3VNI_SCHEDULE_FOR_DELETE);
Member:

Shouldn't we do this inside bgp_evpn_local_l3vni_del_post_processing()?

evp->prefix.route_type != BGP_EVPN_MAC_IP_ROUTE)
/* Proceed only for IMET/AD/MAC_IP routes */
switch (evp->prefix.route_type) {
case BGP_EVPN_IMET_ROUTE:
Member:

Same as with L3VNIs, we must cover all enum values.

@@ -26,6 +26,7 @@ extern "C" {

/* EVPN route types. */
typedef enum {
BGP_EVPN_UNKN_ROUTE = 0,
Member:

0 is defined as Reserved in RFC 7432, I think we should use something like -1? And in what case it can be unknown?

@@ -47,6 +47,11 @@ typedef uint16_t zebra_size_t;
#define ZEBRA_MAX_PACKET_SIZ 16384U
#define ZEBRA_SMALL_PACKET_SIZE 200U

/* Only for L2/L3 VNI Add/Del */
#define ZEBRA_VNI_MAX_PACKET_SIZE 80U
Member:

How are these values derived?

Anytime BGP gets an L2 VNI ADD from zebra,
 - Walking the entire global routing table per L2VNI is very expensive.
 - The next read (say of another VNI ADD) from the socket does
   not proceed unless this walk is complete.

So for triggers where a bulk of L2VNIs are flapped, this results in
huge output buffer FIFO growth, spiking up the memory in zebra, since
bgp is slow/busy processing the first message.

To avoid this, the idea is to hook the VPN off the bgp_master struct
and maintain a VPN FIFO list which is processed later on, where we
walk a chunk of VPNs and do the remote route install.

Note: So far in the L3 backpressure cases (FRRouting#15524), we have
considered the fact that zebra is slow, and the buffer grows in BGP.

However, this is the reverse, i.e. BGP is very busy processing the
first ZAPI message from zebra, due to which the buffer grows huge in
zebra and memory spikes up.

Ticket: #3864372

Signed-off-by: Rajasekar Raja <[email protected]>
Anytime BGP gets an L3 VNI ADD/DEL from zebra,
 - Walking the entire global routing table per L3VNI is very expensive.
 - The next read (say of another VNI ADD/DEL) from the socket does
   not proceed unless this walk is complete.

So for triggers where a bulk of L3VNIs are flapped, this results in
huge output buffer FIFO growth, spiking up the memory in zebra, since
bgp is slow/busy processing the first message.

To avoid this, the idea is to hook the BGP-VRF off the struct
bgp_master and maintain a struct bgp FIFO list which is processed
later on, where we walk a chunk of BGP-VRFs and do the remote route
install/uninstall.

Ticket: #3864372

Signed-off-by: Rajasekar Raja <[email protected]>
Consider a master bridge interface (br_l3vni) having a slave vxlan99
mapped to vlans used by 3 L3VNIs.

During ifdown of the br_l3vni interface, the function
zebra_vxlan_process_l3vni_oper_down(), where zebra sends a ZAPI
message to bgp to delete the L3VNI, is invoked twice:
 1) if_down -> zebra_vxlan_svi_down()
 2) VXLAN is unlinked from the bridge, i.e. vxlan99:
    zebra_if_dplane_ifp_handling() --> zebra_vxlan_if_update_vni()
    (since the ZEBRA_VXLIF_MASTER_CHANGE flag is set)

During ifup of the br_l3vni interface, the function
zebra_vxlan_process_l3vni_oper_down() is invoked because of the
access-vlan change: process oper down, associate with the new svi_if,
and then process oper up again.

The problem here is that the redundant ZAPI L3VNI delete message
results in BGP doing an inline global table walk for remote route
installation when the L3VNI is already removed/deleted. The bigger
the scale, the more CPU utilization is wasted.

Given that a bridge flap is not a common trigger, the idea is to
simply return from BGP if the L3VNI is already set to 0, i.e.
if the L3VNI is already deleted, do nothing and return.

NOTE/TBD: An ideal fix would be to make zebra not send the second
L3VNI delete ZAPI message. However, that is a much more involved
change in day-1 code, with corner cases to handle.

Ticket: #3864372

Signed-off-by: Rajasekar Raja <[email protected]>
@raja-rajasekar raja-rajasekar force-pushed the rajasekarr/evpn_bp_and_optimizations_3864372_FINAL_upstream branch from 6b68753 to 2cfe7bd on November 27, 2024 07:06
@raja-rajasekar raja-rajasekar marked this pull request as draft November 27, 2024 07:11