EVPN L2VNI/L3VNI Optimize inline Global walk for remote route installations #17526
base: master
Conversation
Currently, zebra-to-bgp communication via ZAPI for L2/L3 VNIs allocates more memory than required. On a system where BGP is slow/busy and zebra is quick, triggers such as an L2/L3 VNI ADD can cause huge buffer growth in zebra, spiking its memory, because a VNI ADD/DEL operation includes:
- an expensive walk of the entire global routing table per L2/L3 VNI, and
- the next read (say, of another VNI ADD/DEL) from the socket does not proceed until the current walk is complete.

The oversized stream allocation accounts for a portion of that memory spike. The fix is to reduce the stream allocation to a reasonable size when zebra informs BGP about a local EVPN L2/L3 VNI addition or deletion.

Note:
- Future commits will optimize the inline global routing table walk for triggers where a bigger set of VNIs flaps (e.g. br_default/br_vni flap).
- Currently the focus is only on zebra/bgp communication for L2/L3 VNI add/del; this needs to be evaluated for other ZAPI messages.

Ticket: #3864372
Signed-off-by: Donald Sharp [email protected]
Signed-off-by: Rajasekar Raja <[email protected]>
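In spirit, the allocation change is one line. A sketch, assuming the VNI add/del notification is built with FRR's lib stream API (`stream_new()` is FRR's real stream constructor; the call site here is illustrative):

```c
#include "stream.h"	/* FRR lib: stream_new() */

/* Sketch: cap the per-message ZAPI buffer at the VNI-specific bound
 * instead of ZEBRA_MAX_PACKET_SIZ (16384U), so a burst of VNI add/del
 * notifications queued toward a busy bgpd stays small. */
struct stream *s = stream_new(ZEBRA_VNI_MAX_PACKET_SIZE); /* 80U, per the diff below */
```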
Add a msg list for mapping enum bgp_evpn_route_type values to strings.

Ticket: #3318830
Signed-off-by: Trey Aspelund <[email protected]>
- For L2VNI, struct bgp_master holds a type-safe list of all the VNIs (struct bgpevpn) that need to be processed.
- For L3VNI, struct bgp_master holds a type-safe list of all the BGP VRFs (struct bgp) that need to be processed.

Future commits will use this.

Ticket: #3864372
Signed-off-by: Rajasekar Raja <[email protected]>
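A sketch of what such type-safe FIFOs can look like with FRR's lib/typesafe.h (`PREDECL_LIST`/`DECLARE_LIST` are FRR's real single-linked, O(1)-tail list macros; the list and member names here are illustrative, not necessarily the PR's, and the existing struct members are elided):

```c
#include "typesafe.h"

PREDECL_LIST(l2vni_fifo);
PREDECL_LIST(l3vni_fifo);

struct bgpevpn {
	/* ... existing members ... */
	struct l2vni_fifo_item l2vni_itm;	/* linkage into bm's FIFO */
};

struct bgp {
	/* ... existing members ... */
	struct l3vni_fifo_item l3vni_itm;	/* linkage into bm's FIFO */
};

DECLARE_LIST(l2vni_fifo, struct bgpevpn, l2vni_itm);
DECLARE_LIST(l3vni_fifo, struct bgp, l3vni_itm);

struct bgp_master {
	/* ... existing members ... */
	struct l2vni_fifo_head l2vni_head;	/* L2VNIs pending processing */
	struct l3vni_fifo_head l3vni_head;	/* BGP-VRFs pending processing */
};
```

With this shape, the ZAPI handler would only `l2vni_fifo_add_tail()` the affected object and return, leaving the expensive walk to a deferred job that pops entries off the head.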
For "bgpd: Suppress redundant L3VNI delete processing" Instrumented logs without fix: ifdown br_l3vni
Instrumented logs without fix: ifup br_l3vni
With Fix: ifdown br_l3vni
With Fix: ifup br_l3vni
Force-pushed from b186ca3 to 6b68753.
TBH, I don't understand the goal of 58e8563. Why not bundle it together with the real changes?
```diff
-	if (!(is_evpn_prefix_ipaddr_v4(evp)
-	      || is_evpn_prefix_ipaddr_v6(evp)))
+	/* Proceed only for MAC_IP/IP-Pfx routes */
+	switch (evp->prefix.route_type) {
```
If we move to `switch`, then make sure we cover all enum values (including BGP_EVPN_AD_ROUTE, etc.).
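A hedged illustration of the reviewer's point. The enumerators are the RFC 7432/9136 route types as FRR names them, plus the BGP_EVPN_UNKN_ROUTE this PR adds; the `install` flag and the arm bodies are placeholders, not the PR's code:

```c
bool install = false;

/* Covering every bgp_evpn_route_type value (and omitting default:)
 * lets the compiler's -Wswitch flag any enumerator a future change
 * forgets to handle. */
switch (evp->prefix.route_type) {
case BGP_EVPN_MAC_IP_ROUTE:
case BGP_EVPN_IP_PREFIX_ROUTE:
	install = true;		/* MAC-IP / IP-Prefix routes qualify */
	break;
case BGP_EVPN_AD_ROUTE:
case BGP_EVPN_ES_ROUTE:
case BGP_EVPN_IMET_ROUTE:
case BGP_EVPN_UNKN_ROUTE:
	install = false;	/* explicitly excluded from this walk */
	break;
}
```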
```c
bgp_evpn_local_l3vni_del_post_processing(bgp_to_proc);

UNSET_FLAG(bgp_to_proc->flags, BGP_FLAG_L3VNI_SCHEDULE_FOR_INSTALL);
UNSET_FLAG(bgp_to_proc->flags, BGP_FLAG_L3VNI_SCHEDULE_FOR_DELETE);
```
Shouldn't we do this inside bgp_evpn_local_l3vni_del_post_processing()?
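One shape the reviewer's suggestion could take; a sketch, with the function's existing body elided:

```c
static void bgp_evpn_local_l3vni_del_post_processing(struct bgp *bgp_to_proc)
{
	/* ... existing L3VNI delete post-processing ... */

	/* Clearing the scheduling flags here means every caller gets
	 * the same cleanup, instead of repeating the UNSET_FLAGs at
	 * each call site. */
	UNSET_FLAG(bgp_to_proc->flags, BGP_FLAG_L3VNI_SCHEDULE_FOR_INSTALL);
	UNSET_FLAG(bgp_to_proc->flags, BGP_FLAG_L3VNI_SCHEDULE_FOR_DELETE);
}
```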
```diff
-	    evp->prefix.route_type != BGP_EVPN_MAC_IP_ROUTE)
+	/* Proceed only for IMET/AD/MAC_IP routes */
+	switch (evp->prefix.route_type) {
+	case BGP_EVPN_IMET_ROUTE:
```
Same as with L3VNIs, we must cover all enum values.
```diff
@@ -26,6 +26,7 @@ extern "C" {

 /* EVPN route types. */
 typedef enum {
+	BGP_EVPN_UNKN_ROUTE = 0,
```
0 is defined as Reserved in RFC 7432; I think we should use something like -1? And in what case can it be unknown?
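A sketch of the alternative the reviewer is hinting at, assuming the usual RFC 7432/9136 type codes for the remaining enumerators:

```c
typedef enum {
	BGP_EVPN_UNKN_ROUTE = -1,	/* sentinel; 0 is Reserved in RFC 7432 */
	BGP_EVPN_AD_ROUTE = 1,		/* Ethernet Auto-discovery */
	BGP_EVPN_MAC_IP_ROUTE = 2,	/* MAC/IP Advertisement */
	BGP_EVPN_IMET_ROUTE = 3,	/* Inclusive Multicast Ethernet Tag */
	BGP_EVPN_ES_ROUTE = 4,		/* Ethernet Segment */
	BGP_EVPN_IP_PREFIX_ROUTE = 5,	/* IP Prefix (RFC 9136) */
} bgp_evpn_route_type;
```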
```diff
@@ -47,6 +47,11 @@ typedef uint16_t zebra_size_t;
 #define ZEBRA_MAX_PACKET_SIZ 16384U
 #define ZEBRA_SMALL_PACKET_SIZE 200U

+/* Only for L2/L3 VNI Add/Del */
+#define ZEBRA_VNI_MAX_PACKET_SIZE 80U
```
How are these values derived?
Anytime BGP gets an L2 VNI ADD from zebra:
- walking the entire global routing table per L2VNI is very expensive, and
- the next read (say, of another VNI ADD) from the socket does not proceed until this walk is complete.

So for triggers where a bulk of L2VNIs flap, this results in huge output-buffer FIFO growth in zebra, spiking its memory, since bgp is slow/busy processing the first message.

To avoid this, the idea is to hook the VPN off the bgp_master struct and maintain a VPN FIFO list that is processed later, where we walk a chunk of VPNs at a time and do the remote route install (see the sketch after the next commit message).

Note: So far in the L3 backpressure cases (FRRouting#15524), we have considered the case where zebra is slow and the buffer grows in BGP. This is the reverse: BGP is very busy processing the first ZAPI message from zebra, due to which the buffer grows huge in zebra and memory spikes up.

Ticket: #3864372
Signed-off-by: Rajasekar Raja <[email protected]>
Anytime BGP gets an L3 VNI ADD/DEL from zebra:
- walking the entire global routing table per L3VNI is very expensive, and
- the next read (say, of another VNI ADD/DEL) from the socket does not proceed until this walk is complete.

So for triggers where a bulk of L3VNIs flap, this results in huge output-buffer FIFO growth in zebra, spiking its memory, since bgp is slow/busy processing the first message.

To avoid this, the idea is to hook the BGP-VRF off struct bgp_master and maintain a struct bgp FIFO list that is processed later, where we walk a chunk of BGP-VRFs at a time and do the remote route install/uninstall, as sketched below.

Ticket: #3864372
Signed-off-by: Rajasekar Raja <[email protected]>
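The mechanism both commits describe, reduced to a self-contained sketch (plain C, all names illustrative): queue the affected object when the ZAPI message is read, and let a deferred pass drain the queue a bounded chunk at a time, so the socket can be read between passes instead of blocking behind one full-table walk.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK 2 /* entries drained per deferred pass (illustrative) */

struct vni {
	unsigned id;
	struct vni *next;
};

static struct vni *head;
static struct vni **tail = &head; /* simple singly-linked FIFO */

static void enqueue(unsigned id)
{
	struct vni *v = calloc(1, sizeof(*v));

	v->id = id;
	*tail = v;
	tail = &v->next;
}

/*
 * One deferred pass: drain at most CHUNK entries. In the real code
 * this would run from the event loop and reschedule itself while the
 * queue is non-empty, so the zebra socket gets read between passes.
 */
static bool process_chunk(void)
{
	for (int n = 0; n < CHUNK && head; n++) {
		struct vni *v = head;

		head = v->next;
		if (!head)
			tail = &head;
		printf("installing remote routes for VNI %u\n", v->id);
		free(v);
	}
	return head != NULL; /* true => more work, reschedule */
}

int main(void)
{
	/* a "bulk VNI flap": five adds arrive back to back */
	for (unsigned id = 1000; id < 1005; id++)
		enqueue(id);

	while (process_chunk())
		; /* stand-in for repeated event-loop invocations */
	return 0;
}
```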
Consider a master bridge interface (br_l3vni) with a slave vxlan99 mapped to VLANs used by 3 L3VNIs.

During ifdown br_l3vni, zebra_vxlan_process_l3vni_oper_down(), where zebra sends the ZAPI L3VNI delete to bgp, runs twice:
1) if_down -> zebra_vxlan_svi_down()
2) the VXLAN (vxlan99) is unlinked from the bridge: zebra_if_dplane_ifp_handling() -> zebra_vxlan_if_update_vni() (since the ZEBRA_VXLIF_MASTER_CHANGE flag is set)

During ifup br_l3vni, zebra_vxlan_process_l3vni_oper_down() is invoked because of the access-VLAN change: process oper down, associate with the new svi_if, then process oper up again.

The problem is that the redundant L3VNI-delete ZAPI message makes BGP do an inline global-table walk for remote route installation when the L3VNI is already removed/deleted. The bigger the scale, the more CPU is wasted.

Given that bridge flaps are not a common trigger, the idea is to simply return from BGP if the L3VNI is already set to 0, i.e. if the L3VNI is already deleted, do nothing and return (see the guard sketched below).

NOTE/TBD: An ideal fix is for zebra not to send the second L3VNI-delete ZAPI message; however, that is a much more involved, day-1 code path with corner cases to handle.

Ticket: #3864372
Signed-off-by: Rajasekar Raja <[email protected]>
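The suppression itself can be as small as an early return at the top of BGP's L3VNI-delete handler; this is sketched from the commit text, not copied from the diff, and `bgp_vrf` is an assumed local:

```c
/* L3VNI already 0: the first ZAPI delete was processed, so skip
 * the redundant inline global-table walk. */
if (!bgp_vrf->l3vni)
	return 0;
```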
Force-pushed from 6b68753 to 2cfe7bd.
The following are cases where it is the reverse, i.e. BGP is very busy processing the first ZAPI message from zebra, due to which the buffer grows huge in zebra and memory spikes up. A few example triggers:
- Interface up/down events
- Anytime BGP gets an L2 VNI ADD (or) L3 VNI ADD/DEL from zebra

For triggers where a bulk of L2VNIs/L3VNIs flap, this results in huge output-buffer FIFO growth in zebra, spiking its memory, since bgp is slow/busy processing the first message.

To avoid this, the idea is to hook the VNI (struct bgpevpn) / BGP-VRF (struct bgp) off struct bgp_master and maintain FIFO lists that are processed later, walking a chunk of entries at a time to install/uninstall the remote routes.

Note: So far in the L3 backpressure cases (#15524), we have considered the case where zebra is slow and the buffer grows in BGP. Here it is the reverse: BGP is very busy processing the first ZAPI message from zebra, due to which the buffer grows huge in zebra and memory spikes up.