bgpd: allow batch handling of peer shutdown/failure #17505

Open
wants to merge 5 commits into base: master

mjstapp (Contributor) commented Nov 25, 2024

When a peer connection fails or is closed, bgp does its cleanup processing on a per-peer basis. At scale this becomes a problem: bgp can be forced to make a complete rib walk for each peer that needs cleanup. This PR makes peer error-handling more visible at the bgp object level, and then adds a batching path for the case where multiple peers need cleanup/clearing processing at the same time.

  • Replace the per-peer connection error with a per-bgp event and a list. The io pthread enqueues peers per-bgp-instance, and the error-handling code can process multiple peers if there have been multiple failures.
  • When peer connections encounter errors, attempt to batch some of the clearing processing that occurs. Add a new batch object and, if possible, add multiple peers to it. Do one rib walk for the batch, rather than one walk per peer. Use a handler callback per batch to check and remove peers' path-infos, rather than a work-queue and callback per peer. The original clearing code remains; it's used for single peers.
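
To illustrate the batching idea in the second bullet, here is a minimal sketch, not the code from this PR. It assumes a flat array in place of the rib, and every name in it (`struct path`, `struct clear_batch`, `batch_add_peer`, `batch_clear_walk`, `MAX_BATCH`) is hypothetical. The point it shows is that the table is walked once for the whole batch, and each entry is checked against every peer in the batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BATCH 8 /* hypothetical batch capacity */

/* Hypothetical stand-in for a path-info entry; not an FRR struct. */
struct path {
	int peer_id;  /* peer that contributed this path */
	bool removed; /* set once the path has been cleared */
};

/* Hypothetical batch object: the set of peers that need clearing. */
struct clear_batch {
	int peer_ids[MAX_BATCH];
	size_t count;
};

/* Add a failed peer to the batch; returns false when the batch is full. */
static bool batch_add_peer(struct clear_batch *b, int peer_id)
{
	assert(b != NULL);
	if (b->count >= MAX_BATCH)
		return false;
	b->peer_ids[b->count++] = peer_id;
	return true;
}

/* Check whether a peer is part of the batch. */
static bool batch_has_peer(const struct clear_batch *b, int peer_id)
{
	for (size_t i = 0; i < b->count; i++)
		if (b->peer_ids[i] == peer_id)
			return true;
	return false;
}

/*
 * One pass over the table for the whole batch, instead of one pass
 * per peer. Returns the number of paths cleared by this walk.
 */
static size_t batch_clear_walk(const struct clear_batch *b,
			       struct path *table, size_t n)
{
	size_t cleared = 0;

	for (size_t i = 0; i < n; i++) {
		if (!table[i].removed &&
		    batch_has_peer(b, table[i].peer_id)) {
			table[i].removed = true;
			cleared++;
		}
	}
	return cleared;
}
```

With k failed peers and n table entries, a per-peer approach visits the table k times, while this sketch visits it once and does a small membership check per entry. A real implementation would likely use a faster membership test than the linear scan shown here.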

Mark Stapp added 2 commits November 25, 2024 14:13
Replace the per-peer connection error with a per-bgp event and
a list. The io pthread enqueues peers per-bgp-instance, and the
error-handling code can process multiple peers if there have been
multiple failures.

Signed-off-by: Mark Stapp <[email protected]>
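
As a rough illustration of the per-instance error list described in this commit message, the sketch below shows one way an I/O thread could enqueue failed connections on a per-instance list while a single event drains them together. It is an assumption-laden sketch, not the PR's code: the types and functions (`struct peer_conn`, `struct bgp_inst`, `inst_enqueue_error`, `inst_drain_errors`) are hypothetical, and the event scheduling itself is reduced to a boolean flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical connection object; not the FRR peer_connection struct. */
struct peer_conn {
	int id;
	bool on_err_list;           /* already queued for error handling? */
	struct peer_conn *next_err; /* link in the per-instance error list */
};

/* Hypothetical per-instance state holding the shared error list. */
struct bgp_inst {
	pthread_mutex_t err_mtx;
	struct peer_conn *err_head;
	struct peer_conn *err_tail;
	bool event_scheduled; /* stands in for a pending per-instance event */
};

static void inst_init(struct bgp_inst *bgp)
{
	pthread_mutex_init(&bgp->err_mtx, NULL);
	bgp->err_head = bgp->err_tail = NULL;
	bgp->event_scheduled = false;
}

/*
 * I/O-thread side: queue a failed connection. Returns true only when
 * the caller should schedule the (single) per-instance event.
 */
static bool inst_enqueue_error(struct bgp_inst *bgp, struct peer_conn *c)
{
	bool need_event = false;

	pthread_mutex_lock(&bgp->err_mtx);
	if (!c->on_err_list) {
		c->on_err_list = true;
		c->next_err = NULL;
		if (bgp->err_tail)
			bgp->err_tail->next_err = c;
		else
			bgp->err_head = c;
		bgp->err_tail = c;
		if (!bgp->event_scheduled) {
			bgp->event_scheduled = true;
			need_event = true;
		}
	}
	pthread_mutex_unlock(&bgp->err_mtx);
	return need_event;
}

/*
 * Main-thread side: dequeue up to max failed connections in one go and
 * report their ids, so they can be handled together.
 */
static size_t inst_drain_errors(struct bgp_inst *bgp, int *ids, size_t max)
{
	size_t n = 0;

	pthread_mutex_lock(&bgp->err_mtx);
	while (bgp->err_head && n < max) {
		struct peer_conn *c = bgp->err_head;

		bgp->err_head = c->next_err;
		if (!bgp->err_head)
			bgp->err_tail = NULL;
		c->next_err = NULL;
		c->on_err_list = false;
		ids[n++] = c->id;
	}
	/* Keep the event pending only if entries remain queued. */
	bgp->event_scheduled = (bgp->err_head != NULL);
	pthread_mutex_unlock(&bgp->err_mtx);
	return n;
}
```

The design choice sketched here is that several failures arriving close together trigger one event rather than one per peer, which is what gives the handler a chance to see multiple peers at once.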
Remove a couple of APIs that don't exist.

Signed-off-by: Mark Stapp <[email protected]>
frrbot (bot) added the labels bgp, tests (Topotests, make check, etc), and zebra on Nov 25, 2024
ton31337 (Member) left a comment

Very nice improvement ahead!

Member

Can we switch to frr.conf (unified config)?

Mark Stapp added 3 commits November 26, 2024 08:18
When peer connections encounter errors, attempt to batch some
of the clearing processing that occurs. Add a new batch object,
add multiple peers to it, if possible. Do one rib walk for the
batch, rather than one walk per peer. Use a handler callback
per batch to check and remove peers' path-infos, rather than
a work-queue and callback per peer. The original clearing code
remains; it's used for single peers.

Signed-off-by: Mark Stapp <[email protected]>
Move the peer connection error list to the peer_connection
struct; that seems to line up better with the way that struct
works.

Signed-off-by: Mark Stapp <[email protected]>
Add a simple topotest using multiple bgp peers; based on the
ecmp_topo1 test.

Signed-off-by: Mark Stapp <[email protected]>
mjstapp (Contributor, Author) commented Nov 26, 2024

Pushed to try to clean up the build problem

riw777 (Member) left a comment

Looks good ... waiting on @ton31337's one comment


3 participants