Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

can't remove next hop groups due to reference counting #5813

Closed
daall opened this issue Nov 4, 2020 · 0 comments · Fixed by sonic-net/sonic-swss#1501
Closed

can't remove next hop groups due to reference counting #5813

daall opened this issue Nov 4, 2020 · 0 comments · Fixed by sonic-net/sonic-swss#1501
Assignees
Labels
Master Branch Quality P1 Priority of the issue, lower than P0

Comments

@daall
Copy link
Contributor

daall commented Nov 4, 2020

Description
As reported in #5758, there is an issue w/ reference counting for next hop groups in orchagent. It looks like during route changes we are often hitting this issue where we can't delete next hop groups because there are still objects referencing them.

Steps to reproduce the issue:

  1. Add and remove a large volume of routes (the route_perf test is good for this)

Describe the results you received:

Oct 30 16:21:50.210939 str-dx010-acs-4 ERR swss#orchagent: :- meta_generic_validation_remove: object 0x5000000000686 reference count is 2, can't remove
Oct 30 16:21:50.210939 str-dx010-acs-4 ERR swss#orchagent: :- removeNextHopGroup: Failed to remove next hop group 5000000000686, rv:-17

Describe the results you expected:
We should be able to add/remove routes w/o issues.

Output of show version:

admin@sonic:~$ show ver

SONiC Software Version: SONiC.master.469-84d3a260
Distribution: Debian 10.6
Kernel: 4.19.0-9-2-amd64
Build commit: 84d3a260
Build date: Wed Nov  4 00:55:31 UTC 2020
Built by: johnar@jenkins-worker-8

Docker images:
REPOSITORY                    TAG                   IMAGE ID            SIZE
docker-snmp                   latest                e21c75cf3e75        427MB
docker-snmp                   master.469-84d3a260   e21c75cf3e75        427MB
docker-teamd                  latest                93c913c5a017        425MB
docker-teamd                  master.469-84d3a260   93c913c5a017        425MB
docker-sonic-mgmt-framework   latest                244b796dd024        551MB
docker-sonic-mgmt-framework   master.469-84d3a260   244b796dd024        551MB
docker-router-advertiser      latest                c31265cbdcd2        390MB
docker-router-advertiser      master.469-84d3a260   c31265cbdcd2        390MB
docker-platform-monitor       latest                106e63b358ad        503MB
docker-platform-monitor       master.469-84d3a260   106e63b358ad        503MB
docker-lldp                   latest                9ba8b14bb844        456MB
docker-lldp                   master.469-84d3a260   9ba8b14bb844        456MB
docker-dhcp-relay             latest                d48a5f726cd7        397MB
docker-dhcp-relay             master.469-84d3a260   d48a5f726cd7        397MB
docker-database               latest                9af7507f9821        390MB
docker-database               master.469-84d3a260   9af7507f9821        390MB
docker-orchagent              latest                6276adebac5b        439MB
docker-orchagent              master.469-84d3a260   6276adebac5b        439MB
docker-nat                    latest                0a1ee7e92192        428MB
docker-nat                    master.469-84d3a260   0a1ee7e92192        428MB
docker-sonic-telemetry        latest                e387512c848f        459MB
docker-sonic-telemetry        master.469-84d3a260   e387512c848f        459MB
docker-fpm-frr                latest                0a87f992c6a5        441MB
docker-fpm-frr                master.469-84d3a260   0a87f992c6a5        441MB
docker-sflow                  latest                c5edda64f062        429MB
docker-sflow                  master.469-84d3a260   c5edda64f062        429MB
docker-syncd-brcm             latest                bac1974ef6a7        527MB
docker-syncd-brcm             master.469-84d3a260   bac1974ef6a7        527MB
@daall daall added P1 Priority of the issue, lower than P0 Master Branch Quality labels Nov 4, 2020
shi-su added a commit to sonic-net/sonic-swss that referenced this issue Nov 16, 2020
What I did
Remove next-hop groups after updating the reference counter for a bulk of routes instead of removing next-hop groups in the loop of updating the reference counter.
Fix sonic-net/sonic-buildimage#5813

Why I did it
The bulk route API has two loops of updating the reference counter:

1. update the sairedis reference counter
2. update the orchagent reference counter

Before this commit, the removal of next-hop groups is triggered in the second loop of updating the orchagent reference counter when the reference counter decreases to zero. This may result in a reference counter mismatch between orchagent and sairedis since the sairedis reference counter has already included the operation of the whole bulk but the orchagent reference counter has not. Therefore, the removal of next-hop group may fail due to the mismatch in reference counter (e.g., there are some other routes point to the next-hop group but has not been counted in orchagent yet).

To fix this problem, the next-hop group removal operation should be done after updating the reference counter of the whole bulk to make sure the reference counters sairedis and orchagent matches.
daall pushed a commit to daall/sonic-swss that referenced this issue Dec 7, 2020
…ic-net#1501)

What I did
Remove next-hop groups after updating the reference counter for a bulk of routes instead of removing next-hop groups in the loop of updating the reference counter.
Fix sonic-net/sonic-buildimage#5813

Why I did it
The bulk route API has two loops of updating the reference counter:

1. update the sairedis reference counter
2. update the orchagent reference counter

Before this commit, the removal of next-hop groups is triggered in the second loop of updating the orchagent reference counter when the reference counter decreases to zero. This may result in a reference counter mismatch between orchagent and sairedis since the sairedis reference counter has already included the operation of the whole bulk but the orchagent reference counter has not. Therefore, the removal of next-hop group may fail due to the mismatch in reference counter (e.g., there are some other routes point to the next-hop group but has not been counted in orchagent yet).

To fix this problem, the next-hop group removal operation should be done after updating the reference counter of the whole bulk to make sure the reference counters sairedis and orchagent matches.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Master Branch Quality P1 Priority of the issue, lower than P0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants