Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue #5157 by identifying the dependency among objects and avoiding releasing an object still being referenced #1440

Merged
merged 3 commits into from
Oct 13, 2020

Conversation

stephenxs
Copy link
Collaborator

What I did
The issue is caused by the orchagent receiving notification in a different order in which they were sent.
orchagent doesn't have any dependency check and try notifying sai-redis to release an object which is still being referenced,
which causes sai-redis to complain and the object leaks.

The idea is to introduce a mechanism to identify the dependency thus preventing a referenced object from being released.

Signed-off-by: Stephen Sun [email protected]

Why I did it

Details if related

  1. Introduce a new type representing the dependency among the variant type of objects, including the following fields:

    • m_objsDependingOnMe, a set representing the objects that reference the current object.
    • m_objsReferencingByMe, a map from a field of the current object's to the object name it references.

    eg. given the following dependencies,

    • BUFFER_PROFILE.ingress_lossless_profile references BUFFER_POOL.ingress_lossless_pool,
    • BUFFER_PROFILE.pg_lossless_25000_5m_profile references BUFFER_POOL.ingress_lossless_pool
      the objects should be like this:
obj(BUFFER_POOL.ingress_lossless_pool).m_objsDependingOnMe = [
        BUFFER_PROFILE.ingress_lossless_profile,
        BUFFER_PROFILE.pg_lossless_25000_5m_profile
]
obj(BUFFER_PROFILE.pg_lossless_25000_5m_profile).m_objsReferencingByMe[pool] = BUFFER_POOL.ingress_lossless_pool
obj(BUFFER_PROFILE.ingress_lossless_profile).m_objsReferencingByMe[pool] = BUFFER_POOL.ingress_lossless_pool
  1. When a field of an object A

    • is referencing B now
    • is going to be updated with referencing another object B',

    We will have the following steps:

    • obj[A.m_objsReferencingByMe[field name]].m_objsDependingOnMe.remove(A)
      where obj[A.m_objsReferencingByMe[field name]] should be B
    • A.m_objsReferencingByMe[field name] = B'
    • B'.m_objsDependingOnMe.add(A)
  2. When an object A is about to be removed,

    • if A.m_objsDependingOnMe isn't empty set, return task_need_retry
    • else execute the normal remove flow, especially if the A is referencing someone else:
      • For each field in A.m_objsReferencingByMe: obj[A.m_objsReferencingByMe[field]].m_objsDependingOnMe.remove(A)

Reference count vs. set
When it comes to recording the references, the reference count approach is the solution a lot of developers come up with.
A frequently asked question is why the set is used to record dependency instead of the reference count.
So I'd like to compare these two approaches.

  • for reference count
    • pro
      • small memory footprint
      • able to trace a large number of objects depending on the current object
    • con
      • need an extra flag for each referencing objects, indicating whether the reference count has been increased/decreased for the object
      • misoperate reference count leads to fatal error
  • for the set of dependent objects
    • pro
      • able to provide more information to the user regarding the dependency, which is very helpful when debugging an error
      • no extra flag required. as the insert/erase operations are idempotent, misoperation on reference count is less likely to lead to fatal error
    • con
      • heavy memory footprint

As we don't have a big number of objects to trace and want to provide more detailed info, I prefer the set approach.

How I verified it

  1. remove objects with non-zero reference.
    to create a buffer pool, a buffer profile and two buffer pgs in order and then remove the profile and then the pool and at last the two PGs. in this case, the profile and pool can't be removed because of non-zero referencing.
    • create the buffer pool by using the command hset BUFFER_POOL|ingress_test_pool type ingress mode dynamic size 14542848. the log is like following:
      NOTICE swss#orchagent: :- processBufferPool: Created buffer pool ingress_test_pool with type BUFFER_POOL
      
    • create the buffer profile referencing the pool by using the command hset BUFFER_PROFILE|ingress_test_profile dynamic_th 1 pool [BUFFER_POOL|ingress_test_pool] size 0.
      NOTICE swss#orchagent: :- processBufferProfile: Created buffer profile ingress_test_profile with type BUFFER_PROFILE
      INFO swss#orchagent: :- setObjectReference: Obj BUFFER_PROFILE.ingress_test_profile Field pool: Add reference to BUFFER_POOL ingress_test_pool (now 1)
      
    • create a buffer PG referencing the profile by using the command hset BUFFER_PG|Ethernet8|1 profile [BUFFER_PROFILE|ingress_test_profile].
      NOTICE swss#orchagent: :- processPriorityGroup: Set buffer PG Ethernet8|1 to BUFFER_PROFILE:ingress_test_profile
      INFO swss#orchagent: :- setObjectReference: Obj BUFFER_PG.Ethernet8|1 Field profile: Add reference to BUFFER_PROFILE ingress_test_profile (now 1)
      
    • create the other buffer PG by using hset BUFFER_PG|Ethernet16|2 profile [BUFFER_PROFILE|ingress_test_profile]
      NOTICE swss#orchagent: :- processPriorityGroup: Set buffer PG Ethernet16|2 to BUFFER_PROFILE:ingress_test_profile
      INFO swss#orchagent: :- setObjectReference: Obj BUFFER_PG.Ethernet16|2 Field profile: Add reference to BUFFER_PROFILE ingress_test_profile (now 2)
      
    • remove the buffer profile by using the command hdel BUFFER_PROFILE|ingress_test_profile dynamic_th pool size. It can't be removed because of dependency.
      NOTICE swss#orchagent: :- processBufferProfile: Can't remove object ingress_test_profile due to being referenced (BUFFER_PROFILE ingress_test_profile first object: Ethernet16|2 reference count: 2)
      INFO swss#orchagent: :- doTask: Failed to process buffer task, retry it
      
    • remove the buffer pool by using the command hdel BUFFER_POOL|ingress_test_pool type mode size. It can't be removed because of dependency.
      NOTICE swss#orchagent: :- processBufferProfile: Can't remove object ingress_test_profile due to being referenced (BUFFER_PROFILE ingress_test_profile first object: Ethernet16|2 reference count: 2)
      INFO swss#orchagent: :- doTask: Failed to process buffer task, retry it
      
    • remove one buffer PG by using the command hdel BUFFER_PG|Ethernet8|1 profile. the reference count of buffer profile is reduced to 1.
      NOTICE swss#orchagent: :- processPriorityGroup: Remove buffer PG Ethernet8|1
      INFO swss#orchagent: :- removeMeFromObjsReferencedByMe: Obj BUFFER_PG.Ethernet8|1 Field profile: Remove reference to BUFFER_PROFILE ingress_test_profile (now 1)
      INFO swss#orchagent: :- removeObject: Obj BUFFER_PG:Ethernet8|1 is removed from store
      
    • remove the other buffer PG by using hdel BUFFER_PG|Ethernet16|2 profile. Now the reference of the profile is decreased to 0 and then the pool. Eventually, all objects are removed.
      NOTICE swss#orchagent: :- processPriorityGroup: Remove buffer PG Ethernet16|2
      INFO swss#orchagent: :- removeMeFromObjsReferencedByMe: Obj BUFFER_PG.Ethernet16|2 Field profile: Remove reference to BUFFER_PROFILE ingress_test_profile (now 0)
      INFO swss#orchagent: :- removeObject: Obj BUFFER_PG:Ethernet16|2 is removed from store
      ERR swss#orchagent: :- processPriorityGroup: PG profile 'Ethernet16|2' applied after port Ethernet16 is up
      NOTICE swss#orchagent: :- processBufferPool: Can't remove object ingress_test_pool due to being referenced (BUFFER_POOL ingress_test_pool first object: ingress_test_profile reference count: 1)
      INFO swss#orchagent: :- doTask: Failed to process buffer task, retry it
      NOTICE swss#orchagent: :- processBufferProfile: Remove buffer profile ingress_test_profile with type BUFFER_PROFILE
      INFO swss#orchagent: :- removeMeFromObjsReferencedByMe: Obj BUFFER_PROFILE.ingress_test_profile Field pool: Remove reference to BUFFER_POOL ingress_test_pool (now 0)
      INFO swss#orchagent: :- removeObject: Obj BUFFER_PROFILE:ingress_test_profile is removed from store
      NOTICE swss#orchagent: :- processBufferPool: Removed buffer pool ingress_test_pool with type BUFFER_POOL
      
  2. remove the object with zero-reference.
    create the same objects and destroy buffer PGs first and then buffer profiles and at last buffer pools. In this case the objects are destroyed in order.
    • create the buffer pool: hset BUFFER_POOL|ingress_test_pool type ingress mode dynamic size 14542848
      NOTICE swss#orchagent: :- processBufferPool: Created buffer pool ingress_test_pool with type BUFFER_POOL
      
    • create the buffer profile: hset BUFFER_PROFILE|ingress_test_profile dynamic_th 1 pool [BUFFER_POOL|ingress_test_pool] size 0
      NOTICE swss#orchagent: :- processBufferProfile: Created buffer profile ingress_test_profile with type BUFFER_PROFILE
      INFO swss#orchagent: :- setObjectReference: Obj BUFFER_PROFILE.ingress_test_profile Field pool: Add reference to BUFFER_POOL ingress_test_pool (now 1)
      
    • create buffer PGs: hset BUFFER_PG|Ethernet8|1 profile [BUFFER_PROFILE|ingress_test_profile] and hset BUFFER_PG|Ethernet16|2 profile [BUFFER_PROFILE|ingress_test_profile]
      NOTICE swss#orchagent: :- processPriorityGroup: Set buffer PG Ethernet8|1 to BUFFER_PROFILE:ingress_test_profile
      INFO swss#orchagent: :- setObjectReference: Obj BUFFER_PG.Ethernet8|1 Field profile: Add reference to BUFFER_PROFILE ingress_test_profile (now 1)
      NOTICE swss#orchagent: :- processPriorityGroup: Set buffer PG Ethernet16|2 to BUFFER_PROFILE:ingress_test_profile
      INFO swss#orchagent: :- setObjectReference: Obj BUFFER_PG.Ethernet16|2 Field profile: Add reference to BUFFER_PROFILE ingress_test_profile (now 2)
      
    • destroy the buffer PGs: hdel BUFFER_PG|Ethernet16|2 profile and hdel BUFFER_PG|Ethernet8|1 profile
      NOTICE swss#orchagent: :- processPriorityGroup: Remove buffer PG Ethernet16|2
      INFO swss#orchagent: :- removeMeFromObjsReferencedByMe: Obj BUFFER_PG.Ethernet16|2 Field profile: Remove reference to BUFFER_PROFILE ingress_test_profile (now 1)
      INFO swss#orchagent: :- removeObject: Obj BUFFER_PG:Ethernet16|2 is removed from store
      NOTICE swss#orchagent: :- processPriorityGroup: Remove buffer PG Ethernet8|1
      INFO swss#orchagent: :- removeMeFromObjsReferencedByMe: Obj BUFFER_PG.Ethernet8|1 Field profile: Remove reference to BUFFER_PROFILE ingress_test_profile (now 0)
      INFO swss#orchagent: :- removeObject: Obj BUFFER_PG:Ethernet8|1 is removed from store
      
    • destroy the buffer profile: hdel BUFFER_PROFILE|ingress_test_profile dynamic_th pool size
      NOTICE swss#orchagent: :- processBufferProfile: Remove buffer profile ingress_test_profile with type BUFFER_PROFILE
      INFO swss#orchagent: :- removeMeFromObjsReferencedByMe: Obj BUFFER_PROFILE.ingress_test_profile Field pool: Remove reference to BUFFER_POOL ingress_test_pool (now 0)
      INFO swss#orchagent: :- removeObject: Obj BUFFER_PROFILE:ingress_test_profile is removed from store
      
    • destroy the buffer pool: hdel BUFFER_POOL|ingress_test_pool type mode size
      NOTICE swss#orchagent: :- processBufferPool: Removed buffer pool ingress_test_pool with type BUFFER_POOL
      

…ing releasing an object still being referenced

The issue is caused by the OA receives notification in an different order in which they were sent.
OA doesn't have any dependency check try notifying sai-redis to release an object which is still being referenced,
which causes sai-redis complain and the object leaks.

The idea is to introduce a mechanism to identify the dependency thus preventing a referenced object from being released.

1. Introduce a new type representing the dependency among variant type of objects, including the following fields:
  - m_objsDependingOnMe, a set representing the objects that references the current object.
    eg. BUFFER_PROFILE.ingress_lossless_profile references BUFFER_POOL.ingress_lossless_pool
  - m_objsReferencingByMe, a map from a field of the current object's to the object name it references.
2. When a field of an object A has been updated with referencing another object B,
  - obj[A.m_objsReferencingByMe[field name]].m_objsDependingOnMe.remove(A)
  - A.m_objsReferencingByMe[field name] = B
3. When a an object A is about to be removed,
  - if obj.m_objsDependingOnMe isn't empty set, return task_need_retry else execute the normal remove flow.

Signed-off-by: Stephen Sun <[email protected]>
@lgtm-com
Copy link

lgtm-com bot commented Sep 17, 2020

This pull request introduces 1 alert when merging a0bcda7 into 5b0f7be - view on LGTM.com

new alerts:

  • 1 for Declaration hides parameter

@stephenxs
Copy link
Collaborator Author

retest vs, please

Signed-off-by: Stephen Sun <[email protected]>
@stephenxs
Copy link
Collaborator Author

Hi @neethajohn ,
All the comments have been fixed/replied to. Could you please approve the PR?
Thanks.

@kcudnik
Copy link
Contributor

kcudnik commented Jan 25, 2021

@stephenxs can you fix this:

orch.cpp: In member function 'ref_resolve_status Orch::resolveFieldRefArray(type_map&, const string&, swss::KeyOpFieldsValuesTuple&, std::vector<long unsigned int>&, std::string&)':
orch.cpp:609:41: error: 'strlen' argument missing terminating nul [-Werror=stringop-overflow=]
  609 |                     object_name_list += string(&list_item_delimiter);
      |                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from orch.cpp:6:
orch.h:25:12: note: referenced argument declared here
   25 | const char list_item_delimiter = ',';
      |            ^~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors

i get this error on master with all enabled errors, im not sure why this don't pop up on jenkins build, my gcc: gcc version 9.1.0 (GCC)

@stephenxs
Copy link
Collaborator Author

Hi @kcudnik ,
Looks like I'm not able to install gcc-9 on buster docker. Could you please provide steps in which you installed it?
I also searched this error and got this: GCC 9: special_values_formatter.hpp:43:16: error: 'strlen' argument missing terminating nul #29. According to this page, it's a bug of boost while compiling by gcc9 and can be suppressed by using -Wno-stringop-overflow which disables this warning.
Can you also check whether this works for you?
Thanks

@kcudnik
Copy link
Contributor

kcudnik commented Jan 27, 2021

ok, can you add then this option co configure?

EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
As per latest update in DPB DOC, fixed this bug

previously we had string value in "breakout_modes" key so it was not matching the whole string, But after the update via, now "breakout_modes" contain a dictionary where key is the breakout_mode and value is the alias. So we can easily check whether the key is present or not.

Signed-off-by: Sangita Maity <[email protected]>
Co-authored-by: Guohan Lu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants