You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Adaptation unit tests fail, ever since c4893c7. The reason is that while that commit fixed a locking inversion bug, it now opened a time window for a plugin to lose events during registration after it has received the initial set of pods and containers but before it is has been fully registered for event processing.
Synchronizing the plugin and adding it to the slice of registered plugins both used to happen with Adaptation.Mutex held. Now that mutex is held only after synchronization is done while updating the slice of plugins. As a consequence to the adaptation test suite while Suite.WaitForPluginsToSync() used to return only once the plugin was fully registered (IOW, added to slice of plugins), now it returns before that. Therefore, it became possible for a pod/container event to come when a plugin is halfway through its registration: already synchronized with existing pods/containers, but not added yet to the slice of plugins. When that happens, the plugin loses any event that comes before it has been added to the slice of plugins.
The effect is not limited to unit tests, but it becomes apparent as unit tests banging full steam registering plugins and generating pod/container events now trigger this race condition reliably. Similar NRI pod/container event loss is also possible now in the runtimes.
The text was updated successfully, but these errors were encountered:
Here is a proposed fix. It requires additional co-operation from the NRI integration code on the runtime side. The runtime should request blocking of plugin registration during event processing.
klihub
changed the title
adpatation: unit test failures.
adpatation: plugin registration/sync race, unit test failures.
Jul 19, 2024
after it has received the initial set of pods and containers but before it is has been fully registered for event processing.
It seems to me that the order should really be the opposite: a plugin should be fully registered for events (and start receiving immediately), and then synchronize to see any pods/containers that existed prior to subscription to the event stream. This would require some additional logic to handle deduplication (since some pods/containers may be received both event-based during the initial registration and during that sync), but would not require any locking or pausing of events or registration.
Adaptation unit tests fail, ever since c4893c7. The reason is that while that commit fixed a locking inversion bug, it now opened a time window for a plugin to lose events during registration after it has received the initial set of pods and containers but before it is has been fully registered for event processing.
Synchronizing the plugin and adding it to the slice of registered plugins both used to happen with
Adaptation.Mutex
held. Now that mutex is held only after synchronization is done while updating the slice of plugins. As a consequence to the adaptation test suite whileSuite.WaitForPluginsToSync()
used to return only once the plugin was fully registered (IOW, added to slice of plugins), now it returns before that. Therefore, it became possible for a pod/container event to come when a plugin is halfway through its registration: already synchronized with existing pods/containers, but not added yet to the slice of plugins. When that happens, the plugin loses any event that comes before it has been added to the slice of plugins.The effect is not limited to unit tests, but it becomes apparent as unit tests banging full steam registering plugins and generating pod/container events now trigger this race condition reliably. Similar NRI pod/container event loss is also possible now in the runtimes.
The text was updated successfully, but these errors were encountered: