Refactor and rethink the coordinator/monitor implementation #1744

ph · 2022-08-15T15:22:13Z

When an elastic policy is dispatched to Fleet Server, one of the Fleet Server instances is elected and takes control of some of the management tasks, such as deciding when to unenroll an agent after a last-seen timeout. I've fixed a race condition discovered in #1738. While going through the code, I found it hard to read and to really understand the flow of events and when goroutines are created and removed.

We need to evaluate whether we will have to add logic to this area of the code. If so, we should really invest some time in refactoring it. Here are a few things to consider changing:

We create a goroutine per agent policy. As the number of policies is low this is perfectly fine, but I think that logic could be handled by a single cleanup event loop in a single goroutine.
We expose internal fields of the monitorT object in the test suite. We should hide all access to internal fields behind accessors, even if they are only used for testing. This allows a single locking logic.
Looking at the code, it looks possible to move the usage of multiple internal fields into a watcher struct that would encapsulate more logic.
Internal state of the monitor bleeds into the goroutine execution, which makes it harder to lock or prevent concurrent access to the resource. Encapsulating that logic into its own object would make it simpler to test and verify.

kpollich · 2022-10-17T12:31:20Z

This issue is a little out of my wheelhouse in terms of expertise, but it seems like a very nuanced technical debt issue. I think the best path forward is to keep this near the top of the backlog and look to take this on during feature freeze or ON week in the near future.

@michel-laterman - let me know if you have other thoughts here. Curious about how mission critical this refactor might be or if this would solve any major issues we have with the Fleet Server codebase today.

joshdover · 2023-05-23T15:45:22Z

We ran into the coordinator again in #2606

I think it will make sense to remove it at some point.

ph added the Team:Elastic-Agent-Control-Plane label Aug 15, 2022

ph mentioned this issue Aug 15, 2022

Protect access to policiesCanceller using a mutext #1739

Merged

1 task

jen-huang added the Team:Fleet label and removed the Team:Elastic-Agent-Control-Plane label Sep 12, 2022

jlind23 assigned kpollich Sep 26, 2022

kpollich removed their assignment Oct 18, 2022

michel-laterman self-assigned this Dec 18, 2023

michel-laterman mentioned this issue Dec 18, 2023

Remove the coordinator #3131

Merged

5 tasks

michel-laterman mentioned this issue Apr 16, 2024

policy monitor improvements #3470

Open

michel-laterman closed this as completed in #3131 Jul 4, 2024
