Action status ack reporting is not accurate #2596
Comments
I wrote a script to query all agent ids from action results to find the missing acks, and I didn't find any. Querying all
Result of
One more interesting observation, that policy reassign also shows 3 agents in progress, however in the test logs the same action was reported with 25k acks. I think this proves that the calculation is inconsistent, since we don't delete docs from
I think the root cause of this is the cardinality agg that we use to find unique agent ids in To solve this, there are a few options:
This wouldn't be foolproof because there is a refresh delay between writing a document and being able to search it back. On Serverless, this delay can even be on the order of seconds. So this could still result in writing multiple documents for the same agent if the agent retries or Fleet Server retries the write. Unfortunately, I don't have a great idea for a solution right now. If we used a regular index instead of a data stream, we could write a document with a specific I believe the reason we use a data stream instead is to be able to support ILM of this data and age it out over time. An option I can think of that might work but would probably be slow is to track the agent status on the original action document itself with a The issue is that then this single document becomes a point of contention in the system, leading to a lot of write conflicts, and would definitely slow down acks at scale. |
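One way to read the regular-index idea above is that a retried write should land on the same document instead of creating a new one. The sketch below illustrates that with an id derived from the action and agent pair. It is only an illustration under assumptions: `result_doc_id`, `write_ack`, and `InMemoryIndex` are hypothetical names, the in-memory class stands in for a real index, and none of this is taken from Fleet Server's code.

```python
import hashlib
import json


def result_doc_id(action_id: str, agent_id: str) -> str:
    """Derive a stable document id from the (action, agent) pair (hypothetical helper)."""
    # JSON-encode the pair so ("a:b", "c") and ("a", "b:c") cannot collide.
    key = json.dumps([action_id, agent_id]).encode("utf-8")
    return hashlib.sha256(key).hexdigest()


class InMemoryIndex:
    """In-memory stand-in for an index keyed by document id (not a real client)."""

    def __init__(self) -> None:
        self.docs: dict[str, dict[str, str]] = {}

    def index(self, doc_id: str, doc: dict[str, str]) -> None:
        # Writing with an existing id overwrites instead of adding a document.
        self.docs[doc_id] = doc


def write_ack(index: InMemoryIndex, action_id: str, agent_id: str) -> None:
    """Persist one ack; a retry for the same pair replaces the earlier document."""
    doc = {"action_id": action_id, "agent_id": agent_id}
    index.index(result_doc_id(action_id, agent_id), doc)
```

With this shape, a retry from the agent or from Fleet Server does not depend on the earlier write being searchable yet, which is the refresh-delay concern raised above.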
I was thinking about this too, though keeping thousands of action result docs in ES might not be the best idea.
Yeah, I think keeping 100k+ ids in a field in the actions doc would not be scalable. I had another idea, we could use the
I think this only tracks delivery of the action, not completion of it?
I think you are right, so we don't have anything on the agent to indicate what the latest acked action was? We could introduce a new counter for this.
IMO the source of truth for acknowledged actions needs to be the Fleet actions indices, not what the agent thinks has been acknowledged. If anything we should be giving the agent the most recently persisted state for outstanding actions when it checks in. This way the agent knows when it can stop trying to send acknowledgements based on what is actually stored in ES. This would eliminate the class of problem where persisting the acknowledgement fails but Fleet Server does not return an error to the agent.
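A minimal sketch of that reconciliation, assuming the check-in response carried the set of action ids whose acks are already stored in ES. The function name and the shape of the data are hypothetical; the actual check-in protocol is not described in this thread.

```python
def acks_to_retry(pending: set[str], persisted: set[str]) -> set[str]:
    """Return the action ids the agent still needs to acknowledge.

    pending:   action ids the agent has completed and tried to ack.
    persisted: action ids whose acks the server reports as stored (assumed input).
    """
    # Anything the server has persisted can be dropped from the retry queue.
    return pending - persisted
```

The agent would keep resending only what this returns, so an ack that Fleet Server accepted but failed to persist stays in the retry set until it is actually stored.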
If using a regular index would solve this problem, would it be possible to implement a simple form of ILM manually on that index, for example a periodic task that deletes documents based on age or state?
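As a rough sketch of the age-based variant, a periodic task could send a range query like the one below to Elasticsearch's `_delete_by_query` API. The helper name, the `@timestamp` field, and the retention period are assumptions for illustration, not values from this thread.

```python
def cleanup_query(max_age_days: int, timestamp_field: str = "@timestamp") -> dict:
    """Build a request body that matches documents older than max_age_days.

    The field name and the retention period are assumptions; adjust them to
    whatever the index actually stores.
    """
    if max_age_days <= 0:
        raise ValueError("max_age_days must be positive")
    # Elasticsearch date math: "now-30d" means 30 days before the current time.
    return {"query": {"range": {timestamp_field: {"lt": f"now-{max_age_days}d"}}}}
```

A state-based cleanup would swap the range clause for a filter on whatever field records completion.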
I agree, this seems like the only viable path forward with our current architecture. Upgrading from a prior version will be the hardest part, especially when you consider multiple Fleet Servers coordinating the migration. @juliaElastic let's chat when you pick this one up, but I think we'll need to do something like:
We need to make sure no more than one FS node is doing the reindex; most of the other operations are idempotent, so it shouldn't be an issue. We will also have to think through and test what happens if the FS nodes aren't upgraded at the same time, such as an ESS customer who has on-prem FS nodes too.
Based on discussions about DLM, it seems that we would have to write our own cleanup logic with Regarding the support of older FS, I think there are 2 options:
EDIT: We just had a meeting about this, and we got to know that it is possible to do an update of a document in a data stream, either by knowing the backing index (for which we would have to query first) or do an Doing a cleanup with |
Did a POC to achieve unique action results per action per agent with data stream.
So with the last approach, it looks like we can keep using a data stream with ILM (stateful) and DLM (stateless).
The solution in #2782 makes sense to me with the current design. My only concern is that I'm not sure if the upcoming enhancements for improved upgrade visibility will require us to update a single doc in place as agents go through the various steps of the upgrade. I believe we'll also be tracking these sub-states on the main agent document, but do we also need this on the action results?
As far as I remember, we only plan changes on the agent doc and do not plan to update the action results doc. @ycombinator Could you confirm?
I just read the RFC again, it doesn't seem we'll need to update action results. Think you're good to go on #2782 👍
Yes, as the RFC stands right now, there should be no changes on the action results doc.
## Summary

Closes elastic/fleet-server#2596

Depends on elastic/fleet-server#2782

After changes in fleet-server, there will be no duplicate action results per agent, so the agent status query can be simplified to use `doc_count` instead of cardinality agg. This should eliminate the bug where the calculation is not accurate on higher scale.

To verify, run scale tests on 25-50-75k and verify that the status calc is accurate (not reporting more or less action results than the actual actioned agents).

### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
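To illustrate the difference between the two counting strategies in the summary, here is a hedged sketch. The request builders show the general shape of a `terms` aggregation with and without a `cardinality` sub-aggregation; the field names `action_id` and `agent_id` and all function names are assumptions, not the actual Kibana query. The two local helpers mimic what each strategy counts, and they ignore the fact that the Elasticsearch cardinality aggregation is approximate.

```python
def cardinality_request(action_ids: list[str]) -> dict:
    """Per action, estimate the number of distinct agents (approximate in Elasticsearch)."""
    return {
        "size": 0,
        "query": {"terms": {"action_id": action_ids}},
        "aggs": {
            "by_action": {
                "terms": {"field": "action_id", "size": len(action_ids)},
                "aggs": {"agents": {"cardinality": {"field": "agent_id"}}},
            }
        },
    }


def doc_count_request(action_ids: list[str]) -> dict:
    """Per action, rely on each bucket's doc_count instead of a sub-aggregation."""
    return {
        "size": 0,
        "query": {"terms": {"action_id": action_ids}},
        "aggs": {"by_action": {"terms": {"field": "action_id", "size": len(action_ids)}}},
    }


def local_doc_count(results: list[dict], action_id: str) -> int:
    """Count result documents for an action, duplicates included."""
    return sum(1 for r in results if r["action_id"] == action_id)


def local_distinct_agents(results: list[dict], action_id: str) -> int:
    """Count distinct agents for an action, exactly."""
    return len({r["agent_id"] for r in results if r["action_id"] == action_id})
```

The two local counts agree only when there is at most one result document per agent per action, which is why the simplification depends on the fleet-server change.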
While running perf tests with 25k horde drones, I encountered an issue where some actions (e.g. upgrade, reassign, unenroll) are reported as not fully acked, although the agents have completed the action.
This is causing a few perf test cases to fail, and it affects the Agent activity UI, which leaves those actions as in progress instead of complete.
At first I thought the issue was caused by some action results not being written out (action ack missing or write failed), but that doesn't seem to be the case.
I think that the `/action_status` query is not returning all acks.

Originally posted by @juliaElastic in #2519 (comment)