Action status ack reporting is not accurate #2596

Closed
juliaElastic opened this issue May 17, 2023 · 14 comments · Fixed by elastic/kibana#161873

While running perf tests with 25k horde drones, we encountered an issue where some actions (e.g. upgrade, reassign, unenroll) are reported as not fully acked, even though the agents have completed the action.
This causes a few perf test cases to fail, and it leaves those actions shown as in progress instead of complete in the Agent activity UI.

At first I thought the issue was caused by some action results not being written out (action ack missing or write failed), but that doesn't seem to be the case.
I think the /action_status query is not returning all acks.

Originally posted by @juliaElastic in #2519 (comment)

juliaElastic commented May 17, 2023

I wrote a script to query all agent ids from the action results to find the missing acks, and I didn't find any.

The script queries all .fleet-actions-results hits with search_after and compares the agent ids with those in .fleet-actions to find which agents haven't acked. Sample output:

{
  action_id: '9bd343a3-9b0f-4daa-86b6-8ba27c2f8821',
  type: 'FORCE_UNENROLL'
}
agents actioned: 25000
fleet-action-results hits: 10000
fleet-action-results hits: 10000
fleet-action-results hits: 5000
action results: 25000
missingAcks: 0
[]
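
For reference, a minimal sketch of this kind of search_after pagination in Dev Tools console form (the sort fields and _source filter are assumptions, not the exact script):

```
POST .fleet-actions-results/_search
{
  "size": 10000,
  "query": { "term": { "action_id": "9bd343a3-9b0f-4daa-86b6-8ba27c2f8821" } },
  "sort": [{ "@timestamp": "asc" }, { "agent_id": "asc" }],
  "_source": ["agent_id"]
}

# each subsequent page passes the sort values of the previous page's last hit,
# repeated until a page returns fewer than 10000 hits
POST .fleet-actions-results/_search
{
  "size": 10000,
  "query": { "term": { "action_id": "9bd343a3-9b0f-4daa-86b6-8ba27c2f8821" } },
  "sort": [{ "@timestamp": "asc" }, { "agent_id": "asc" }],
  "search_after": ["<last @timestamp>", "<last agent_id>"],
  "_source": ["agent_id"]
}
```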

The result of /action_status, which counts agent ids with a cardinality aggregation (supposed to be precise up to 40k), shows 3 missing acks:

    {
      actionId: '9bd343a3-9b0f-4daa-86b6-8ba27c2f8821',
      nbAgentsActionCreated: 25000,
      nbAgentsAck: 24997,
      type: 'FORCE_UNENROLL',
      nbAgentsActioned: 25000,
      status: 'IN_PROGRESS',
      expiration: '2023-06-15T10:07:14.918Z',
      creationTime: '2023-05-16T10:07:14.918Z',
      nbAgentsFailed: 0,
      hasRolloutPeriod: false,
      completionTime: '2023-05-16T10:07:15.391Z',
      latestErrors: []
    },

Agent activity: (screenshot of the Agent activity UI omitted)

One more interesting observation: policy reassign also shows 3 agents in progress, even though in the test logs the same action was reported with 25k acks.

I think this proves that the calculation is inconsistent, since we don't delete docs from .fleet-actions-results.

[09:35:39] INFO     oblt-perf.perf_lib: Done FleetActionStatus(actionId='0ad7fc97-a5f7-4c3a-a93a-c59266d5114c', nbAgentsActionCreated=25000,
  nbAgentsAck=25000, nbAgentsActioned=25000, nbAgentsFailed=0, version=None, startTime=None, type='POLICY_REASSIGN',
  status='COMPLETE', expiration='2023-06-15T09:27:01.326Z', creationTime='2023-05-16T09:27:01.326Z',
  hasRolloutPeriod=False, completionTime='2023-05-16T09:35:38.000Z', cancellationTime=None,
  newPolicyId='2a22f2f0-f3bf-11ed-9aea-35ae1276c7c2', policyId=None, revision=None, latestErrors=[]) 8m52s
    {
      "actionId": "0ad7fc97-a5f7-4c3a-a93a-c59266d5114c",
      "nbAgentsActionCreated": 25000,
      "nbAgentsAck": 24997,
      "type": "POLICY_REASSIGN",
      "nbAgentsActioned": 25000,
      "status": "IN_PROGRESS",
      "expiration": "2023-06-15T09:27:01.326Z",
      "newPolicyId": "2a22f2f0-f3bf-11ed-9aea-35ae1276c7c2",
      "creationTime": "2023-05-16T09:27:01.326Z",
      "nbAgentsFailed": 0,
      "hasRolloutPeriod": false,
      "completionTime": "2023-05-16T09:36:02.000Z",
      "latestErrors": []
    },

juliaElastic commented May 17, 2023

I think the root cause of this is the cardinality agg that we use to find unique agent ids in .fleet-actions-results.
According to the docs, even if we set the precision_threshold to 40k, the counts are not guaranteed to be accurate.
This agg was introduced to filter out duplicates from the action results, since sometimes there is more than one doc for one agent.
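
For illustration, a sketch of this kind of cardinality aggregation in console form (field names are assumptions, not copied from the actual Kibana query):

```
GET .fleet-actions-results/_search
{
  "size": 0,
  "query": { "term": { "action_id": "9bd343a3-9b0f-4daa-86b6-8ba27c2f8821" } },
  "aggs": {
    "acked_agents": {
      "cardinality": {
        "field": "agent_id",
        "precision_threshold": 40000
      }
    }
  }
}
```

40000 is the maximum precision_threshold ES accepts, and the docs only promise counts "close to accurate" below the threshold, since the agg is based on the HyperLogLog++ algorithm; that matches the off-by-3 counts above.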

To solve this, there are a few options:

  • Make sure one agent has only one action result per action, so that we could use doc_count and wouldn't need the cardinality agg for the status calculation.
    • Fleet Server could run a search before writing out an action result, to make sure no result exists yet.
  • The action status calculation could query the state of the agents to see whether an action has completed. This is currently not easy to do, as each action has a different end state on the agent (e.g. an upgrade results in a new version on the agent).
    • Though in our long-term vision this would work, since we want to move away from action acks and store the action state on the agents.
  • In the perf tests we could ignore the actions in progress if the agents are updated, though this doesn't help the UI, which would still show an incorrect action state.

juliaElastic self-assigned this May 17, 2023
joshdover commented May 17, 2023

  • Fleet Server could run a search before writing out an action result, to make sure no result exists yet.

This wouldn't be foolproof because there is a refresh delay between writing a document and being able to search it back. On Serverless, this delay can even be on the order of seconds. So this could still result in writing multiple documents for the same agent if the agent retries or Fleet Server retries the write.

Unfortunately, I don't have a great idea for a solution right now. If we used a regular index instead of a data stream, we could write a document with a specific _id and just "upsert" on subsequent acks onto the same doc, resulting in only a single document.
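
A hedged sketch of what that upsert could look like against a plain index (the actionID:agentID id scheme and the field names are illustrative):

```
# creates the doc on the first ack; later retries update the same doc in place
POST .fleet-actions-results/_update/<actionID>:<agentID>
{
  "doc": {
    "action_id": "<actionID>",
    "agent_id": "<agentID>",
    "@timestamp": "2023-05-17T10:00:00.000Z"
  },
  "doc_as_upsert": true
}
```

As noted later in this thread, the _update API does not work against a data stream, which is what makes this a regular-index-only option.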

I believe the reason we use a data stream instead is to be able to support ILM of this data and age it out over time.

An option I can think of that might work, but would probably be slow, is to track the agent status on the original action document itself with succeeded_agents: [] and failed_agents: [] arrays of agent IDs. This could work because we can update that document. You would probably need a scripted update that only adds the agent ID if it's not already present in the array.
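
A sketch of such a scripted update (the succeeded_agents field and the doc id are hypothetical):

```
POST .fleet-actions/_update/<action doc id>
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.succeeded_agents.contains(params.agent_id)) { ctx.op = 'none' } else { ctx._source.succeeded_agents.add(params.agent_id) }",
    "params": { "agent_id": "<agent id>" }
  }
}

# ctx.op = 'none' turns a duplicate ack into a noop instead of rewriting the doc
```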

The issue is that then this single document becomes a point of contention in the system, leading to a lot of write conflicts, and would definitely slow down acks at scale.

@juliaElastic

If we used a regular index instead of a data stream, we could write a document with a specific _id and just "upsert" on subsequent acks onto the same doc, resulting in only a single document.

I was thinking about this too, though keeping thousands of action result docs in ES indefinitely might not be the best idea.

The issue is that then this single document becomes a point of contention in the system, leading to a lot of write conflicts, and would definitely slow down acks at scale.

Yeah, I think this would not be scalable, keeping 100k+ ids in a field of the actions doc.

I had another idea: we could use the action_seq_no field on the agents to find out how many agents acked a certain action. We do something similar with policy change status, where we query policy_revision_idx on the agents to count how many agents picked up the change.
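
For comparison, a sketch of that kind of agent-side count in console form (the index and field names are assumed from the .fleet-agents schema, and the revision number is made up):

```
GET .fleet-agents/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": {
    "bool": {
      "filter": [
        { "term": { "policy_id": "2a22f2f0-f3bf-11ed-9aea-35ae1276c7c2" } },
        { "range": { "policy_revision_idx": { "gte": 5 } } }
      ]
    }
  }
}

# hits.total.value = number of agents that picked up revision 5 or later
```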

@joshdover

we could use the action_seq_no field on the agents to find out how many agents acked a certain action.

I think this only tracks delivery of the action, not completion of it?

@juliaElastic

I think this only tracks delivery of the action, not completion of it?

I think you are right, so we don't have anything on the agent to indicate what the latest acked action was? We could introduce a new counter for this.

cmacknz commented May 23, 2023

I think you are right, so we don't have anything on the agent to indicate what the latest acked action was? We could introduce a new counter for this.

IMO the source of truth for acknowledged actions needs to be the Fleet actions indices, not what the agent thinks has been acknowledged. If anything we should be giving the agent the most recently persisted state for outstanding actions when it checks in. This way the agent knows when it can stop trying to send acknowledgements based on what is actually stored in ES. This would eliminate the class of problem where persisting the acknowledgement fails but Fleet Server does not return an error to the agent.

Unfortunately, I don't have a great idea for a solution right now. If we used a regular index instead of a data stream, we could write a document with a specific _id and just "upsert" on subsequent acks onto the same doc, resulting in only a single document.

I believe the reason we use a data stream instead is to be able to support ILM of this data and age it out over time.

If using a regular index would solve this problem, would it be possible to implement a simple form of ILM manually on that index? A periodic task that deletes all documents based on age or state?

@joshdover

If using a regular index would solve this problem, would it be possible to implement a simple form of ILM manually on that index? A periodic task that deletes all documents based on age or state?

I agree, this seems like the only viable path forward with our current architecture. Upgrading from a prior version will be the hardest part, especially when you consider multiple Fleet Servers coordinating the migration.

@juliaElastic let's chat when you pick this one up, but I think we'll need to do something like:

  • Block writes to the old data stream
  • Create the new index
  • Reindex all of the data from the old data stream to the new index
  • Swap the alias over to the new index
  • Delete the old data stream
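
Roughly, in console form (all names here are hypothetical, and the write-block and alias mechanics would need to be verified against how fleet-server actually addresses this data stream):

```
# 1. block writes to the old data stream's backing indices
PUT .fleet-actions-results/_settings
{ "index.blocks.write": true }

# 2. create the new regular index
PUT .fleet-actions-results-v2

# 3. copy the existing results over
POST _reindex
{
  "source": { "index": ".fleet-actions-results" },
  "dest": { "index": ".fleet-actions-results-v2" }
}

# 4. point the write alias at the new index
POST _aliases
{
  "actions": [
    { "add": { "index": ".fleet-actions-results-v2", "alias": "<write alias>", "is_write_index": true } }
  ]
}

# 5. remove the old data stream
DELETE _data_stream/.fleet-actions-results
```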

We need to make sure that no more than one FS node does the reindex; most of the other operations are idempotent, so they shouldn't be an issue.

We'll also have to think through and test what happens if the FS nodes aren't upgraded at the same time, such as for an ESS customer who also has on-prem FS nodes.

juliaElastic commented Jun 6, 2023

Based on discussions about DLM, it seems that we would have to write our own cleanup logic with delete_by_query if we move to a regular index, as DLM will only be supported on data streams.
I think it could be a task that runs regularly and deletes the old documents (e.g. older than 30d or 90d).
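
A minimal sketch of that cleanup query (the retention period and timestamp field are assumptions):

```
POST .fleet-actions-results/_delete_by_query
{
  "query": {
    "range": { "@timestamp": { "lt": "now-30d" } }
  }
}
```

Note that delete_by_query only marks documents as deleted and relies on segment merges to reclaim the space, which is part of the performance concern mentioned below.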

Regarding the support of older FS versions, I think there are 2 options:

  • The new FS does the migration to a regular index and reindexes the data, and the old FS keeps using the alias to add new documents. I think the existing FS insert code should work regardless of whether a data stream or an index is behind the alias, but we should test it.
  • Alternatively, we could use a new name/alias for the regular index, and the new FS would only write to the new one. Kibana would then have to read from both the existing action results data stream and the new index until we remove the data stream. This is more work, as we'd keep supporting the data stream (and add DLM for serverless) and have to plan for its eventual removal.

EDIT: We just had a meeting about this, and we learned that it is possible to update a document in a data stream, either by knowing the backing index (which we would have to query first) or by doing an update_by_query.
I think we should do a POC with update_by_query on the data stream and see if it would solve the dedupe of results.

Doing a cleanup with delete_by_query has its downsides as well; the ES team confirmed that it may come with performance degradation.

juliaElastic commented Jul 13, 2023

Did a POC to achieve unique action results per action per agent with the data stream.

  • Couldn't use update_by_query, as it only updates existing documents and does not create new ones.
  • Tried update with the doc_as_upsert option, but it didn't work with a data stream. An additional read would be needed to know the underlying write index.
  • Changed the action result doc id to a unique id of the form actionID:agentID, ignoring conflict errors (see the sketch below). This guarantees at most 1 document per action per agent, and it works well with data streams. Draft pr: using actionID+agentID as action results doc id #2782
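
In console form, the write looks roughly like this (field names are assumed; the draft PR does this through the Go ES client):

```
# the first ack from this agent for this action creates the doc
PUT .fleet-actions-results/_create/<actionID>:<agentID>
{
  "@timestamp": "2023-07-13T10:00:00.000Z",
  "action_id": "<actionID>",
  "agent_id": "<agentID>"
}

# a retried ack with the same id fails with 409 version_conflict_engine_exception,
# which can simply be ignored since the result is already recorded
```

The _create endpoint forces op_type create, which is the only write operation a data stream accepts, so this works without knowing the backing index.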

So with the last approach, it looks like we can keep using a data stream with ILM (stateful) and DLM (stateless).
A Kibana change is needed too, to remove the cardinality agg when calculating action results, since we can rely on the doc count once the fleet-server change is in. This way the action result calculation should be accurate.
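
With unique docs guaranteed, the Kibana side can count acks with a plain hit count instead of the cardinality agg; roughly (query shape assumed):

```
GET .fleet-actions-results/_search
{
  "size": 0,
  "track_total_hits": true,
  "query": { "term": { "action_id": "<actionID>" } }
}

# hits.total.value is now an exact ack count for the action
```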

@joshdover

The solution in #2782 makes sense to me with the current design. My only concern is that I'm not sure whether the upcoming enhancements for improved upgrade visibility will require us to update a single doc in place as agents go through the various steps of the upgrade. I believe we'll also be tracking these sub-states on the main agent document, but do we also need them on the action results?

juliaElastic commented Jul 13, 2023

The solution in #2782 makes sense to me with the current design. My only concern is that I'm not sure whether the upcoming enhancements for improved upgrade visibility will require us to update a single doc in place as agents go through the various steps of the upgrade. I believe we'll also be tracking these sub-states on the main agent document, but do we also need them on the action results?

As far as I remember, we only plan changes on the agent doc and don't plan to update the action results doc. @ycombinator Could you confirm?

@joshdover

I just read the RFC again; it doesn't seem we'll need to update action results. Think you're good to go on #2782 👍

@ycombinator

The solution in #2782 makes sense to me with the current design. My only concern is that I'm not sure whether the upcoming enhancements for improved upgrade visibility will require us to update a single doc in place as agents go through the various steps of the upgrade. I believe we'll also be tracking these sub-states on the main agent document, but do we also need them on the action results?

As far as I remember, we only plan changes on the agent doc and don't plan to update the action results doc. @ycombinator Could you confirm?

Yes, as the RFC stands right now, there should be no changes on the action results doc.

juliaElastic added a commit to elastic/kibana that referenced this issue Jul 14, 2023
## Summary

Closes elastic/fleet-server#2596

Depends on elastic/fleet-server#2782

After the changes in fleet-server, there will be no duplicate action results
per agent, so the action status query can be simplified to use
`doc_count` instead of the cardinality agg.
This should eliminate the bug where the calculation is not accurate at
higher scale.

To verify, run scale tests at 25-50-75k and verify that the status calc
is accurate (not reporting more or fewer action results than the actual
number of actioned agents).


### Checklist

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios