
Upgrade action sometimes leaves agent in Updating state #2519

Closed · juliaElastic opened this issue Apr 20, 2023 · 21 comments
Labels: Project:FleetScaling, Team:Fleet

@juliaElastic (Contributor)

While investigating the issue of some horde drones converging slowly during an upgrade, I encountered an issue where a few drones were left in the Updating state indefinitely (no change over a few hours).

When looking at the logs, I saw that the action was delivered, and during the ack it hit an error while writing out the action result (a temporary connection issue with ES): https://github.com/elastic/fleet-server/blob/main/internal/pkg/api/handleAck.go#L323

It seems that when this error was hit, the upgrade action neither continued nor retried, and the agent was left in the Updating state.
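
To make the failure mode concrete, here is a minimal sketch (hypothetical helper names, not the actual fleet-server code in handleAck.go) of how a transient ES failure on the action-result write needs to either be retried or surfaced back to the agent rather than silently dropped:

```go
// Illustrative only: a transient ES failure while writing the action result
// must not be swallowed, otherwise the agent never retries and stays Updating.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errESUnavailable = errors.New("elasticsearch: connection refused")

// writeActionResult stands in for the call that indexes the upgrade action
// result; here it always fails to simulate the outage.
func writeActionResult(ctx context.Context, agentID, actionID string) error {
	return errESUnavailable
}

// writeActionResultRetrying retries briefly with backoff and, crucially,
// returns the final error so the ack handler can report a failure back to
// the agent instead of a silent success.
func writeActionResultRetrying(ctx context.Context, agentID, actionID string) error {
	backoff := 500 * time.Millisecond
	var err error
	for attempt := 0; attempt < 3; attempt++ {
		if err = writeActionResult(ctx, agentID, actionID); err == nil {
			return nil
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("persist action result for agent %s, action %s: %w", agentID, actionID, err)
}

func main() {
	err := writeActionResultRetrying(context.Background(), "agent-1", "upgrade-1")
	fmt.Println("ack outcome:", err) // this outcome must reach the agent's acker so it can retry later
}
```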

This could be solved by the redesign of the upgrade action: https://github.com/elastic/ingest-dev/issues/1621

@joshdover (Contributor)

Let's not couple this to https://github.com/elastic/ingest-dev/issues/1621, as that is likely to be smaller in scope to begin with.

Do you have any logs of what the error was? I remember @blakerouse fixed an issue where we were silently dropping errors and not retrying. I believe we added retries up to 3 attempts, but maybe that's not enough, or it only covered one error (I think it was retrying on conflict errors): #1896

@joshdover (Contributor)

We discussed that this old PR may also be a potential workaround/fix for these transient issues, by ensuring that the agent's version and upgrade status are always updated on every checkin: #1731

@juliaElastic (Contributor, Author)

The error was a connection issue to ES when trying to create the action result; as far as I could see, the code didn't retry after this error at all.

joshdover added the Team:Fleet label on May 3, 2023
@cmacknz (Member) commented May 3, 2023

> I believe we added retries up to 3 attempts, but maybe that's not enough, or it only covered one error (I think it was retrying on conflict errors): #1896
>
> The error was a connection issue to ES when trying to create the action result; as far as I could see, the code didn't retry after this error at all.

Let's do a thought experiment here. In the middle of an attempt to upgrade 10K agents, the Elasticsearch deployment for Fleet server has 20 minutes of continuous downtime. What happens to all of our upgrade actions? I don't consider "many agents get stuck in updating until the user manually intervenes" to be the right answer.

Note that I have used upgrades as an easy example, and some of the changes in https://github.com/elastic/ingest-dev/issues/1621 could be a bandaid for this specifically for upgrades. However, this would do absolutely nothing for any other action type. What if, instead of upgrades, they were security response actions?

How do we ensure that any type of action is eventually acknowledged?

@joshdover (Contributor)

Agreed, it'd be a bandaid. We need to:

  1. Ensure a failed ack is properly propagated back to the client so it can be retried properly
  2. Re-think how acks happen in general. What I like about https://github.com/elastic/ingest-dev/issues/1621 is that we start to drive things through a single API / message exchange. This simplifies the agent implementation quite a bit and lends itself well to being replicated on a Websocket protocol in the future.

I think for this issue right now, we should focus on (1). I wouldn't worry so much about the server having complex retry logic; we should just rely on the client's ability to keep track of unack'd actions to be retried.

In the future, I think we could change the checkin request format to include a list of unack'd actions from the client. The server can then respond once they're ack'd and the client can keep re-sending any that haven't been ack'd yet.
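
To illustrate that proposal (purely hypothetical shapes, not the current checkin API):

```go
// Hypothetical shape only: the checkin request carries the client's list of
// unacknowledged action IDs, and the response confirms which ones the server
// has now persisted, so the client can keep re-sending the rest.
package main

import (
	"encoding/json"
	"fmt"
)

type CheckinRequest struct {
	Status         string   `json:"status"`
	AckToken       string   `json:"ack_token,omitempty"`
	UnackedActions []string `json:"unacked_actions,omitempty"` // hypothetical field
}

type CheckinResponse struct {
	Actions      []json.RawMessage `json:"actions"`
	AckedActions []string          `json:"acked_actions,omitempty"` // hypothetical field
}

func main() {
	req := CheckinRequest{
		Status:         "online",
		UnackedActions: []string{"9bd343a3-9b0f-4daa-86b6-8ba27c2f8821"},
	}
	b, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(b))
}
```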

@cmacknz (Member) commented May 4, 2023

Agreed 1 should be the focus here. I am beginning to believe it may be worth trying to address the design of how acknowledgements work in general as part of the upgrade rework.

We could summarize a successful strategy here as:

  1. All errors in Fleet Server that cause an upgrade acknowledgement to fail are successfully reported back to the agent. The agent cannot be given a successful response if the action has not been confirmed to have been persisted to Elasticsearch.
  2. The agent persists unacknowledged actions and continues to retry them periodically.
  3. We have a way to observe the actions the agent has yet to acknowledge. For the upgrade this can include significant detail, for simpler actions it can just be the action ID.

This is essentially having the agent act like a persistent queue for unacknowledged actions. The system is supposed to work mostly like this today, but it doesn't quite work and is very difficult to debug. I don't think it is that much of a stretch to try to fix this as part of the upgrade work we are already planning.
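
A minimal sketch of that persistent-queue idea (illustrative only, not elastic-agent code): unacked action IDs survive a restart on disk and are retried until the server confirms them.

```go
// Sketch of an agent-side persistent queue of unacknowledged actions.
// File name, interval, and the sendAck callback are assumptions.
package ackqueue

import (
	"context"
	"encoding/json"
	"os"
	"time"
)

const queueFile = "unacked_actions.json" // hypothetical location

type Queue struct {
	Pending []string `json:"pending"`
}

// Load restores the queue from disk so unacked actions survive a restart.
func Load() (*Queue, error) {
	q := &Queue{}
	b, err := os.ReadFile(queueFile)
	if os.IsNotExist(err) {
		return q, nil
	}
	if err != nil {
		return nil, err
	}
	return q, json.Unmarshal(b, q)
}

func (q *Queue) Save() error {
	b, err := json.Marshal(q)
	if err != nil {
		return err
	}
	return os.WriteFile(queueFile, b, 0o600)
}

// RetryLoop periodically re-sends every pending ack; sendAck stands in for the
// Fleet Server acks API call and must return an error on any failure.
func (q *Queue) RetryLoop(ctx context.Context, sendAck func(ctx context.Context, actionID string) error) {
	t := time.NewTicker(30 * time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			var still []string
			for _, id := range q.Pending {
				if err := sendAck(ctx, id); err != nil {
					still = append(still, id) // keep it for the next round
				}
			}
			q.Pending = still
			_ = q.Save()
		}
	}
}
```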

@cmacknz (Member) commented May 4, 2023

We also need a way for Fleet Server to confirm that the agent has successfully persisted an action it was sent. Otherwise it seems possible for the agent to restart immediately after receiving a checkin response with an action before it has enqueued it.

@joshdover (Contributor)

> We also need a way for Fleet Server to confirm that the agent has successfully persisted an action it was sent.

I believe we already have this via the ack_token field during checkin. The agent should always send the ack_token of the last persisted action, and Fleet Server should always deliver any actions newer than that.
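
For reference, a schematic of that exchange (field names simplified; not the exact fleet-server request/response types):

```go
// Schematic of the ack_token handshake: the agent echoes the token of the
// last action it has persisted, and the server only sends newer actions.
package checkin

type Request struct {
	AckToken string `json:"ack_token,omitempty"` // token of the last action the agent persisted
	Status   string `json:"status"`
}

type Response struct {
	AckToken string          `json:"ack_token"` // new high-water mark for the agent to store
	Actions  []ActionPayload `json:"actions"`   // only actions newer than the request's ack_token
}

type ActionPayload struct {
	ID   string `json:"id"`
	Type string `json:"type"`
}
```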

@joshdover (Contributor) commented May 5, 2023

I dug into the Agent code and AFAICT we never stop retrying failed acks; Horde should definitely do the same. Source: https://github.com/elastic/elastic-agent/blob/ceb3ca1d09c1a35a364f13a11e9a8e2643b861e2/internal/pkg/fleetapi/acker/lazy/lazy_acker.go#L87-L113

@juliaElastic (Contributor, Author) commented May 5, 2023

I found this; I thought it defines 5 retries by default for action acks, or is it referring to something else? https://github.com/elastic/elastic-agent/blob/main/internal/pkg/fleetapi/acker/retrier/retrier.go#L19

It is not that trivial to remove the max retries from horde; currently it does 20 retries in a for loop in handleAck. We can't retry indefinitely, as I think it would block other actions from being acked.
https://github.com/elastic/horde/blob/main/internal/pkg/drone/action.go#L169
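
Not the actual horde code, but one way to avoid that trade-off would be to hand failed acks to a background retrier instead of looping inline in handleAck; a rough sketch:

```go
// Sketch: failed acks are pushed onto a channel and retried by a background
// goroutine, so the drone's main action loop is never blocked by one ack.
package droneack

import (
	"context"
	"time"
)

type failedAck struct {
	actionID string
	attempts int
}

// Retrier drains failed acks in the background so handleAck can return after
// the first failure instead of spinning in a bounded retry loop.
type Retrier struct {
	ch  chan failedAck
	ack func(ctx context.Context, actionID string) error
}

func NewRetrier(ack func(ctx context.Context, actionID string) error) *Retrier {
	return &Retrier{ch: make(chan failedAck, 128), ack: ack}
}

func (r *Retrier) Enqueue(actionID string) { r.ch <- failedAck{actionID: actionID} }

func (r *Retrier) Run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case f := <-r.ch:
			if err := r.ack(ctx, f.actionID); err != nil {
				f.attempts++
				// back off, then requeue; no hard cap here, unlike the
				// current 20-iteration loop.
				time.Sleep(time.Duration(f.attempts) * time.Second)
				r.ch <- f
			}
		}
	}
}
```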

@joshdover (Contributor)

It seems there are multiple ackers and retriers in here; I don't really understand the difference or the reason for the complexity. @pchila, you've been in the Fleet gateway code recently. Can you help us understand the actual agent behavior on ack retries (ideally with manual testing)? Any idea what the reason is for this complexity?

For the horde problem, this seems to be another reason we'd benefit from having acks flow through the checkin endpoint. This way there's only a single loop for message exchange in horde.

@pchila (Member) commented May 8, 2023

@joshdover The Fleet Gateway doesn't concern itself with acks: it will periodically try to check in, retrying indefinitely (with exponential backoff) until a checkin succeeds, then send the resulting actions for processing (buffered channel of length 1), or until the Fleet Gateway has to shut down.

Disclaimer: I am not too familiar with this part of the code, and all that follows is extrapolated from reading it, so take it with a pinch of salt.

The actions resulting from a check-in are passed to the dispatcher via an output channel of the Fleet Gateway, and the acks for those actions are sent after an action is dispatched and handled by the appropriate handler (there's a bit of concurrency here since the output channel is buffered).
The wiring for the Fleet Gateway, the ackers and the retriers can be found in managed_mode.go.
The retry mechanism itself is defined in retrier.go; from what I can see in the code, the retry happens when actions are enqueued in the retrier (it seems to me that we keep trying until we exhaust all the actions we have to ack).

The runDispatcher() function has a loop to flush dispatched actions periodically using a timer as well as reading from the Fleet Gateway output channel.

Note that the acker being used is a decorated object composed of:

  • the inner, undecorated fleetapi.Acker
  • the lazy.Acker, which collects single-action acks and tries to acknowledge them all when Commit() is called, using the decorated inner Acker. In case of failure it uses the retrier to start a loop that retries the failed acks
  • the outer StateStoreAcker, which persists policy change actions to disk after they've been acked (to allow the agent to start with the last acknowledged policy)

I didn't have the time to test it manually though and honestly I am not sure why it's been implemented this way.
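
To make the layering concrete, here is a simplified schematic of the composition described above (types and wiring are illustrative, not the actual elastic-agent code in managed_mode.go):

```go
// Schematic of the decorated acker stack: fleet acker -> lazy acker -> state store acker.
package ackers

import "context"

type Action struct{ ID string }

// Acker is the minimal interface each layer implements.
type Acker interface {
	Ack(ctx context.Context, a Action) error
	Commit(ctx context.Context) error
}

// fleetAcker is the inner, undecorated acker that talks to the Fleet Server acks API.
type fleetAcker struct{ /* fleet client */ }

func (f *fleetAcker) Ack(ctx context.Context, a Action) error { /* POST acks */ return nil }
func (f *fleetAcker) Commit(ctx context.Context) error        { return nil }

// lazyAcker batches single-action acks and flushes them on Commit; failures
// are handed to a retrier callback so they are eventually re-acked.
type lazyAcker struct {
	inner   Acker
	pending []Action
	retry   func(Action)
}

func (l *lazyAcker) Ack(ctx context.Context, a Action) error { l.pending = append(l.pending, a); return nil }
func (l *lazyAcker) Commit(ctx context.Context) error {
	for _, a := range l.pending {
		if err := l.inner.Ack(ctx, a); err != nil {
			l.retry(a) // enqueue for the retry loop
		}
	}
	l.pending = nil
	return l.inner.Commit(ctx)
}

// stateStoreAcker is the outer layer: after a successful commit it persists
// acked policy-change actions to disk so the agent can restart with the last
// acknowledged policy.
type stateStoreAcker struct{ inner Acker }

func (s *stateStoreAcker) Ack(ctx context.Context, a Action) error { return s.inner.Ack(ctx, a) }
func (s *stateStoreAcker) Commit(ctx context.Context) error {
	if err := s.inner.Commit(ctx); err != nil {
		return err
	}
	// ... write the acked policy change to the state store on disk ...
	return nil
}
```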

@juliaElastic (Contributor, Author) commented May 8, 2023

While debugging this, I encountered another issue where the drone is stuck in updating after upgrade.
For 2 drones out of 5000, I see this in the logs:
Agent is not currently connected. Not dispatching actions.

It is not followed by any "action delivered" messages, even though the drone checks in successfully afterwards.

The action was started at 2023-05-08T12:27:11.715Z with a 10m rollout period.

Fleet Server logs:
https://logging.us-west2.gcp.elastic-cloud.com/app/r/s/u2wD8
https://admin.found.no/deployments/b7634a61d2cdfbfa88629b9a2df305f9/integrations_server
Agent id: aa96fe4c-7956-43d3-8855-48da64951965
Drone name: eh-Goosetopaz-th2W
There was a failed checkin by the drone at 2023-05-08T12:37:52.332Z, though that was 10m later than the "Not dispatching actions" message.

Traces: https://ingest-metricstore.kb.us-central1.gcp.cloud.es.io:9243/app/apm/services/horde-ehswarm/transactions/view?kuery=labels.drone_name:%22eh-Goosetopaz-th2W%22&rangeFrom=now-60m&rangeTo=now&environment=ENVIRONMENT_ALL&serviceGroup=&comparisonEnabled=true&transactionName=Drone%20checkin%20with%20Fleet&transactionType=drone&offset=1d&latencyAggregationType=avg&score=0&timestamp=2023-05-08T13:22:00.768Z&transactionId=121b17b1059a00a4&traceId=121b17b1059a00a477f1902a7979f8b9&detailTab=timeline&flyoutDetailTab=&waterfallItemId=

EDIT: I think this issue only happens when I trigger the upgrade right after enroll, because it doesn't happen on the CI run after the other steps. I think it might have something to do with action_seq_no being 0.
It seems like on checkin, fetchAgentPendingActions doesn't return the upgrade action that was not delivered previously.

@joshdover (Contributor)

@juliaElastic what's your plan to close this one out? From my reading of this so far we should:

  • Ensure Fleet Server doesn't swallow any errors during ack, they need to be returned to the client
  • Fix the query for the action_seq_no being 0

@juliaElastic (Contributor, Author) commented May 9, 2023

I was trying to reproduce the original issue in the description, to see why the agent didn't retry when the action result failed to persist. It is hard to reproduce; I might have to hardcode an error to see what the agent does.

As for the seq_no issue, I think the 0 special case will be easy to fix; I'm trying to come up with a way to test it that doesn't rely on the 5k run, where it doesn't always happen.

I could reproduce the seq_no issue locally with 1 drone:

  • Start a fresh ES, kibana, fleet-server (no actions created before)
  • Enroll 1 drone with horde
  • restart fleet-server
  • trigger an upgrade action; the action is persisted with _seq_no:0
  • wait for the drone to check in; it doesn't receive the action and stays stuck in updating.

To check the _seq_no field of actions:

GET .fleet-actions/_search
{
   "seq_no_primary_term": true
}

What I'm seeing in the Fleet Server logs (I added a log line to print the pending actions query):

14:04:07.256 DBG Agent is not currently connected. Not dispatching actions. ecs.version=1.6.0 fleet.agent.id=17a7692e-e490-4ed8-a9c7-6cd457ff40b0 service.name=fleet-server
14:04:30.308 DBG HTTP start client.address=192.168.178.119:64743 client.ip=192.168.178.119 client.port=64743 ecs.version=1.6.0 fleet.apikey.id=DgbRAIgByct60_Ji9WVQ http.request.method=POST http.version=1.1 service.name=fleet-server tls.cipher=TLS_AES_128_GCM_SHA256 tls.established=true tls.resumed=false tls.version=1.3 url.full=/api/fleet/agents/17a7692e-e490-4ed8-a9c7-6cd457ff40b0/checkin
14:04:30.571 DBG findActionsHits ecs.version=1.6.0 query="{\"_source\":{\"excludes\":[\"agents\"]},\"query\":{\"bool\":{\"filter\":[{\"range\":{\"_seq_no\":{\"gt\":0}}},{\"range\":{\"_seq_no\":{\"lte\":0}}},{\"range\":{\"expiration\":{\"gt\":\"2023-05-09T14:04:30Z\"}}},{\"terms\":{\"agents\":[\"17a7692e-e490-4ed8-a9c7-6cd457ff40b0\"]}}]}},\"seq_no_primary_term\":true,\"size\":100,\"sort\":[\"_seq_no\"]}" service.name=fleet-server
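
Reading that query, the pending-actions window appears to be "_seq_no > <agent checkpoint>" combined with "_seq_no <= <index checkpoint>"; when both are 0 (a freshly enrolled agent and the very first document in .fleet-actions), the window is empty, so the action at _seq_no 0 can never match. A tiny illustration of the boundary (an assumption about the intended semantics, not the fleet-server query builder itself):

```go
// Illustrative only: a gt/lte seq_no window excludes the very first action
// when the agent's stored checkpoint defaults to 0 instead of a -1 sentinel.
package main

import "fmt"

func pendingWindow(agentCheckpoint, indexCheckpoint int64) (gt, lte int64, empty bool) {
	gt, lte = agentCheckpoint, indexCheckpoint
	return gt, lte, gt >= lte
}

func main() {
	// Fresh agent (checkpoint 0) and a single action persisted at _seq_no 0:
	gt, lte, empty := pendingWindow(0, 0)
	fmt.Printf("_seq_no > %d AND _seq_no <= %d -> empty window: %v\n", gt, lte, empty)
	// With a sentinel of -1 meaning "nothing delivered yet", _seq_no 0 is included:
	gt, lte, empty = pendingWindow(-1, 0)
	fmt.Printf("_seq_no > %d AND _seq_no <= %d -> empty window: %v\n", gt, lte, empty)
}
```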

Agent stuck in updating (screenshot omitted).

@cmacknz (Member) commented May 9, 2023

> I was trying to reproduce the original issue in the description, to see why the agent didn't retry when the action result failed to persist. It is hard to reproduce; I might have to hardcode an error to see what the agent does.

Related to the testing aspect of this, we want to build a mock Fleet Server purely for testing the agent: elastic/elastic-agent#2630. This would give us complete control over the checkin and ack responses returned to the agent, without needing to figure out how to force errors or particular timings with the real system.

@juliaElastic (Contributor, Author)

I found where the seq_no bug is and raised a fix: #2582

@juliaElastic (Contributor, Author) commented May 12, 2023

I simulated the original error locally by always throwing an error instead of writing out the action result here.

What I observed is that the drone acks the action (action seq_no:1, agent seq_no:1) and is stuck in updating.
However, I noticed in the Fleet Server logs that the action keeps retrying.
I changed the dummy error to throw randomly in 50% of cases, and I see the drones eventually completing the action successfully.
I don't fully understand why this works; it could be that the seq_no fix above fixed this too.
Originally, in the perf test runs, I didn't see the action result error being retried.

I'll keep testing with the perf runs to see whether the drones ever get stuck in updating with the custom image containing the fix.

@juliaElastic (Contributor, Author)

I merged the fix and ran a 25k test with an image containing the change: https://buildkite.com/elastic/observability-perf/builds/940#0188238a-642c-4d1f-88ea-3052351a82a0

What I'm seeing in upgrade and unenroll is that 3 acks seem to be missing, but the drones were actually upgraded and then unenrolled. Since the drones are in the right state, the agent ids with missing acks are not logged, so it is difficult to investigate. We should be logging the agents with missing acks in this case (by querying all agents in the action and comparing them with the list of acked agents in the action results).
The only issue here is that Agent activity shows these actions as not completed.

@joshdover (Contributor)

Your last two observations seem consistent with Fleet Server silently ignoring some error and not returning the result back to the drone correctly. In that case the drone wouldn't retry and the ack wouldn't be written.

@juliaElastic (Contributor, Author)

I wrote a script to query all agent ids from the action results to find the missing acks above, and surprisingly I didn't find any.
So now I suspect that the /action_status query has a bug and is not returning all acks.
I'll look into it more; I think this issue is good to close.

I'm querying all .fleet-action-results hits with searchAfter and comparing the agent ids with those in .fleet-actions to find which agents haven't acked (a sketch of this comparison follows the output below).

{
  action_id: '9bd343a3-9b0f-4daa-86b6-8ba27c2f8821',
  type: 'FORCE_UNENROLL'
}
agents actioned: 25000
fleet-action-results hits: 10000
fleet-action-results hits: 10000
fleet-action-results hits: 5000
action results: 25000
missingAcks: 0
[]
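
For reference, a minimal sketch of that comparison (not the actual script; the ES URL, the lack of auth, and sorting on agent_id as a keyword field are assumptions):

```go
// Page through .fleet-action-results with search_after, collect agent_id
// values for one action, and diff them against the action's agents array.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

const esURL = "http://localhost:9200" // assumption: local, unsecured cluster

type searchResp struct {
	Hits struct {
		Hits []struct {
			Source struct {
				AgentID string `json:"agent_id"`
			} `json:"_source"`
			Sort []any `json:"sort"`
		} `json:"hits"`
	} `json:"hits"`
}

// ackedAgents returns the set of agent ids that have a result for the action.
func ackedAgents(actionID string) (map[string]bool, error) {
	acked := map[string]bool{}
	var searchAfter []any
	for {
		body := map[string]any{
			"size":  10000,
			"query": map[string]any{"term": map[string]any{"action_id": actionID}},
			"sort":  []any{map[string]any{"agent_id": "asc"}}, // assumes agent_id is a keyword field
		}
		if searchAfter != nil {
			body["search_after"] = searchAfter
		}
		b, _ := json.Marshal(body)
		resp, err := http.Post(esURL+"/.fleet-action-results/_search", "application/json", bytes.NewReader(b))
		if err != nil {
			return nil, err
		}
		var sr searchResp
		decErr := json.NewDecoder(resp.Body).Decode(&sr)
		resp.Body.Close()
		if decErr != nil {
			return nil, decErr
		}
		if len(sr.Hits.Hits) == 0 {
			return acked, nil
		}
		for _, h := range sr.Hits.Hits {
			acked[h.Source.AgentID] = true
		}
		searchAfter = sr.Hits.Hits[len(sr.Hits.Hits)-1].Sort
	}
}

func main() {
	actionID := "9bd343a3-9b0f-4daa-86b6-8ba27c2f8821"
	actioned := []string{ /* agent ids from the .fleet-actions document's "agents" array */ }
	acked, err := ackedAgents(actionID)
	if err != nil {
		panic(err)
	}
	var missing []string
	for _, id := range actioned {
		if !acked[id] {
			missing = append(missing, id)
		}
	}
	fmt.Println("missingAcks:", len(missing), missing)
}
```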

The result of /action_status, which counts agent ids with an aggregation (should be precise up to 40k), shows 3 missing acks.

    {
      actionId: '9bd343a3-9b0f-4daa-86b6-8ba27c2f8821',
      nbAgentsActionCreated: 25000,
      nbAgentsAck: 24997,
      type: 'FORCE_UNENROLL',
      nbAgentsActioned: 25000,
      status: 'IN_PROGRESS',
      expiration: '2023-06-15T10:07:14.918Z',
      creationTime: '2023-05-16T10:07:14.918Z',
      nbAgentsFailed: 0,
      hasRolloutPeriod: false,
      completionTime: '2023-05-16T10:07:15.391Z',
      latestErrors: []
    },
