Upgrade action sometimes leaves agent in Updating state #2519
Comments
Let's not couple this to https://github.com/elastic/ingest-dev/issues/1621 as that is likely to be smaller in scope to begin with. Do you have any logs of what the error was? I remember @blakerouse fixed an issue where we were silently dropping errors and not retrying. I believe we added retries up to 3 attempts, but maybe it's not enough, or it only covered one error (I think it was retrying on conflict errors): #1896
We discussed that this old PR may also be a potential workaround/fix for these transient issues by ensuring that the agent's version and upgrading status are always updated on every checkin: #1731
The error was a connection issue to ES when trying to create the action result; as far as I saw, the code didn't retry after this error at all.
Let's do a thought experiment here. In the middle of an attempt to upgrade 10K agents, the Elasticsearch deployment for Fleet Server has 20 minutes of continuous downtime. What happens to all of our upgrade actions? I don't consider "many agents get stuck in updating until the user manually intervenes" to be the right answer. Note that I have used upgrades as an easy example, and some of the changes in https://github.com/elastic/ingest-dev/issues/1621 could be a bandaid for this specifically for upgrades. This would do absolutely nothing for any other action type, however. What if instead of upgrades they were security response actions? How do we ensure that any type of action is eventually acknowledged?
Agreed, it'd be a bandaid. We need to:
I think for this issue right now, we should focus on (1). I wouldn't worry so much about the server trying to have complex retry logic; just rely on the client's ability to keep track of unack'd actions to be retried. In the future, I think we could change the checkin request format to include a list of unack'd actions from the client. The server can then respond once they're ack'd and the client can keep re-sending any that haven't been ack'd yet.
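A rough sketch of what that checkin exchange could look like, under assumed, hypothetical request/response shapes (these field names are not the real Fleet API, just an illustration of the idea):

```go
package checkin

import "encoding/json"

// Hypothetical request/response shapes if the client reported its
// unacknowledged actions on every checkin. Field names are illustrative only.
type CheckinRequest struct {
	Status string `json:"status"`
	// IDs of actions the agent has received but has not yet seen
	// acknowledged by the server; re-sent on every checkin until acked.
	UnackedActions []string `json:"unacked_actions,omitempty"`
}

type CheckinResponse struct {
	// New actions for the agent to execute.
	Actions []Action `json:"actions"`
	// Subset of UnackedActions the server has now recorded as acked,
	// telling the client it can stop re-sending them.
	AckedActions []string `json:"acked_actions,omitempty"`
}

type Action struct {
	ID   string          `json:"id"`
	Type string          `json:"type"`
	Data json.RawMessage `json:"data,omitempty"`
}
```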
Agreed 1 should be the focus here. I am beginning to believe it may be worth trying to address the design of how acknowledgements work in general as part of the upgrade rework. We could summarize a successful strategy here as:
This essentially is having the agent act like a persistent queue for unacknowledged actions. The system is supposed to work mostly like this today, but it doesn't quite work and is very difficult to debug. I don't think it is that much of a stretch to try to fix this as part of the upgrade work we are already planning.
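As a sketch of the "persistent queue" idea, assuming a hypothetical on-disk store (this is not the elastic-agent implementation, just an illustration): actions are persisted when received and only removed once the server confirms the ack, so pending acks survive a restart.

```go
package actionqueue

import (
	"encoding/json"
	"os"
	"sync"
)

// ActionQueue persists unacknowledged actions to a single JSON file.
type ActionQueue struct {
	mu      sync.Mutex
	path    string
	pending map[string]json.RawMessage // action ID -> raw action document
}

// Open loads any actions that were still pending when the agent last stopped.
func Open(path string) *ActionQueue {
	q := &ActionQueue{path: path, pending: map[string]json.RawMessage{}}
	if b, err := os.ReadFile(path); err == nil {
		_ = json.Unmarshal(b, &q.pending) // best effort: start empty on a corrupt file
	}
	return q
}

// Enqueue persists an action before the agent starts handling it.
func (q *ActionQueue) Enqueue(id string, raw json.RawMessage) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending[id] = raw
	return q.flush()
}

// Ack removes an action only after the server has confirmed the acknowledgement.
func (q *ActionQueue) Ack(id string) error {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.pending, id)
	return q.flush()
}

// Pending returns the actions that still need to be (re-)acked, e.g. after a restart.
func (q *ActionQueue) Pending() map[string]json.RawMessage {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := make(map[string]json.RawMessage, len(q.pending))
	for id, raw := range q.pending {
		out[id] = raw
	}
	return out
}

func (q *ActionQueue) flush() error {
	b, err := json.Marshal(q.pending)
	if err != nil {
		return err
	}
	return os.WriteFile(q.path, b, 0o600)
}
```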
We also need a way for Fleet Server to confirm that the agent has successfully persisted an action it was sent. Otherwise it seems possible for the agent to restart immediately after receiving a checkin response with an action before it has enqueued it.
I believe we already have this via the
I dug into the Agent code and AFAICT we never stop retrying failed acks; Horde should definitely do the same. Source: https://github.com/elastic/elastic-agent/blob/ceb3ca1d09c1a35a364f13a11e9a8e2643b861e2/internal/pkg/fleetapi/acker/lazy/lazy_acker.go#L87-L113
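For reference, a minimal sketch of the "never stop retrying acks" behavior described above (not the actual lazy acker code; a simplified illustration of retrying with capped exponential backoff until shutdown):

```go
package acker

import (
	"context"
	"time"
)

// ackWithRetry keeps calling ack until it succeeds or the context is
// cancelled (agent shutdown), doubling the wait up to a cap between attempts.
func ackWithRetry(ctx context.Context, ack func(context.Context) error) error {
	backoff := time.Second
	const maxBackoff = 5 * time.Minute
	for {
		if err := ack(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // only give up when the agent is shutting down
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```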
I found this; I thought it defines 5 retries by default for action acks, or is it referring to something else? https://github.com/elastic/elastic-agent/blob/main/internal/pkg/fleetapi/acker/retrier/retrier.go#L19 It is not that trivial to remove the max retries from horde; currently it does 20 retries in a for loop in
Seems there are multiple ackers and retries in here; I don't really understand the difference or the reason for the complexity. @pchila you've been in the Fleet gateway code recently. Can you help us understand the actual agent behavior on ack retries (ideally with manual testing)? Any idea what the reason is for this complexity? For the horde problem, this seems to be another reason we'd benefit from having acks flow through the checkin endpoint. This way there's only a single loop for message exchange in horde.
@joshdover The Fleet Gateway doesn't concern itself with acks: it will periodically try to check in and retry indefinitely (with exponential backoff) until a checkin succeeds and it sends the resulting actions for processing (buffered channel of length 1), or until the fleet gateway has to shut down. Disclaimer: I am not too familiar with this part of the code and all that follows comes from what I extrapolated from looking at the code, so take it with a pinch of salt. The actions resulting from a check-in are passed to the dispatcher via an output channel of the fleet gateway, and the acks for those actions are sent after an action is dispatched and handled by the appropriate handler (there's a bit of concurrency here since the output channel is buffered). Note that the acker being used is a decorated object composed of:
I didn't have the time to test it manually though, and honestly I am not sure why it's been implemented this way.
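To make the "decorated acker" description above concrete, here is an illustrative-only sketch of how such a composition could be layered; the type and method names are invented for the example and do not match the elastic-agent code:

```go
package ackers

import "context"

// Acker sends action acknowledgements to Fleet Server.
type Acker interface {
	Ack(ctx context.Context, actionIDs []string) error
}

// fleetAcker would POST the acks to Fleet Server (omitted here).
type fleetAcker struct{}

func (fleetAcker) Ack(ctx context.Context, ids []string) error { return nil }

// retrier holds failed batches and re-sends them from a background loop
// (loop omitted here).
type retrier struct{ failed chan []string }

func (r *retrier) EnqueueFailed(ids []string) { r.failed <- ids }

// lazyAcker delegates to the wrapped acker and, on error, hands the batch to
// the retrier instead of dropping it, so a failed ack is never lost silently.
type lazyAcker struct {
	inner Acker
	retry *retrier
}

func (l *lazyAcker) Ack(ctx context.Context, ids []string) error {
	if err := l.inner.Ack(ctx, ids); err != nil {
		l.retry.EnqueueFailed(ids)
	}
	return nil
}
```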
While debugging this, I encountered another issue where the drone is stuck in updating after upgrade, and it is not followed by any action delivered messages, even though the drone does successful checkins afterwards. The action was started at
Fleet Server logs:
EDIT: I think this issue only happens if I trigger the upgrade right after enroll, because it doesn't happen on the CI run after the other steps. I think it might have something to do with
@juliaElastic what's your plan to close this one out? From my reading of this so far we should:
I was trying to reproduce the original issue in the description, to see why the agent didn't retry if the action result failed to persist. It is hard to reproduce; I might have to hardcode an error to see what the agent does. With the seq_no issue, I think it will be easy to fix the 0 special case; I'm trying to come up with a way to test it, so as not to rely on the 5k run where it doesn't always happen. I could reproduce the
To check the
Seeing in Fleet Server logs (added log to print the pending actions query):
Related to the testing aspect of this, we want to build a mock Fleet server purely for testing the agent: elastic/elastic-agent#2630. This would give us complete control over the checkin and ack responses returned to the agent without needing to figure out how to force errors or particular timings with the real system.
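As an illustration of what such a mock could look like (paths and payloads simplified, not the real Fleet API), a test could script the checkin and acks responses with a plain httptest server:

```go
package fleettest

import (
	"net/http"
	"net/http/httptest"
)

// newMockFleetServer returns a test server whose checkin and acks endpoints
// are fully scripted by the caller, so a test can force ack failures or
// specific action payloads and observe how the agent reacts.
func newMockFleetServer(checkinBody string, ackStatus int) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/checkin", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_, _ = w.Write([]byte(checkinBody)) // canned actions for the agent
	})
	mux.HandleFunc("/acks", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(ackStatus) // e.g. 503 to simulate ES being unavailable
	})
	return httptest.NewServer(mux)
}
```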
I found where the bug with the seq_no is and raised a fix: #2582
Simulated the original error locally by always throwing an error instead of writing out the action result here. What I observed is that the drone acks the action (action seq_no:1, agent seq_no:1) and it is stuck in updating. I'll keep testing with the perf runs to see if the drones ever get stuck in updating with the custom image containing the fix.
Merged the fix and ran a 25k test with an image containing the change: https://buildkite.com/elastic/observability-perf/builds/940#0188238a-642c-4d1f-88ea-3052351a82a0 What I'm seeing in upgrade and unenroll is that 3 acks seem to be missing, but the drones were actually upgraded and then unenrolled. Since the drones are in the right state, the agent ids with missing acks are not logged, so it is difficult to investigate. We should be logging the agents with missing acks in this case (by querying all agents in the action and comparing them with the list of acked agents in action results).
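The cross-check itself is straightforward once both lists are fetched from Elasticsearch; a small sketch (assuming the caller already has the action's agent list and the agent IDs found in the action results):

```go
package main

import "fmt"

// missingAcks returns the agents listed on the action that never show up
// in the action results, i.e. the ones whose acks are missing.
func missingAcks(actionAgents, ackedAgents []string) []string {
	acked := make(map[string]struct{}, len(ackedAgents))
	for _, id := range ackedAgents {
		acked[id] = struct{}{}
	}
	var missing []string
	for _, id := range actionAgents {
		if _, ok := acked[id]; !ok {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Hypothetical example IDs.
	fmt.Println(missingAcks([]string{"a1", "a2", "a3"}, []string{"a1", "a3"})) // [a2]
}
```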
Your last two observations seem consistent with Fleet Server silently ignoring some error and not returning the result back to the drone correctly. In that case the drone wouldn't retry and the ack wouldn't be written.
I wrote a script to query all agent ids from action results to find the missing acks above, and surprisingly I didn't find any. Querying all
Result of
While investigating the issue of some horde drones converging slowly during an upgrade, I encountered an issue where a few drones were left in the Updating state indefinitely (no change in a few hours).
When looking at the logs, I saw that the action was delivered, and during ack it encountered an error when writing out the action result (temporary connection issue with ES): https://github.com/elastic/fleet-server/blob/main/internal/pkg/api/handleAck.go#L323
It seems that when hitting this error, the upgrade action didn't continue and didn't retry, and the agent was left in the Updating state. This could be solved by the redesign of the upgrade action: https://github.com/elastic/ingest-dev/issues/1621
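One possible mitigation on the Fleet Server side, sketched below under the assumption that the action-result write can simply be retried (this is not the current fleet-server code): retry transient ES failures with backoff and surface an error to the agent if all attempts fail, so the agent knows the ack was not recorded and retries it.

```go
package ackretry

import (
	"context"
	"time"
)

// writeActionResultWithRetry retries a transient write failure a few times
// with exponential backoff; if every attempt fails it returns the error so
// the ack handler can report a failure instead of silently succeeding.
func writeActionResultWithRetry(ctx context.Context, write func(context.Context) error) error {
	backoff := 250 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if err = write(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
	return err
}
```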