Elastic Agent can get stuck in Updating state on Fleet upgrade #828
Comments
Pinging @elastic/fleet (Team:Fleet)
Transferring to Fleet Server, maybe
I was unable to recreate this on Ubuntu 20.04 using
Closing for now, following @michel-laterman's comment.
@jlind23 I'm going from 8.3.1 to 8.4.1 and still experience this. I'm getting around it with the force command, but I have 600 agents stuck updating at this time. Restarting the agent doesn't fix it; only the force upgrade does.
@KnowMoreIT we do have this issue that will fix the problem
Kibana version: 7.15.1
Describe the bug:
Elastic Agents can operate in unstable environments with network/DNS issues. When an Elastic Agent upgrade is initiated in Fleet mode, the agent can fail to ack the successful upgrade to Fleet Server, which leaves it permanently in the Updating state as reported in Kibana. This is confusing for the end user because it suggests an upgrade is still in progress, which is not the case. It is also not obvious how to recover, because the upgrade action is only available if the agent reports a version other than the latest.
Steps to reproduce:
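1. Edit /etc/hosts on the agent host to resolve the Fleet Server hostname to an unreachable IP address.
2. Initiate the upgrade in Fleet and wait for the agent state to change from Healthy to Updating.
3. Fix the /etc/hosts entry to restore correct IP address resolution.
4. The Updating state never changes back to Healthy.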
The following error can be noticed in agent logs:
[elastic_agent][warn] failed to ack update acknowledge action 'aec922d1-6591-4f69-90a1-3a6b7fa890d9' for elastic-agent 'c1fa118a-9676-446f-a025-154d2c42069b' failed: fail to ack to fleet: Post "https://<fleet-cluster-ID>.fleet.<region>.<cloud-provider>.elastic-cloud.com:443/api/fleet/agents/c1fa118a-9676-446f-a025-154d2c42069b/acks?": dial tcp 198.51.100.1:443: connect: connection timed out
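With many agents affected, it can help to pull the action ID, agent ID and dialed address out of this warning. The following is a minimal sketch that assumes the log line keeps the wording shown above; the function name `parse_ack_failure` and the regular expressions are illustrative and not part of Elastic Agent.

```python
import re

# Patterns modeled on the warning quoted above; the exact wording may
# differ between agent versions, so treat these as assumptions.
ACK_FAILURE = re.compile(
    r"failed to ack update acknowledge action '(?P<action_id>[0-9a-f-]+)' "
    r"for elastic-agent '(?P<agent_id>[0-9a-f-]+)'"
)
DIAL_TARGET = re.compile(r"dial tcp (?P<ip>[0-9.]+):(?P<port>\d+)")


def parse_ack_failure(line):
    """Return the IDs found in an ack-failure warning, or None if absent."""
    match = ACK_FAILURE.search(line)
    if match is None:
        return None
    result = {
        "action_id": match["action_id"],
        "agent_id": match["agent_id"],
    }
    # The dialed address is only present when the failure was a TCP error.
    dial = DIAL_TARGET.search(line)
    if dial is not None:
        result["ip"] = dial["ip"]
        result["port"] = int(dial["port"])
    return result
```

Feeding each line of an agent log through `parse_ack_failure` and collecting the non-`None` results gives the list of agent IDs that failed to ack.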
Expected behavior:
The agent should ultimately be reported as Healthy if the upgrade was successful, without any manual intervention (see workarounds).
Screenshots (if relevant):
Workarounds:
If an agent is restarted manually with elastic-agent restart on the host within 15 (?) minutes of the upgrade (and there are no more network/DNS issues), the successful upgrade is reported and the state shown by Kibana flips from Updating to Healthy.
Alternatively, an agent can be forced to upgrade again with the following API call:
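A minimal sketch of a forced upgrade request in Python, assuming the Kibana Fleet endpoint `POST /api/fleet/agents/<agent id>/upgrade` with a `force` flag in the JSON body. The helper name `build_force_upgrade_request`, the Kibana URL and the API key authentication are illustrative assumptions and should be checked against the Fleet API documentation for the version in use.

```python
import json
import urllib.request


def build_force_upgrade_request(kibana_url, agent_id, version, api_key):
    """Build (but do not send) a forced-upgrade request for one agent."""
    url = f"{kibana_url.rstrip('/')}/api/fleet/agents/{agent_id}/upgrade"
    # "force": True asks Fleet to accept the upgrade even though the agent
    # already reports the target version (assumed request shape).
    body = json.dumps({"version": version, "force": True}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "kbn-xsrf": "true",  # Kibana rejects API writes without this header
        "Authorization": f"ApiKey {api_key}",  # assumed auth scheme
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Usage (hypothetical values; uncomment to actually send the request):
# req = build_force_upgrade_request(
#     "https://kibana.example.com", "<agent id>", "7.15.1", "<api key>")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Building the request separately from sending it makes it easy to review the URL and body before looping over a large number of stuck agents.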
Other comments:
Elastic Agent communicates with three entities during the upgrade: the Fleet Server endpoint, the Elasticsearch endpoint, and artifacts.elastic.co (to download binaries). The above scenario shows a failure with the Fleet Server endpoint only. Other stuck states may be possible if a different combination of network/DNS failures occurs.
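To see which of these endpoints is unreachable from an agent host, a basic TCP reachability check can help. This is a minimal sketch: the function name `check_endpoints` is illustrative, and the Fleet Server and Elasticsearch hostnames in the usage comment are placeholders. It only tests DNS resolution and the TCP connect, not TLS or HTTP.

```python
import socket


def check_endpoints(endpoints, timeout=5.0):
    """Map each (host, port) pair to True if a TCP connection succeeds."""
    results = {}
    for host, port in endpoints:
        try:
            # create_connection resolves the name and connects; DNS errors,
            # refusals and timeouts are all subclasses of OSError.
            with socket.create_connection((host, port), timeout=timeout):
                results[(host, port)] = True
        except OSError:
            results[(host, port)] = False
    return results


# Usage (the first two hostnames are hypothetical placeholders):
# print(check_endpoints([
#     ("fleet.example.com", 443),
#     ("elasticsearch.example.com", 443),
#     ("artifacts.elastic.co", 443),
# ]))
```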