elastic-agent installed as Fleet server remains stuck in uninstalling with retry errors for over 4-5 minutes. #5752

Closed
amolnater-qasource opened this issue Oct 10, 2024 · 11 comments · Fixed by #5756 or #6085
Labels: bug, impact:medium, QA:Validated, Team:Elastic-Agent-Control-Plane

Comments

@amolnater-qasource

Kibana Build details:

VERSION: 8.16.0 SNAPSHOT
BUILD: 78993
COMMIT: 6eb8471c3124046eca03cccf20e0cc4f9706bcd5

Artifact: https://snapshots.elastic.co/8.16.0-106cdbc2/downloads/beats/elastic-agent/elastic-agent-8.16.0-SNAPSHOT-windows-x86_64.zip

Host: Windows Server 2022 - Test Signing ON

Preconditions:

  1. 8.16.0 SNAPSHOT Kibana self-managed environment should be available.
  2. Agent should be installed using Quickstart or Production mode of security.

Steps to reproduce:

  1. Observe Fleet Server is installed.
  2. Run .\Elastic\Agent\elastic-agent.exe uninstall.
  3. Observe that the "notify: Fleet network error[retry]" error keeps repeating for more than 4 minutes.
  4. Observe that the agent folder is already empty before the uninstall completes.

Expected Result:
elastic-agent installed as Fleet server should not remain stuck in uninstalling with retry errors for over 4-5 minutes.

Screenshots:
(Five screenshots attached in the original issue.)

@amolnater-qasource added the bug, impact:medium, and Team:Elastic-Agent-Control-Plane labels on Oct 10, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource
Author

@muskangulati-qasource Please review.

@muskangulati-qasource

Secondary review is done for this ticket!

@cmacknz
Member

cmacknz commented Oct 10, 2024

@michel-laterman this is probably from the uninstall notification. At minimum, there should be better error messages here if Fleet Server was legitimately unavailable.

@amolnater-qasource
Author

Hi Team,
We have revalidated this issue on the latest 8.16.0 BC1 Kibana cloud environment and have the following observations:

Observations:

  • Multiple error logs are consistently observed when running the uninstall command for an installed Fleet Server.

Screen Recording:

Amol.Self.WindowsBC.-.ec2-35-174-4-132.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2024-10-24.11-22-29.mp4
Amol.Self.WindowsBC.-.ec2-35-174-4-132.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2024-10-24.12-00-38.mp4

Build details:
VERSION: 8.16.0 BC1
BUILD: 79314
COMMIT: 5575428dd3aef69366cddb4ccf07a2a26d30ce48
Artifact Link: https://staging.elastic.co/8.16.0-e8d5928a/downloads/beats/elastic-agent/elastic-agent-8.16.0-windows-x86_64.zip

Hence, we are reopening this issue.

Thanks!!

@michel-laterman
Contributor

Hi @amolnater-qasource,

For the failures you posted:

Did you destroy the ES deployment before uninstalling the agent instance? If so, the failure messages in the first instance are expected, though it should have taken less time than the 4m originally reported. Do you recall how long it took?

Is the panic that occurred in the second scenario common? Do you have steps to recreate it? I would like to recreate this with a binary where the symbols are still present, because the failure when trying to marshal JSON seems strange to me.

@amolnater-qasource
Author

Hi @michel-laterman

Thank you for looking into this issue.

Did you destroy the ES deployment before uninstalling the agent instance? If so, the failure messages in the first instance are expected, though it should have taken less time than the 4m originally reported. Do you recall how long it took?

Please find the steps below:

  • Install Fleet Server on a Kibana cloud environment.
  • Keep it running for some time.
  • Run the uninstall command: C:\"Program Files"\Elastic\Agent\elastic-agent.exe uninstall
  • Observe multiple log lines in PowerShell.
  • Note: The ES deployment isn't destroyed.

The issue is also reproducible with the 8.17.0-SNAPSHOT agent.

Screen Recording:

Amol.Windows.2025.-.ec2-18-212-146-218.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2024-11-19.15-32-00.mp4

Is the panic that occurred in the second scenario common? Do you have steps to recreate it? I would like to recreate this with a binary where the symbols are still present, because the failure when trying to marshal JSON seems strange to me.

This panic is observed inconsistently when the VM has been shut down for a long time and the agent is then uninstalled.
It has been reproduced multiple times; however, no consistent steps are available.

Build details:
VERSION: 8.17.0 SNAPSHOT
BUILD: 80188
COMMIT: fdb16ae8cbdf4236db3696aa00d0bb98c943d864
Artifact Link: https://snapshots.elastic.co/8.17.0-0c51e25e/downloads/beats/elastic-agent/elastic-agent-8.17.0-SNAPSHOT-windows-x86_64.zip

Please let us know if anything else is required from our end.
Thanks!

@michel-laterman
Contributor

Panics are being tracked in their own issue: #5952

@michel-laterman
Contributor

michel-laterman commented Nov 19, 2024

This issue occurs if fleet-server is running locally.
It occurs because, by the time the uninstall notification is sent, the components (including the local fleet-server) have already been stopped by an earlier part of the uninstall procedure, and the elastic-agent will only try to contact a local fleet-server instance if one is part of the running policy.
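
The failure mode described here can be modeled in a few lines. This is a minimal sketch and not the elastic-agent implementation: `LocalFleetServer`, `notify_with_retry`, and `uninstall_stop_first` are hypothetical names, and the retry count of 3 is an arbitrary assumption. It only shows that once the local server is stopped, every notification attempt fails until the retries run out.

```python
class LocalFleetServer:
    """Hypothetical stand-in for a Fleet Server component run by the agent."""

    def __init__(self):
        self.running = True

    def stop(self):
        self.running = False

    def notify_uninstall(self, agent_id):
        # A stopped server cannot accept the notification.
        if not self.running:
            raise ConnectionError("fleet-server unreachable")
        return f"ack:{agent_id}"


def notify_with_retry(server, agent_id, attempts=3):
    """Try the notification a bounded number of times.

    Returns (ack, errors): ack is None if every attempt failed.
    """
    errors = []
    for _ in range(attempts):
        try:
            return server.notify_uninstall(agent_id), errors
        except ConnectionError as exc:
            errors.append(str(exc))
    return None, errors


def uninstall_stop_first(server, agent_id):
    """Assumed problematic order: stop components, then notify."""
    server.stop()
    return notify_with_retry(server, agent_id)
```

Calling `uninstall_stop_first(LocalFleetServer(), "agent-1")` exhausts all attempts without an acknowledgement, whereas `notify_with_retry` against a server that is still running succeeds on the first try.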

We can fix it in one of two ways:

  1. move uninstall notification before components are stopped (and have an edge case where the notification goes through, but the uninstall fails)
  2. read the policy to get other fleet-server hosts and attempt to notify them (and have an edge case that occurs when the last fleet-server is uninstalled)

@blakerouse
Contributor

  1. move uninstall notification before components are stopped (and have an edge case where the notification goes through, but the uninstall fails)

I would go with this, except I would only send the notification before stopping components when this Elastic Agent is running the Fleet Server. That way, the edge case where the notification goes through but the uninstall then fails can only occur on hosts running a Fleet Server, which are few.
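
The conditional ordering suggested in this comment could look roughly like the sketch below. The function `plan_uninstall` and the step names are hypothetical, and the position of the notification in the default order is an assumption made for illustration; the actual uninstall procedure has more steps than shown.

```python
from typing import List


def plan_uninstall(runs_fleet_server: bool) -> List[str]:
    """Return the order of (hypothetical) uninstall steps.

    Only a host that runs Fleet Server itself notifies first, because its
    local Fleet Server becomes unreachable once components are stopped.
    """
    if runs_fleet_server:
        # Notify while the local Fleet Server is still running.
        return ["notify_fleet", "stop_components", "remove_files"]
    # Assumed default order: notify after the rest of the uninstall is done.
    return ["stop_components", "remove_files", "notify_fleet"]
```

With this split, a notification that precedes a failed uninstall is only possible on hosts where `runs_fleet_server` is true.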

@amolnater-qasource
Author

Hi Team,

We have revalidated this issue on the latest 8.18.0 SNAPSHOT and found it fixed.

Observations:

  • elastic-agent installed as Fleet server gets uninstalled without any errors.

Screenshot:
(Screenshot attached in the original issue.)

Build details:
VERSION: 8.18.0 SNAPSHOT
BUILD: 81228
COMMIT: 9d6cc0792e538a076d68ffcfabbf6551912fb24e
Artifact Link: https://snapshots.elastic.co/8.18.0-94ccacd2/downloads/beats/elastic-agent/elastic-agent-8.18.0-SNAPSHOT-windows-x86_64.zip

Hence, we are marking this issue as QA:Validated.

Thanks!!

@amolnater-qasource added the QA:Validated label and removed the QA:Ready For Testing label on Dec 27, 2024