[Elastic Agent on Cloud] Fleet Server ends up shut down by Agent, so Cloud hosted Fleet Server is not started, can't use cloud #26588

Closed
EricDavisX opened this issue Jun 29, 2021 · 11 comments
@EricDavisX
Contributor

I’m seeing a problem with Fleet Server on 7.14 in cloud-staging. The Fleet Server Agent isn’t standing up on its own (the rest of the cloud environment seems fine). This does not seem to be a known issue, so I am logging it.

This reproduces on the latest 7.14 snapshot as of Jun 29, 4 PM.

The Kibana hash is dcacd04872050ff322f7e9bb36af913e40d5977e, which should give us the timing for the whole stack and Agent:
edavis-mbp:kibana_elastic edavis$ git show -s dcacd04872050ff322f7e9bb36af913e40d5977e
commit dcacd04872050ff322f7e9bb36af913e40d5977e
Author: Kibana Machine [email protected]
Date: Tue Jun 29 00:33:06 2021 -0400

Reproduced by using the defaults in cloud staging and picking 7.14-snapshot to deploy.

In the Kibana UI, it looks as though the APM/Fleet container just isn't set up (although it is):

[Screenshot: Screen Shot 2021-06-29 at 4 23 32 PM]


A brief conversation with Alex P from the cloud team helped us find some logs, which seem to indicate that the problem needs review on the Agent / Beats side.

Notes from slack, logs:
Alex Piggott 13 minutes ago
Failed to connect to backoff(elasticsearch(http://7cd47f69212147abb63f979fe801cd88.containerhost:9244)): Connection marked as failed because the onConnect callback failed: resource 'apm-7.14.0-transaction' exists, but it is not an alias

Alex Piggott 11 minutes ago
i assume that’s an unrelated issue?
oh wait wrong logs that’s APM

Alex Piggott 9 minutes ago
2021-06-29T17:08:03Z - message: Application: fleet-server--7.14.0-SNAPSHOT[]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'

Alex Piggott 9 minutes ago
so fleet server is stopped by agent

The reason may be this: 2021-06-29T17:08:02Z - message: Application: fleet-server--7.14.0-SNAPSHOT[]: State changed to DEGRADED: Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud; missing config fleet.agent.id (expected during bootstrap process) - type: 'STATE' - sub_type: 'RUNNING'
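
The two state-change lines quoted above share a regular shape, so when comparing several deployments it can help to pull out the timestamp, application, state, and detail mechanically. The following is only a minimal sketch: the regular expression is inferred from the two lines in this issue and is not an official Elastic Agent log schema, and the names `STATE_LINE` and `parse_state_line` are hypothetical.

```python
import re

# Pattern inferred from the two log lines quoted in this issue (an assumption,
# not a documented format).
STATE_LINE = re.compile(
    r"^(?P<timestamp>\S+) - message: Application: "
    r"(?P<application>[^\[]+)\[(?P<app_id>[^\]]*)\]: "
    r"State changed to (?P<state>[A-Z]+): (?P<detail>.*) "
    r"- type: '(?P<type>[^']*)' - sub_type: '(?P<sub_type>[^']*)'$"
)


def parse_state_line(line):
    """Return the fields of a state-change line as a dict, or None if it does not match."""
    match = STATE_LINE.match(line.strip())
    return match.groupdict() if match else None


# The two lines from the Slack notes, copied verbatim.
LINES = [
    "2021-06-29T17:08:03Z - message: Application: fleet-server--7.14.0-SNAPSHOT[]: "
    "State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'",
    "2021-06-29T17:08:02Z - message: Application: fleet-server--7.14.0-SNAPSHOT[]: "
    "State changed to DEGRADED: Running on policy with Fleet Server integration: "
    "policy-elastic-agent-on-cloud; missing config fleet.agent.id "
    "(expected during bootstrap process) - type: 'STATE' - sub_type: 'RUNNING'",
]

if __name__ == "__main__":
    events = [e for e in map(parse_state_line, LINES) if e]
    # ISO 8601 timestamps in the same zone sort correctly as plain strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        print(event["timestamp"], event["state"], "-", event["detail"])
```

Sorted by timestamp, the DEGRADED transition (17:08:02Z) comes one second before the STOPPED transition (17:08:03Z), which matches the reading that the Agent stopped Fleet Server right after it reported the missing `fleet.agent.id`.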

@botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) Jun 29, 2021
@EricDavisX added the Team:Elastic-Agent label (Label for the Agent team) Jun 29, 2021
@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Jun 29, 2021
@blakerouse
Contributor

I believe this is because of my recent change for HTTP2

@blakerouse
Contributor

Running the container locally shows that it crashes as it fails to enroll.

agent_1          | Error: fail to enroll: fail to execute request to fleet-server: unexpected EOF
agent_1          | Error: enrollment failed: exit status 1

@blakerouse self-assigned this Jun 29, 2021
@blakerouse
Contributor

My change has basically been completely redone by @urso in #25219, so I need to check whether that change actually fixes it or whether it's still broken with that change.

@tobio
Member

tobio commented Jun 29, 2021

I've been digging into this from another direction. The current 7.14-SNAPSHOT container cannot be successfully created in Cloud QA (or Cloud master). In https://github.com/elastic/cloud/pull/83408 we have changed the container health check to use the agent :6791/processes endpoint.

In the latest 7.14-SNAPSHOT this API appears to be unresponsive and so the container never passes the health check.
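
To check from inside the container whether that endpoint answers at all, a probe along the following lines may be useful. This is a rough sketch and not the health check from the Cloud PR, which is not shown in this thread: the `localhost` host, the 2 second timeout, the "HTTP 200 means healthy" rule, and the function name `processes_endpoint_healthy` are all assumptions; only the port 6791 and the `/processes` path come from the comment above.

```python
import http.client
import urllib.error
import urllib.request


def processes_endpoint_healthy(base_url="http://localhost:6791", timeout=2.0):
    """Return True if GET <base_url>/processes answers with HTTP 200 within the timeout."""
    url = base_url.rstrip("/") + "/processes"
    # Ignore any proxy settings in the environment: the probe targets a local port.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts, truncated responses, and HTTP error
        # statuses are all treated as "unhealthy" in this sketch.
        return False


if __name__ == "__main__":
    print("healthy" if processes_endpoint_healthy() else "unhealthy")
```

An unresponsive API like the one described here would show up as a timeout, so the function would return `False` after roughly `timeout` seconds rather than hanging.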

@amolnater-qasource

Hi @EricDavisX
We are blocked from continuing our testing on the cloud build, as this issue is reproducible on our end.

Thanks
QAS

@blakerouse
Contributor

Validated and confirmed that #25219 fixes it in master. I am just waiting for a green test run in 7.x for the backport, and then this will be fixed.

@blakerouse
Contributor

Fixed by #25219 and #26587

@blakerouse
Contributor

Seems that even with those fixes applied in 7.14, I am still seeing the following error:

agent_1          | Error: fail to enroll: fail to execute request to fleet-server: unexpected EOF
agent_1          | Error: enrollment failed: exit status 1

@blakerouse reopened this Jul 1, 2021
@blakerouse
Contributor

I re-opened this issue because I kept getting the same error that I commented above. That was because I was starting the same broken container image each time (user error on my part). With a Docker image built from the 7.14 branch of the beats repo, with a 7.14 fleet-server included in the bundle, the container starts up correctly.

@amolnater-qasource

Hi @EricDavisX
We have revalidated this on the 7.14.0 cloud-qa build and found it fixed now.

  • We are now able to install elastic-agent on 7.14.0 BC-1.

Thanks
QAS
