[Elastic Agent on Cloud] Fleet Server ends up shut down by Agent, so Cloud hosted Fleet Server is not started, can't use cloud #26588

EricDavisX · 2021-06-29T20:25:19Z

I'm seeing a problem with Fleet Server on 7.14 cloud in cloud-staging. It [Fleet Server Agent] isn't standing up on its own [the rest of the cloud env seems fine]. It does not seem to be a known issue, so I am logging it.

This is reproduced on the latest 7.14 snapshot as of Jun 29, 4 PM.

the kibana hash is: dcacd04872050ff322f7e9bb36af913e40d5977e
which should give us the timing for the whole stack and Agent...
edavis-mbp:kibana_elastic edavis$ git show -s dcacd04872050ff322f7e9bb36af913e40d5977e
commit dcacd04872050ff322f7e9bb36af913e40d5977e
Author: Kibana Machine [email protected]
Date: Tue Jun 29 00:33:06 2021 -0400

Reproduced by using the defaults in cloud staging and picking 7.14-snapshot to deploy.

From the Kibana UI, it looks as if the APM/Fleet container just isn't set up (although it is).

A brief conversation with Alex P from the Cloud team helped us find some logs, which seem to indicate that the problem needs review on the Agent / Beats side.

Notes from Slack, logs:
Alex Piggott 13 minutes ago
Failed to connect to backoff(elasticsearch(http://7cd47f69212147abb63f979fe801cd88.containerhost:9244)): Connection marked as failed because the onConnect callback failed: resource 'apm-7.14.0-transaction' exists, but it is not an alias

Alex Piggott 11 minutes ago
i assume that’s an unrelated issue?
oh wait wrong logs that’s APM

Alex Piggott 9 minutes ago
2021-06-29T17:08:03Z - message: Application: fleet-server--7.14.0-SNAPSHOT[]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'

Alex Piggott 9 minutes ago
so fleet server is stopped by agent

possibly for this reason: 2021-06-29T17:08:02Z - message: Application: fleet-server--7.14.0-SNAPSHOT[]: State changed to DEGRADED: Running on policy with Fleet Server integration: policy-elastic-agent-on-cloud; missing config fleet.agent.id (expected during bootstrap process) - type: 'STATE' - sub_type: 'RUNNING'
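
To make the ordering of the two state changes easier to see, here is a minimal sketch that pulls the timestamp, application and state out of lines shaped like the ones quoted above. The regular expression is inferred only from these two log lines, and `parse_state_change` is a hypothetical helper name, so treat it as an illustration rather than a description of the Agent's actual log format.

```python
import re

# Pattern inferred from the two log lines quoted in this issue (an assumption,
# not a documented format).
STATE_LINE = re.compile(
    r"^(?P<ts>\S+) - message: Application: (?P<app>\S+?)\[[^\]]*\]: "
    r"State changed to (?P<state>[A-Z]+): (?P<detail>.*) "
    r"- type: '(?P<type>\w+)' - sub_type: '(?P<sub_type>\w+)'$"
)


def parse_state_change(line):
    """Return a dict of fields for a state-change line, or None if it does not match."""
    match = STATE_LINE.match(line.strip())
    return match.groupdict() if match else None


lines = [
    "2021-06-29T17:08:03Z - message: Application: fleet-server--7.14.0-SNAPSHOT[]: "
    "State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'",
    "2021-06-29T17:08:02Z - message: Application: fleet-server--7.14.0-SNAPSHOT[]: "
    "State changed to DEGRADED: Running on policy with Fleet Server integration: "
    "policy-elastic-agent-on-cloud; missing config fleet.agent.id "
    "(expected during bootstrap process) - type: 'STATE' - sub_type: 'RUNNING'",
]

# ISO 8601 UTC timestamps of equal length sort correctly as plain strings.
events = sorted(
    (e for e in map(parse_state_change, lines) if e is not None),
    key=lambda e: e["ts"],
)
for event in events:
    print(event["ts"], event["app"], event["state"])
```

Sorted this way, the DEGRADED entry (17:08:02Z) comes one second before the STOPPED entry (17:08:03Z), which matches the reading that the agent stopped Fleet Server right after it reported the missing `fleet.agent.id`.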

elasticmachine · 2021-06-29T20:37:34Z

Pinging @elastic/agent (Team:Agent)

blakerouse · 2021-06-29T20:43:18Z

I believe this is because of my recent change for HTTP2

blakerouse · 2021-06-29T20:43:44Z

Running the container locally shows that it crashes as it fails to enroll.

agent_1          | Error: fail to enroll: fail to execute request to fleet-server: unexpected EOF
agent_1          | Error: enrollment failed: exit status 1

blakerouse · 2021-06-29T20:46:08Z

My change has basically been completely redone by @urso in #25219, so I need to check whether that change actually fixes it or if it's still broken with that change.

tobio · 2021-06-29T23:50:17Z

I've been digging into this from another direction. The current 7.14-SNAPSHOT container cannot be successfully created in Cloud QA (or Cloud master). In https://github.com/elastic/cloud/pull/83408 we have changed the container health check to use the agent :6791/processes endpoint.

In the latest 7.14-SNAPSHOT this API appears to be unresponsive and so the container never passes the health check.
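
For anyone who wants to check this by hand, here is a rough sketch of a probe in the spirit of that health check. It is not the check from the cloud PR: the `localhost` host, the 2-second timeout and the "HTTP 200 means healthy" rule are all assumptions, and `agent_is_healthy` is a hypothetical helper name. Only the `:6791/processes` port and path come from the comment above.

```python
import http.client
import urllib.error
import urllib.request

# Talk to the local endpoint directly, ignoring any proxy configured in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def agent_is_healthy(url="http://localhost:6791/processes", timeout=2.0):
    """Return True only if the endpoint answers with HTTP 200 within `timeout` seconds.

    The host, the timeout and the 200-only rule are assumptions for this sketch.
    """
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, dropped connections and HTTP error statuses.
        return False


if __name__ == "__main__":
    print("healthy" if agent_is_healthy() else "unhealthy")
```

An endpoint that accepts the connection but never replies, which is how the API behaves in the latest snapshot, makes the probe time out and report unhealthy, so the container would never pass a check of this kind.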

amolnater-qasource · 2021-06-30T09:03:53Z

Hi @EricDavisX
We are blocked from continuing testing on the cloud build, as this issue is reproducible on our end.

Thanks
QAS

blakerouse · 2021-06-30T15:28:17Z

Validated and confirmed that #25219 fixes it in master. Just waiting for a green test run in 7.x for the backport, and then this will be fixed.

blakerouse · 2021-06-30T18:09:44Z

Fixed by #25219 and #26587

blakerouse · 2021-07-01T17:40:39Z

Seems that even with those fixes applied in 7.14, I am still seeing the following error:

agent_1          | Error: fail to enroll: fail to execute request to fleet-server: unexpected EOF
agent_1          | Error: enrollment failed: exit status 1

blakerouse · 2021-07-02T02:08:25Z

I re-opened this issue because I kept getting the same error that I commented above. That was because I was starting the same broken container image each time (user error on my part). With a Docker image built from the 7.14 branch of the beats repo, with a 7.14 fleet-server included in the bundle, the container starts up correctly.

amolnater-qasource · 2021-07-05T11:56:44Z

Hi @EricDavisX
We have revalidated this on 7.14.0 cloud-qa build and found it fixed now.

We are now able to install elastic-agent on 7.14.0 BC-1.

Thanks
QAS

EricDavisX added blocker v7.14.0 labels Jun 29, 2021

EricDavisX assigned andresrc Jun 29, 2021

botelastic bot added the needs_team label (Indicates that the issue/PR needs a Team:* label) Jun 29, 2021

EricDavisX added the Team:Elastic-Agent label (Label for the Agent team) Jun 29, 2021

botelastic bot removed the needs_team label (Indicates that the issue/PR needs a Team:* label) Jun 29, 2021

blakerouse self-assigned this Jun 29, 2021

blakerouse closed this as completed Jun 30, 2021

jen-huang mentioned this issue Jul 1, 2021

[Fleet] Better onboarding experience for Fleet Server on premise elastic/kibana#103550

Merged

blakerouse reopened this Jul 1, 2021

blakerouse closed this as completed Jul 2, 2021
