
[8.1.0-SNAPSHOT] Fleet Server can't enroll: FAILED: Missed two check-ins #1129

Closed
mtojek opened this issue Feb 3, 2022 · 32 comments · Fixed by elastic/beats#30197
Assignees: ph
Labels: bug, Team:Elastic-Agent-Control-Plane, Team:Fleet

Comments

@mtojek
Contributor

mtojek commented Feb 3, 2022

Hi,

We adopted elastic-package to use predefined agent policies and confirmed with @juliaElastic that we're ready for the switch (the main branch is green).

Since yesterday we've been facing problems with enrollment:

Attaching to elastic-package-stack_fleet-server_1
fleet-server_1 | Performing setup of Fleet in Kibana
fleet-server_1 | 
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:21.515Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":587},"message":"Spawning Elastic Agent daemon as a subprocess to complete bootstrap process.","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:21.712Z","log.origin":{"file.name":"application/application.go","file.line":78},"message":"Detecting execution mode","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:21.713Z","log.origin":{"file.name":"application/application.go","file.line":98},"message":"Agent is in Fleet Server bootstrap mode","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:22.031Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":62},"message":"Starting stats endpoint","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:22.031Z","log.origin":{"file.name":"application/fleet_server_bootstrap.go","file.line":134},"message":"Agent is starting","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:22.031Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":64},"message":"Metrics endpoint listening on: /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock (configured: unix:///usr/share/elastic-agent/state/data/tmp/elastic-agent.sock)","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:22.034Z","log.origin":{"file.name":"application/fleet_server_bootstrap.go","file.line":144},"message":"Agent is stopped","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:22.141Z","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":48},"message":"New State ID is Wb5PhdQX","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:22.142Z","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":49},"message":"Converging state requires execution of 1 step(s)","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:22.733Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-02-03T08:05:22Z - message: Application: fleet-server--8.1.0-SNAPSHOT[]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:22.735Z","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":66},"message":"Updating internal state","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:05:24.523Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":792},"message":"Fleet Server - Starting","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"warn","@timestamp":"2022-02-03T08:06:27.040Z","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'degraded'","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:06:27.040Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-02-03T08:06:27Z - message: Application: fleet-server--8.1.0-SNAPSHOT[]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:07:21.517Z","log.origin":{"file.name":"cmd/run.go","file.line":203},"message":"Shutting down Elastic Agent and sending last events...","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:07:21.519Z","log.origin":{"file.name":"operation/operator.go","file.line":223},"message":"waiting for installer of pipeline 'default' to finish","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:07:21.520Z","log.origin":{"file.name":"process/app.go","file.line":176},"message":"Signaling application to stop because of shutdown: fleet-server--8.1.0-SNAPSHOT","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"error","@timestamp":"2022-02-03T08:07:27.047Z","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'error'","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"error","@timestamp":"2022-02-03T08:07:27.047Z","log.origin":{"file.name":"log/reporter.go","file.line":36},"message":"2022-02-03T08:07:27Z - message: Application: fleet-server--8.1.0-SNAPSHOT[]: State changed to FAILED: Missed two check-ins - type: 'ERROR' - sub_type: 'FAILED'","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:07:51.570Z","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'online'","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:07:51.570Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-02-03T08:07:51Z - message: Application: fleet-server--8.1.0-SNAPSHOT[]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:07:51.570Z","log.origin":{"file.name":"cmd/run.go","file.line":211},"message":"Shutting down completed.","ecs.version":"1.6.0"}
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-03T08:07:51.570Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":66},"message":"Stats endpoint (/usr/share/elastic-agent/state/data/tmp/elastic-agent.sock) finished: accept unix /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock: use of closed network connection","ecs.version":"1.6.0"}
fleet-server_1 | Error: fleet-server failed: context canceled
fleet-server_1 | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.1/fleet-troubleshooting.html
fleet-server_1 | Error: enrollment failed: exit status 1
fleet-server_1 | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.1/fleet-troubleshooting.html

More logs: https://beats-ci.elastic.co/job/Ingest-manager/job/integrations/job/main/98/artifact/build/elastic-stack-dump/synthetics/logs/

It affects the Integrations main branch, incl. synthetics, containerd, etc.

Steps to reproduce:

elastic-package stack update -v -d --version 8.1.0-SNAPSHOT
elastic-package stack up -v -d --version 8.1.0-SNAPSHOT

Thanks for any help with investigating this problem.

cc @jlind23 @joshdover

@mtojek mtojek added the Team:Elastic-Agent-Control-Plane and Team:Fleet labels Feb 3, 2022
@joshdover
Contributor

Curious that there are no logs from Fleet Server related to the policy it selected.

One possible workaround could be to add FLEET_SERVER_POLICY_ID=fleet-server-managed-ep to the env vars for Fleet Server here: https://github.com/elastic/elastic-package/blob/main/internal/profile/_static/docker-compose-stack.yml#L79

But we should figure out what the root issue is here, regardless of whether the workaround works.
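For illustration, here is a minimal sketch of what that workaround could look like in the stack's compose file. The service name, image tag, and the surrounding variables are assumptions based on a typical Fleet Server bootstrap setup, not the actual contents of docker-compose-stack.yml; only the last line is the proposed change.

# Hypothetical excerpt of docker-compose-stack.yml (values assumed for illustration)
  fleet-server:
    image: docker.elastic.co/beats/elastic-agent:8.1.0-SNAPSHOT
    environment:
      - FLEET_SERVER_ENABLE=1                            # bootstrap Fleet Server in this agent
      - KIBANA_FLEET_SETUP=1                             # ask Kibana to run Fleet setup
      - KIBANA_FLEET_HOST=http://kibana:5601
      - FLEET_SERVER_POLICY_ID=fleet-server-managed-ep   # proposed workaround: pin the enrollment policy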

@mtojek
Contributor Author

mtojek commented Feb 3, 2022

Hey @joshdover, we tried that with Julia while working on the migration to hosted policies. Setting FLEET_SERVER_POLICY_ID=fleet-server-managed-ep would break compatibility with the 7.x stack. If it isn't really necessary, I would postpone introducing this env var. Otherwise, we'll have to implement some hack or maintain a forked stack.

@mtojek
Contributor Author

mtojek commented Feb 3, 2022

We did some investigation and it seems that the root cause is in the Elastic Agent Docker image. The last stable one we managed to build is this one: elastic/elastic-package#683 (@sha256:a5c580573376d65ed2eba92d359b411cdae4bf52745af8e3bb8c0c91f8ce53a5).

It maps onto:

elastic-agent@cd978fd268e2:~$ ./elastic-agent diagnostics
elastic-agent  version: 8.1.0
               build_commit: 56b227d00945ae97d7e8663df048c29f311b8894  build_time: 2022-02-01 07:01:00 +0000 UTC  snapshot_build: true
Applications:
  *  name: metricbeat  route_key: default
     error: Get "http://unix/": dial unix /usr/share/elastic-agent/state/data/tmp/default/metricbeat/metricbeat.sock: connect: no such file or directory

@mtojek
Contributor Author

mtojek commented Feb 3, 2022

We suspect that problem might have been introduced with this PR: elastic/beats#29031

cc @ph @blakerouse

@ph ph self-assigned this Feb 3, 2022
@ph
Contributor

ph commented Feb 3, 2022

I will take a look, but I think the PR you linked is the only major thing I know of that could have impacted the agent.

@ph ph added the bug label Feb 3, 2022
@juliaElastic
Contributor

juliaElastic commented Feb 3, 2022

@criamico @mtojek this change might be related as well: elastic/kibana#108252

When starting elastic-package locally, I see this in the fleet-server logs:
Kibana Fleet setup failed: http POST request to http://kibana:5601/api/fleet/setup fails: Forbidden: <nil>. Response: {"statusCode":403,"error":"Forbidden","message":"Forbidden"}

What's more, the .fleet-agents index is not created. I suspect that fleet-server might not have access to the Fleet API at all.

@joshdover
Contributor

joshdover commented Feb 3, 2022

I don't think it's related to elastic/kibana#108252. I've been able to successfully run this on main without any issues as a manual test:

# Create new elastic/fleet-server token
curl --request POST \
  --url http://localhost:9200/_security/service/elastic/fleet-server/credential/token \
  -u elastic:changeme

# Copy token response into authz header below
curl --request POST \
  --url http://localhost:5601/api/fleet/setup \
  --header 'authorization: Bearer <token>' \
  --header 'content-type: application/json' \
  --header 'kbn-xsrf: x' 

Do we need to update the token that we're using? Maybe our manual hardcoded token isn't working anymore due to a change in ES?

@mtojek
Contributor Author

mtojek commented Feb 3, 2022

@joshdover In this PR I pinned a specific Docker image for the Elastic Agent and it passed. The Elasticsearch and Kibana images were the same.

@ph ph closed this as completed Feb 3, 2022
@ph ph reopened this Feb 3, 2022
@ph
Contributor

ph commented Feb 3, 2022

OK, I went through all the commits in fleet-server; the latest commit that adds actual code to the server is from 4 days ago (https://github.com/elastic/fleet-server/pulls?q=is%3Apr+is%3Amerged). I am going to concentrate on the Agent side of things.

ph added a commit to ph/beats that referenced this issue Feb 3, 2022
This reverts the APM instrumentation code of the Elastic Agent to unblock the build and CI for other teams. This will require more investigation to really understand the problem.

Fixes elastic/fleet-server#1129
@ph ph mentioned this issue Feb 3, 2022
@ph
Contributor

ph commented Feb 3, 2022

This was a really deep rabbit hole. I took some time to set up a fast, working test environment for debugging, using AGENT_DROP_PATH with part of the Elastic Agent precompiled and only building for the current platform and architecture. Looking at the behavior under elastic-package, Fleet Server was simply waiting for an initial configuration. Looking at Fleet in Kibana, I could see that the system was stuck waiting for the first enrollment. The other notable behavior of Fleet Server was heavy CPU usage; maybe fleet-server is stuck in a live loop? I also tested outside of the Docker environment, and the bug was present there too.

Because of this, I initially thought the problem was in fleet-server. I bisected from the last good commit of fleet-server and the bug was still present. At that point everything pointed to the Elastic Agent side, so I also bisected the agent between the last good build and the broken one, and I was able to narrow it down to the APM instrumentation.

Looking at the implementation, the traces should have been disabled by default and should not impact any behavior of the agent, yet if I remove the whole PR, the Elastic Agent is able to do the initial enrollment into Fleet without any problems. Looking more closely at the code, I tried removing the gRPC interceptor code, but that did not fix the situation. I've decided to revert the whole APM instrumentation implementation; we will need to look into it more. I also detected that importing apmgrpc has an init side effect, but removing the import didn't fix the problem either.

Reverting the PR was not a simple revert; another pull request applied afterwards had a conflicting change.

Looking at that PR, it was green except for the e2e CI; if the latter had been working, I am confident it would have caught this issue.

Action items:

  • Write better documentation for testing this scenario.
  • Make the build much quicker; it currently takes many minutes to get a working binary.
  • Write a post-mortem.
  • Re-enable E2E testing, or add a simple job that uses elastic-package to bring up the stack.
  • Investigate high CPU usage
  • Investigate APM Instrumentation of Elastic Agent with @stuartnelson3

mergify bot pushed a commit to elastic/beats that referenced this issue Feb 4, 2022
* Revert #29031

This reverts the APM instrumentation code of the Elastic Agent to unblock the build and CI for other teams. This will require more investigation to really understand the problem.

Fixes elastic/fleet-server#1129

* fix make update

* fix linter

(cherry picked from commit 718c923)
@mtojek mtojek reopened this Feb 4, 2022
@mtojek
Contributor Author

mtojek commented Feb 4, 2022

Let's keep it open until we confirm that it's fixed.

@ph
Contributor

ph commented Feb 4, 2022

Interesting that the issue was closed from a forked repository; I don't remember ever seeing that before.

@axw
Member

axw commented Feb 8, 2022

I'm seeing what appears to be the same issue with docker.elastic.co/beats/elastic-agent:8.2.0-5d69c4c3-SNAPSHOT, which is built from elastic/beats@5529c31 (after the revert).

To reproduce, clone elastic/apm-server#7227 and run docker-compose up -d. Fleet Server fails to enroll.

Logs:

$ docker-compose logs fleet-server
Attaching to apm-server_fleet-server_1
fleet-server_1      | Requesting service_token from Kibana.
fleet-server_1      | Created service_token named: token-1644310962231
fleet-server_1      | Performing setup of Fleet in Kibana
fleet-server_1      | 
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:43.549Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":572},"message":"Spawning Elastic Agent daemon as a subprocess to complete bootstrap process.","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:43.766Z","log.origin":{"file.name":"application/application.go","file.line":68},"message":"Detecting execution mode","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:43.774Z","log.origin":{"file.name":"application/application.go","file.line":88},"message":"Agent is in Fleet Server bootstrap mode","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:44.557Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":744},"message":"Waiting for Elastic Agent to start Fleet Server","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:44.576Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":62},"message":"Starting stats endpoint","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:44.576Z","log.origin":{"file.name":"application/fleet_server_bootstrap.go","file.line":131},"message":"Agent is starting","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:44.576Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":64},"message":"Metrics endpoint listening on: /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock (configured: unix:///usr/share/elastic-agent/state/data/tmp/elastic-agent.sock)","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:44.577Z","log.origin":{"file.name":"application/fleet_server_bootstrap.go","file.line":141},"message":"Agent is stopped","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:46.731Z","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":48},"message":"New State ID is nLbqrZoq","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:46.731Z","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":49},"message":"Converging state requires execution of 1 step(s)","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:48.286Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-02-08T09:02:48Z - message: Application: fleet-server--8.2.0-SNAPSHOT[]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:48.287Z","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":66},"message":"Updating internal state","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:49.368Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-02-08T09:02:49Z - message: Application: fleet-server--8.2.0-SNAPSHOT[]: State changed to STARTING: Waiting on default policy with Fleet Server integration - type: 'STATE' - sub_type: 'STARTING'","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:02:50.566Z","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":777},"message":"Fleet Server - Waiting on default policy with Fleet Server integration","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:04:43.551Z","log.origin":{"file.name":"cmd/run.go","file.line":185},"message":"Shutting down Elastic Agent and sending last events...","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:04:43.551Z","log.origin":{"file.name":"operation/operator.go","file.line":216},"message":"waiting for installer of pipeline 'default' to finish","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:04:43.551Z","log.origin":{"file.name":"process/app.go","file.line":176},"message":"Signaling application to stop because of shutdown: fleet-server--8.2.0-SNAPSHOT","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:04:45.053Z","log.origin":{"file.name":"cmd/run.go","file.line":193},"message":"Shutting down completed.","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:04:45.053Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-02-08T09:04:45Z - message: Application: fleet-server--8.2.0-SNAPSHOT[]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'","ecs.version":"1.6.0"}
fleet-server_1      | {"log.level":"info","@timestamp":"2022-02-08T09:04:45.053Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":66},"message":"Stats endpoint (/usr/share/elastic-agent/state/data/tmp/elastic-agent.sock) finished: accept unix /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock: use of closed network connection","ecs.version":"1.6.0"}
fleet-server_1      | Error: fleet-server failed: context canceled
fleet-server_1      | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.2/fleet-troubleshooting.html
fleet-server_1      | Error: enrollment failed: exit status 1
fleet-server_1      | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.2/fleet-troubleshooting.html

@mtojek
Contributor Author

mtojek commented Feb 8, 2022

@ph Could you please check the status of the elastic-agent Docker image? The issue still persists in Integrations.

@ph
Contributor

ph commented Feb 8, 2022

@mtojek I am taking another look.

@ph
Contributor

ph commented Feb 8, 2022

@mtojek Looking at the CI failure, this concerns the 8.1 snapshots, and I didn't merge elastic/beats#30209 yet. I will double-check the failures and merge it. Is there a job that tests on master?

@axw
Member

axw commented Feb 8, 2022

@ph if you don't care about running the specific steps that @mtojek mentioned: the steps I listed in #1129 (comment) are for main (8.2.0-SNAPSHOT).

@ph
Contributor

ph commented Feb 8, 2022

unix /usr/share/elastic-agent/state/data/tmp/elastic-agent.sock: use of closed network connection","ecs.version":"1.6.0"}
fleet-server_1 | Error: fleet-server failed: context canceled
fleet-server_1 | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.2/fleet-troubleshooting.html
fleet-server_1 | Error: enrollment failed: exit status 1
fleet-server_1 | For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.2/fleet-troubleshooting.html
fleet-server_1 | {"log.level":"info","@timestamp":"2022-02-08T16:42:09.461Z","log.origin":

@mtojek
Contributor Author

mtojek commented Feb 8, 2022

@mtojek Looking at the CI failure, this concerns the 8.1 snapshots, and I didn't merge elastic/beats#30209 yet. I will double-check the failures and merge it. Is there a job that tests on master?

For example, it fails on the Integrations master branch for the containerd integration.

@ph
Contributor

ph commented Feb 8, 2022

The build is 5529c31cf1bd68bf2ad089ef747186f9510ff3f1 and does include the revert:

❯ git show 5529c31cf1bd68bf2ad089ef747186f9510ff3f1                                                                                                                                                                             [11:48:31]
commit 5529c31cf1bd68bf2ad089ef747186f9510ff3f1 (HEAD)
Author: Elastic Machine <[email protected]>
Date:   Mon Feb 7 10:17:17 2022 -0600

    [Release] update version to next minor 8.2.0 (#30160)

diff --git a/libbeat/version/version.go b/libbeat/version/version.go
index 873ae40db0..38249106a4 100644
--- a/libbeat/version/version.go
+++ b/libbeat/version/version.go
@@ -18,4 +18,4 @@
 // Code generated by dev-tools/set_version
 package version
 
-const defaultBeatVersion = "8.1.0"
+const defaultBeatVersion = "8.2.0"

This is indeed strange behavior, because I was able to reproduce the bug every time with the instrumentation commit and never without it.

@mtojek
Contributor Author

mtojek commented Feb 8, 2022

I retriggered the main job. Let's see what the current status is: link

@ph
Contributor

ph commented Feb 8, 2022

It should fail, @mtojek. I can reproduce the bug with the Docker images; the debug statements are lacking, so I am shooting a bit in the dark at this point.

@ph
Contributor

ph commented Feb 8, 2022

Interesting logs on the Kibana side; I'm not sure why we have multiple "Fleet setup completed" statements:

[2022-02-08T16:39:23.863+00:00][INFO ][status] Kibana is now degraded (was available)
[2022-02-08T16:39:24.670+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2022-02-08T16:39:24.759+00:00][INFO ][plugins.fleet] Fleet setup completed
[2022-02-08T16:39:24.798+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2022-02-08T16:39:24.890+00:00][INFO ][plugins.fleet] Fleet setup completed
[2022-02-08T16:39:29.403+00:00][INFO ][status] Kibana is now available (was degraded)
[2022-02-08T17:00:25.540+00:00][INFO ][status] Kibana is now degraded (was available)
[2022-02-08T17:00:29.298+00:00][INFO ][status] Kibana is now available (was degraded)
[2022-02-08T17:02:38.529+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2022-02-08T17:02:38.615+00:00][INFO ][plugins.fleet] Fleet setup completed
[2022-02-08T17:02:38.632+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2022-02-08T17:02:38.718+00:00][INFO ][plugins.fleet] Fleet setup completed
[2022-02-08T17:07:14.094+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2022-02-08T17:07:14.145+00:00][INFO ][plugins.fleet] Fleet setup completed
[2022-02-08T17:07:14.168+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2022-02-08T17:07:14.217+00:00][INFO ][plugins.fleet] Fleet setup completed
[2022-02-08T17:14:45.142+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2022-02-08T17:14:45.208+00:00][INFO ][plugins.fleet] Fleet setup completed
[2022-02-08T17:14:45.226+00:00][INFO ][plugins.fleet] Beginning fleet setup
[2022-02-08T17:14:45.349+00:00][INFO ][plugins.fleet] Fleet setup completed
[2022-02-08T17:25:17.553+00:00][ERROR][plugins.taskManager] Failed to poll for work: Error: work has timed out
[2022-02-08T17:25:17.580+00:00][INFO ][status] Kibana is now degraded (was available)
[2022-02-08T17:25:18.802+00:00][WARN ][plugins.kibanaUsageCollection] Average event loop delay threshold exceeded 350ms. Received 10813.265237333333ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
[2022-02-08T17:25:29.266+00:00][INFO ][status] Kibana is now available (was degraded)

@ph
Contributor

ph commented Feb 8, 2022

OK, I think we might have two different problems. Let's start with the APM Server: we recently removed the autogeneration of the fleet-server configuration without human 'intervention' (elastic/kibana#108456). Looking at the APM docker-compose file at https://github.com/elastic/apm-server/blob/main/docker-compose.yml#L41-L63, we never configure the default Fleet Server configuration. So this aligns with what we see in the log: Fleet Server is waiting on a configuration that will never exist. Elastic Package has created a PR for this: elastic/elastic-package#676

Now I will check with the elastic-package.
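For reference, a minimal sketch of the kind of preconfigured Fleet Server policy such a setup would now need to pass to Kibana. The policy name and id below are assumptions for illustration; a fuller, real example is pasted in a later comment in this thread.

# Hypothetical kibana.yml excerpt: preconfigure a policy containing the fleet_server package
# so Fleet Server has a policy to enroll into without manual intervention in the UI.
xpack.fleet.agentPolicies:
  - name: Fleet Server policy
    id: fleet-server-policy
    namespace: default
    package_policies:
      - name: Fleet Server
        package:
          name: fleet_server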

@ph
Contributor

ph commented Feb 8, 2022

When I tested #1129 (comment), I didn't use the container subcommand and used the link from the Kibana UI, so in that case Kibana generates the appropriate server configuration.

@ph
Contributor

ph commented Feb 8, 2022

Added notes here: using this configuration from elastic/kibana#108456 (comment) yields a few deprecation warnings:

xpack.fleet.agentPolicies:
  - name: Agent policy 1
    description: Agent policy 1
    is_managed: false
    namespace: default
    monitoring_enabled:
      - logs
      - metrics
    package_policies:
      - name: system-1
        id: default-system
        package:
          name: system
  - name: Fleet Server policy preconfigured
    id: fleet-server-policy
    namespace: default
    package_policies:
      - name: Fleet Server
        package:
          name: fleet_server

[2022-02-08T20:34:28.207+00:00][WARN ][config.deprecation] Config key [xpack.fleet.agentPolicies.is_default] is deprecated.
[2022-02-08T20:34:28.208+00:00][WARN ][config.deprecation] Config key [xpack.fleet.agentPolicies.is_default_fleet_server] is deprecated.
[2022-02-08T20:34:28.208+00:00][WARN ][config.deprecation] Config key [xpack.fleet.agents.elasticsearch.host] is deprecated and replaced by

@ph
Contributor

ph commented Feb 8, 2022

I still think it's something that only happens with an automated workflow; when I follow the user journey manually it seems to work, at least outside of containers.

@ph
Contributor

ph commented Feb 8, 2022

OK, 8.1.0 is stuck in a failure loop on Fleet Server; the server is not even started. This is exactly what Marcin had.
Going to do the same thing with the 8.2.0 artifacts.

@ph
Contributor

ph commented Feb 8, 2022

OK, 8.2.0 elastic-package stack works for me.

Creating network "elastic-package-stack_default" with the default driver
Creating elastic-package-stack_elasticsearch_1    ... done
Creating elastic-package-stack_package-registry_1 ... done
Creating elastic-package-stack_package-registry_is_ready_1 ... done
Creating elastic-package-stack_kibana_1                    ... done
Creating elastic-package-stack_elasticsearch_is_ready_1    ... done
Creating elastic-package-stack_fleet-server_1              ... done
Creating elastic-package-stack_kibana_is_ready_1           ... done
Creating elastic-package-stack_elastic-agent_1             ... done
Creating elastic-package-stack_fleet-server_is_ready_1     ... done
Creating elastic-package-stack_elastic-agent_is_ready_1    ... done
Done

Logging into Kibana shows both Elastic Agents connected to it; everything seems to be enrolled fine.
I don't know how it's used in CI, but 8.2.0 works here. I wonder if the assertions or the integration configuration have issues in CI.

@jlind23 @axw The main difference between 8.2.0 and 8.1.0 is really the instrumentation; fleet-server is identical.

@mtojek
Contributor Author

mtojek commented Feb 9, 2022

Thanks, @ph, for working on this to reduce the blast radius.

I opened a similar PR to verify the 8.2.0 stack: elastic/elastic-package#692

Hey @simitt @axw @stuartnelson3, I suppose you've already been researching the APM instrumentation issue. Could you please share more details or link the issue, so we can learn what went wrong here? My bet is an undetected library conflict somewhere around gRPC.

@ph
Contributor

ph commented Feb 9, 2022

The elastic-package and apm-server problems are fixed, so I am going to close this issue. If there is a problem we can reopen it.

@ph ph closed this as completed Feb 9, 2022
@simitt

simitt commented Feb 11, 2022

Hey @simitt @axw @stuartnelson3, I suppose you've already been researching the APM instrumentation issue. Could you please share more details or link the issue, so we can learn what went wrong here? My bet is an undetected library conflict somewhere around gRPC.

@stuartnelson3 is looking into this.

leweafan pushed a commit to leweafan/beats that referenced this issue Apr 28, 2023
This reverts the APM instrumentation code of the Elastic Agent to unblock the build and CI for other teams. This will require more investigation to really understand the problem.

Fixes elastic/fleet-server#1129