8.5.1 agents go unhealthy #1790
Here are the logs from one of the VMs:
[log output omitted]
Here is the status:
[status output omitted]
@belimawr has a bug tracked elsewhere for this problem; I couldn't find the link with a quick look. Does Filebeat eventually recover from this? That is usually what we observe. If it doesn't, we likely need to raise the severity of the bug.
Found the relevant bug: elastic/beats#33653
Might be unrelated, so I'm just posting to add a data point, but I've been chasing the same "operation skipped" loop messages for the last day or so. The difference in my case is that I can get new agents to start up if they enrol with an Elasticsearch output, but not with a Logstash output; they will sit forever and never reach healthy. Once they've enrolled with an Elasticsearch output I can switch back to Logstash and everything seems fine, unless I restart an agent, at which point I'm back in the same unhealthy/startup loop. If I then switch back to the Elasticsearch output, the unhealthy agents all come back online.
I also observed the issue with 8.5.2 agents. I left this setup running over the weekend and it did not recover; my agents are still marked by Fleet as unhealthy. The problem seems to happen whenever I make a change that results in a lot of network activity, often when moving the agents between policies.
@ceeeekay There is at least one known bug where the policy revision can get out of sync between the agent and Fleet using the Logstash output: elastic/fleet-server#2105. Possibly that is related to the problems you are observing.
@pjbertels looking at the error, this should have been fixed; the original issue was reopened here: elastic/beats#31670
@cmacknz That looks like the one - thanks.
@pjbertels what environment is Fleet running in here? K8s? Docker? Native? I've recently run into some issues with how this behaves inside certain k8s environments where the namespace's PID counter will reset but the filesystem remains the same, meaning the PID in the lockfile can be mapped to a new process. Also, have we confirmed that the Beats aren't trying to run multiple instances out of the same data directory?
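For context, here is a minimal Go sketch of the kind of PID liveness check this scenario defeats, assuming a plain PID lock file; the path, function name, and logic are illustrative assumptions, not the actual Beats locking code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// lockHolderAlive reports whether the PID recorded in lockPath maps to a
// currently running process. Sending signal 0 checks for existence without
// actually signalling the process.
func lockHolderAlive(lockPath string) (bool, error) {
	data, err := os.ReadFile(lockPath)
	if err != nil {
		return false, err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(string(data)))
	if err != nil {
		return false, err
	}
	// If the PID namespace resets while the filesystem persists, this stale
	// PID may now belong to a live but unrelated process, so the check
	// returns a false positive and the lock appears to still be held.
	return syscall.Kill(pid, 0) == nil, nil
}

func main() {
	// Hypothetical lock file path, for illustration only.
	alive, err := lockHolderAlive("/var/lib/filebeat/filebeat.lock")
	if err != nil {
		fmt.Println("could not read lock file:", err)
		return
	}
	fmt.Println("lock holder appears alive:", alive)
}
```

Under that assumption, a container restart that reuses the old data directory can find a recycled PID that passes the liveness check, so a new Beat instance would conclude another instance is still running and refuse to start, which would look exactly like an agent stuck unhealthy.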
This is an agent installed in an Ubuntu VM in GCP. The VM is created and the software is installed in the usual way by a tool we have (OGC). Fleet is running in ESS. There are a few issues that have been identified, and we will retest when we can get those fixes.
I can only imagine that the problem is indeed a change in the policy, which makes the agent restart, and for some reason it does not restart correctly.
We think the issue is this one ... elastic/beats#33653. |
More info from my side:
agent version:
No matter how or in which order I enroll it, if the Docker metrics and logs integration is present, Filebeat will fail (and the agent appears as unhealthy).
@osamu-kj
@cmacknz in the meantime I've upgraded ELK and all of the agents to 8.5.2 and now it's working fine. I'll let you know if something similar goes wrong on my side as well. Thanks for mentioning it :)
@pjbertels is this still occurring or can we close that issue?
Not relevant anymore, closing.
Issues encountered during Fleet Scaling testing with drones and a subset of real VMs.
Version: 8.5.1
Operating System: Linux Ubuntu VM (e2-standard-8)
Steps to Reproduce:
We used some tooling to bring up 200 VMs and 9800 Horde drones; some VMs report errors in the logs on the way up and take longer to come up. Once we begin testing, some VMs go unhealthy (33/200).