beats locker should be pid aware #31670
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
Alright, trying to spec out a few different ways to do this. This is a tad complex, as different operating systems have different file-locking implementations. The best solution here is probably to have a separate pidfile that we can check during the locking/unlocking logic.
It seems like the agent V2 branch is affected by this: elastic/elastic-agent#997. At least the V2 branch can create a situation that triggers this problem; there is likely another bug with the V2 agent creating and killing the monitoring Beats rapidly.
We have several reports that this problem is more common when the system running the agent goes to sleep or hibernates. We should explicitly try to test this situation with the fix to ensure it behaves reasonably.
Hi @joshdover @jlind23 @cmacknz
Steps followed:
Observations:
Please let us know if we are missing any scenario here.
Hi, unfortunately, after updating to
While running
We have a DaemonSet of the agents running on multiple clusters. Here are the logs from one of them:
Edit: misread the comment, this is a report against 8.5.1, which should have fixed this. @fearful-symmetry any ideas on what is going on here?
Sorry about the delay, looks like I forgot to check my github notifications on Friday. A few things:
I was also able to reproduce this running locally (macOS) with a minikube setup. I deployed one of our ECK recipes with Fleet/Elastic Agent, noticed some missing 'ClusterRole's, made the edit, and reapplied the config. The new Elastic Agent pods on 2 hosts then started logging errors, for both Metricbeat and Filebeat running in the pod, e.g.:
I could not find any process running with the noted PIDs, so it appears it is not following through and removing that 'old' lock file(?). I deleted both manually, and after a couple of minutes Elastic Agents began to report healthy again. Also, in case not already noted: spelling of 'connot' (cannot)?
@ajoliveira can you share the recipe and steps you took? I'm trying to reproduce this, but I don't have much minikube experience. This isn't a cleanup issue; the issue is the beat detecting an existing process with that PID, so it assumes another beat is already running and shuts down. I wonder if this is an issue involving k8s process namespacing somehow, or perhaps how k8s in particular is restarting the process. @ajoliveira's issue is on darwin, so I'm curious what @HaniAlshikh is running on, since it could be specific to the process API used by beats. If this turns out to be an issue with how elastic-agent or k8s is handling restarts, I wonder if we should just disable lockfiles for beats under elastic-agent, and make sure the agent is sufficiently smart about detecting whether its child processes are running.
Also @ajoliveira can I see the environment variables that agent is running with as well? |
@fearful-symmetry The recipe I was using is from elastic/cloud-on-k8s, specifically the 'fleet-kubernetes-integration' example. In trying to reproduce a separate issue, I wanted to add
Not sure if you mean ENV vars that I'm setting; the only one I added was for
But also including output of
Additional data points just in case: minikube v1.27.0 on Darwin 12.6.1 w/ Kubernetes 1.25.0
Gonna try to reproduce this a bit more. It seems like there are two possibilities:
@fearful-symmetry regarding your questions:
Reopening while we evaluate the new reports |
Looks like we are still seeing this in some of our own scale tests on VMs: elastic/elastic-agent#1790 (comment)
A bit of a wild thought, and I don't have any data to support it, but could it be related to the
I think this is definitely possible. Talking with @fearful-symmetry, we agreed we should investigate whether we even need the Beats data path locks under agent, since Beats cannot be installed with overlapping data paths by agent. We can hopefully just bypass the file locking when a Beat detects it is run with the management.enabled flag set. It would be much simpler to just eliminate the possibility of any future file lock problems.
Closing, we have removed use of these lock files when Beats are run under agent in 8.7. |
@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7? |
Manually deleting the lock files before Filebeat starts should be the workaround here. Wrapping agent or filebeat in a script that unconditionally deletes them is one way; they aren't actually necessary.
@cmacknz is there a PowerShell command you could provide to do this workaround? |
Currently it is possible for a beat to incorrectly shut down and not remove the instance locker for the data path:
beats/libbeat/cmd/instance/locker.go, lines 50 to 61 at 1d78d15
This can occur (irregularly) when a beat is running under the elastic-agent on a machine that goes to sleep/hibernation.
The data-path lock should be aware of the pid of the process that locks the directory.
If the lock (or pid file) exists but the process identified by the pid does not exist, the lock should be considered invalid and removed/replaced.
data path already locked by another beat
elastic-agent#914