
beats locker should be pid aware #31670

Closed
michel-laterman opened this issue May 18, 2022 · 23 comments · Fixed by #33169
Labels: bug, Team:Elastic-Agent-Control-Plane, v8.5.0

Comments

@michel-laterman (Contributor) commented May 18, 2022

Currently it is possible for a beat to incorrectly shut down and not remove the instance locker for the data path:

// The existing lock is a bare file lock: nothing records which process
// took it, so a lock left behind by a dead process cannot be detected.
func (l *locker) lock() error {
    isLocked, err := l.fl.TryLock()
    if err != nil {
        return errors.Wrap(err, "unable to lock data path")
    }
    if !isLocked {
        return ErrAlreadyLocked
    }
    return nil
}

This can occur intermittently when a beat is running under the elastic-agent on a machine that goes to sleep or hibernates.

The data-path lock should be aware of the PID of the process that locks the directory.
If the lock (or PID file) exists but the process identified by that PID does not, the lock should be considered invalid and be removed or replaced.
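
A minimal sketch of what a PID-aware data-path lock could look like, for illustration only: the lock file records the owning PID as JSON (the real lock files seen later in this thread also carry a write_time), and a stale file whose PID no longer maps to a live process is removed before locking. None of the names below (lockFile, processExists, lockDataPath) are the actual Beats implementation, and the liveness probe via signal 0 is Unix-only.

package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "os"
    "syscall"
)

var ErrAlreadyLocked = errors.New("data path already locked by another beat")

// lockFile is the hypothetical on-disk format: just the owner's PID.
type lockFile struct {
    PID int `json:"pid"`
}

// processExists reports whether pid maps to a live process. Unix-only sketch:
// signal 0 checks existence/permissions without delivering a signal.
func processExists(pid int) bool {
    proc, err := os.FindProcess(pid) // never fails on Unix
    if err != nil {
        return false
    }
    err = proc.Signal(syscall.Signal(0))
    // EPERM means the process exists but belongs to another user.
    return err == nil || errors.Is(err, syscall.EPERM)
}

// lockDataPath creates the lock file, first removing it if its recorded
// owner is no longer running (the "invalid lock" case described above).
func lockDataPath(path string) error {
    if raw, err := os.ReadFile(path); err == nil {
        var existing lockFile
        if json.Unmarshal(raw, &existing) == nil && !processExists(existing.PID) {
            if err := os.Remove(path); err != nil {
                return fmt.Errorf("removing stale lock: %w", err)
            }
        }
    }
    // O_EXCL makes creation fail if a live owner still holds the lock.
    f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
    if err != nil {
        if os.IsExist(err) {
            return ErrAlreadyLocked
        }
        return fmt.Errorf("unable to lock data path: %w", err)
    }
    defer f.Close()
    return json.NewEncoder(f).Encode(lockFile{PID: os.Getpid()})
}

func main() {
    if err := lockDataPath("metricbeat.lock"); err != nil {
        fmt.Fprintln(os.Stderr, "Exiting:", err)
        os.Exit(1)
    }
    fmt.Println("data path locked by pid", os.Getpid())
}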

michel-laterman added the bug and Team:Elastic-Agent-Control-Plane labels on May 18, 2022
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@fearful-symmetry (Contributor)

Alright, trying to spec out a few different ways to do this. This is a tad complex, as different operating systems have different file-locking implementations. The best solution here is probably to have a separate pidfile that we can check during the locking/unlocking logic.

@cmacknz (Member) commented Sep 29, 2022

It seems like the agent V2 branch is affected by this: elastic/elastic-agent#997

At least, the V2 branch can create a situation that triggers this problem; there is likely another bug with the V2 agent creating and killing the monitoring Beats rapidly.

@cmacknz (Member) commented Sep 29, 2022

We have several reports that this problem is more common when the system running the agent goes to sleep or hibernates. We should explicitly test this situation with the fix to ensure it behaves reasonably.

@amolnater-qasource

Hi @joshdover @jlind23 @cmacknz
We have revalidated this PR and elastic/elastic-agent#914 on 8.6 SNAPSHOT.

Steps followed:

  • Installed macOS and Windows Fleet-managed agents with the System integration.
  • Reassigned these agents to a new policy (multiple times).
  • Restarted the VMs.
  • Shut down the VMs for approximately 30 minutes and restarted them.

Observations:

  • We didn't observe any Metricbeat failure errors.
  • Filebeat and Metricbeat get back to the RUNNING state.
  • The error line below no longer appears:
    metricbeat stderr: Exiting: data path already locked by another beat. Please make sure that multiple beats are not sharing the same data path (path.data).

Please let us know if we are missing any scenario here.
Thank You!

@HaniAlshikh

Hi, unfortunately, after updating to 8.5.1 we are still observing the same error but now with a different message.

{"log.level":"error","@timestamp":"2022-11-18T11:36:01.500Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"filebeat stderr: \"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 66\\n\"","agent.console.name":"filebeat","agent.console.type":"stderr","ecs.version":"1.6.0"}

While running 8.5.0 we observed the same error mentioned above and in elastic/elastic-agent#997.

We have a DaemonSet of the agents running on multiple clusters. Here are the logs from one of them:

elastic.log

@cmacknz (Member) commented Nov 18, 2022

Edit: I misread the comment; this is a report against 8.5.1, which should have fixed this.

@fearful-symmetry any ideas on what is going on here?

@fearful-symmetry (Contributor) commented Nov 21, 2022

Sorry about the delay; it looks like I forgot to check my GitHub notifications on Friday.
@HaniAlshikh: if we get the data directory belongs to process with ... error, that means the beat sees a running process that owns the existing [BEATNAME].pid file.

A few things:

  • How reproducible is this? Is it happening 100% of the time? Just occasionally? What OS?
  • What process on the system matches that PID, and what state is that process in? For example, check with ps aux | grep 66 on linux.

@ajoliveira (Contributor)

I was also able to reproduce this running locally (macOS) with a minikube setup. I deployed one of our ECK recipes with Fleet/Elastic Agent, noticed some missing ClusterRoles, made the edit, and reapplied the config. The new Elastic Agent pods on 2 hosts then started logging errors, both for Metricbeat and Filebeat running in the pod, for example:

{"log.level":"error","@timestamp":"2022-11-21T20:38:52.073Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"metricbeat stderr: \"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 951\\n\"","agent.console.name":"metricbeat","agent.console.type":"stderr","ecs.version":"1.6.0"}


root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# for file in metricbeat--8.5.1*/*.lock ; do echo $file ; cat $file; echo ; done
metricbeat--8.5.1--36643631373035623733363936343635/metricbeat.lock
{"pid":1007,"write_time":"2022-11-21T20:26:16.647883019Z"}
metricbeat--8.5.1/metricbeat.lock
{"pid":951,"write_time":"2022-11-21T14:38:10.679257079Z"}

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# rm metricbeat--8.5.1/metricbeat.lock

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# for file in metricbeat--8.5.1*/*.lock ; do echo $file ; cat $file; echo ; done
metricbeat--8.5.1--36643631373035623733363936343635/metricbeat.lock
{"pid":1007,"write_time":"2022-11-21T20:26:16.647883019Z"}

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# uname -a
Linux minikube-m03 5.15.49-linuxkit #1 SMP Tue Sep 13 07:51:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I could not find any process running with the noted PIDs, so it appears the beat is not following through and removing that 'old' lock file. I deleted both manually, and after a couple of minutes the Elastic Agents began to report healthy again.

Also, in case it's not already noted: the error message misspells 'cannot' as 'connot'.

@fearful-symmetry (Contributor)

@ajoliveira can you share the recipe and steps you took? Trying to reproduce this, but I don't have much minikube experience.

This isn't a cleanup issue; the issue is that the beat detects an existing process with that PID, so it assumes another beat is already running and shuts down. I wonder if this involves k8s process namespacing somehow, or perhaps how k8s in particular restarts the process.

@ajoliveira's issue is on darwin, so I'm curious what @HaniAlshikh is running on, since it could be specific to the process API used by beats.

If this turns out to be an issue with how elastic-agent or k8s handles restarts, I wonder if we should just disable lockfiles for beats under elastic-agent and make sure the agent is sufficiently smart about detecting whether its child processes are running.

@fearful-symmetry (Contributor)

Also, @ajoliveira, can I see the environment variables that agent is running with?

@ajoliveira (Contributor)

@fearful-symmetry The recipe I was using is from elastic/cloud-on-k8s, specifically the 'fleet-kubernetes-integration' example.

In trying to reproduce a separate issue, I wanted to add POD_NAME as a variable to make it easier to identify the Elastic Agent holding the leader lease lock (rather than a unique ID that does not correlate to a specific pod), as well as to add more API resources to the ClusterRoles to allow collection of data for PVs, PVCs, etc. I'm assuming any other change could do the same if pods are replaced.

Not sure if you mean env vars that I'm setting; the only one I added was POD_NAME:

        containers:
        - name: agent
          securityContext:
            runAsUser: 0
          env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name

But I'm also including the output of env | sort from one of the pods: elasti-agent-pod-env.txt

Additional data points just in case: minikube v1.27.0 on Darwin 12.6.1 w/ Kubernetes 1.25.0

@fearful-symmetry (Contributor) commented Nov 21, 2022

Gonna try to reproduce this a bit more. It seems like there are a few possibilities:

  • When a beat restarts, the old beat is still running and hasn't fully shut down.
  • The process has shut down, but the OS hasn't fully cleaned up the resources, so it's reported as still running.
  • In the time between restarts, another process has started and taken the PID the beat previously had. On Docker, PIDs are usually incremented as processes start, so I'm not sure how likely this is.

@HaniAlshikh

@fearful-symmetry regarding your questions:

  • How reproducible is this? Is it happening 100% of the time? Just occasionally? What OS?
    Yes, 100% of the time, but not for all agents at once: out of the ~120 agents we had, ~20-30 were affected, though the agents seemed to alternate between having the problem and not having it (so basically all agents, just not at the same time).
    We are running the official image (docker.elastic.co/beats/elastic-agent-complete) on:
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
  • What process on the system matches that PID, and what state is that process in? For example, check with ps aux | grep 66 on linux.
    This is old data, as we had to downgrade to fix the problem, but at the time we noticed that once we got the log message it was already too late to investigate the process. We didn't invest much time here, but this is how ps aux looked:
root@es-agent-elastic-agent-64fqx:/usr/share/elastic-agent# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2500   528 ?        Ss   Nov17   0:00 /usr/bin/tini -- /usr/local/bin/docker-entrypoint
root           7  101  1.3 2214468 432280 ?      Sl   Nov17 1343:10 elastic-agent container
root       14953  0.0  0.0   4248  3424 pts/0    Ss   09:18   0:00 bash
root       14961  0.0  0.0   5900  2752 pts/0    R+   09:18   0:00 ps -aux

@joshdover (Contributor)

Reopening while we evaluate the new reports

@cmacknz (Member) commented Nov 28, 2022

Looks like we are still seeing this in some of our own scale tests on VMs: elastic/elastic-agent#1790 (comment)

@AndersonQ (Member)

A bit of a wild thought (I don't have any data to support it), but could it be related to the panic: close of closed channel... we've seen, but so far haven't caught any impact from?

@cmacknz (Member) commented Dec 6, 2022

A bit of a wild thought (I don't have any data to support it), but could it be related to the panic: close of closed channel... we've seen, but so far haven't caught any impact from?

I think this is definitely possible. After talking with @fearful-symmetry, I think we should actually investigate whether we even need the Beats data path locks under agent, since agent cannot install Beats with overlapping data paths. We can hopefully just bypass the file locking when a Beat detects it is run with the management.enabled flag set.

It would be much simpler to just eliminate the possibility of any future file lock problems.
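
A rough sketch of that bypass idea, for illustration only: the Beat skips the data-path lock entirely when it knows it is being managed by agent. The underAgent check and the MANAGED_BY_ELASTIC_AGENT variable below are hypothetical stand-ins for however the Beat actually detects the management.enabled setting; this is not the actual Beats code.

package main

import (
    "fmt"
    "os"
)

// underAgent is a stand-in for the real check; here it is approximated by a
// hypothetical environment flag, not an actual Beats or agent variable.
func underAgent() bool {
    return os.Getenv("MANAGED_BY_ELASTIC_AGENT") == "true"
}

// maybeLockDataPath only takes the data-path lock for standalone Beats.
// Under agent, data paths are assigned per process and never overlap, so
// the lock adds no safety and can be skipped.
func maybeLockDataPath(lock func() error) error {
    if underAgent() {
        return nil
    }
    return lock()
}

func main() {
    err := maybeLockDataPath(func() error {
        fmt.Println("taking data path lock (standalone mode)")
        return nil // the real locking logic would go here
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, "Exiting:", err)
        os.Exit(1)
    }
}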

@cmacknz (Member) commented Dec 13, 2022

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

cmacknz closed this as completed on Dec 13, 2022
@threatangler-jp

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7?

@bm11100 commented May 16, 2023

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7?

Same question, @cmacknz; it's affecting about 400 of our devices on 8.5.

@cmacknz (Member) commented May 16, 2023

Manually deleting the lock files before Filebeat starts should be the workaround here. Wrapping agent or filebeat in a script that unconditionally deletes them is one way; they aren't actually necessary.

@threatangler-jp

@cmacknz is there a PowerShell command you could provide to do this workaround?
