
beats locker should be pid aware #31670

Closed
michel-laterman opened this issue May 18, 2022 · 23 comments · Fixed by #33169
Labels: bug, Team:Elastic-Agent-Control-Plane, v8.5.0

Comments

@michel-laterman (Contributor) commented May 18, 2022

Currently it is possible for a beat to incorrectly shut down and not remove the instance locker for the data path:

// The existing lock is a bare file lock: nothing records which process
// took it, so a lock left behind by a dead process cannot be detected.
func (l *locker) lock() error {
    isLocked, err := l.fl.TryLock()
    if err != nil {
        return errors.Wrap(err, "unable to lock data path")
    }
    if !isLocked {
        return ErrAlreadyLocked
    }
    return nil
}

This can occur intermittently when a beat is running under the elastic-agent on a machine that goes to sleep or hibernates.

The data-path lock should be aware of the PID of the process that locks the directory.
If the lock (or PID file) exists but the process identified by that PID does not, the lock should be considered invalid and be removed or replaced.
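
A minimal sketch of what a PID-aware data-path lock could look like, for illustration only: the lock file records the owning PID as JSON (the real lock files seen later in this thread also carry a write_time), and a stale file whose PID no longer maps to a live process is removed before locking. None of the names below (lockFile, processExists, lockDataPath) are the actual Beats implementation, and the liveness probe via signal 0 is Unix-only.

package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "os"
    "syscall"
)

var ErrAlreadyLocked = errors.New("data path already locked by another beat")

// lockFile is the hypothetical on-disk format: just the owner's PID.
type lockFile struct {
    PID int `json:"pid"`
}

// processExists reports whether pid maps to a live process. Unix-only sketch:
// signal 0 checks existence/permissions without delivering a signal.
func processExists(pid int) bool {
    proc, err := os.FindProcess(pid) // never fails on Unix
    if err != nil {
        return false
    }
    err = proc.Signal(syscall.Signal(0))
    // EPERM means the process exists but belongs to another user.
    return err == nil || errors.Is(err, syscall.EPERM)
}

// lockDataPath creates the lock file, first removing it if its recorded
// owner is no longer running (the "invalid lock" case described above).
func lockDataPath(path string) error {
    if raw, err := os.ReadFile(path); err == nil {
        var existing lockFile
        if json.Unmarshal(raw, &existing) == nil && !processExists(existing.PID) {
            if err := os.Remove(path); err != nil {
                return fmt.Errorf("removing stale lock: %w", err)
            }
        }
    }
    // O_EXCL makes creation fail if a live owner still holds the lock.
    f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
    if err != nil {
        if os.IsExist(err) {
            return ErrAlreadyLocked
        }
        return fmt.Errorf("unable to lock data path: %w", err)
    }
    defer f.Close()
    return json.NewEncoder(f).Encode(lockFile{PID: os.Getpid()})
}

func main() {
    if err := lockDataPath("metricbeat.lock"); err != nil {
        fmt.Fprintln(os.Stderr, "Exiting:", err)
        os.Exit(1)
    }
    fmt.Println("data path locked by pid", os.Getpid())
}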

michel-laterman added the bug and Team:Elastic-Agent-Control-Plane labels on May 18, 2022
@elasticmachine (Collaborator)

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@fearful-symmetry (Contributor)

Alright, trying to spec out a few different ways to do this. This is a tad complex, as different operating systems have different file-locking implementations. The best solution here is probably to have a separate pidfile that we can check during the locking/unlocking logic.

@cmacknz (Member) commented Sep 29, 2022

It seems like the agent V2 branch is affected by this: elastic/elastic-agent#997

At least, the V2 branch can create a situation that triggers this problem; there is likely another bug with the V2 agent creating and killing the monitoring Beats rapidly.

@cmacknz (Member) commented Sep 29, 2022

We have several reports that this problem is more common when the system running the agent goes to sleep or hibernates. We should explicitly test this situation with the fix to ensure it behaves reasonably.

@amolnater-qasource

Hi @joshdover @jlind23 @cmacknz
We have revalidated this PR and elastic/elastic-agent#914 on 8.6 SNAPSHOT.

Steps followed:

  • Installed macOS and Windows Fleet-managed agents with the System integration.
  • Reassigned these agents to a new policy (multiple times).
  • Restarted the VMs.
  • Shut down the VMs for approximately 30 minutes and restarted them.

Observations:

  • We didn't observe any Metricbeat failure errors.
  • Filebeat and Metricbeat get back to the RUNNING state.
  • The error line below no longer appears:
    metricbeat stderr: Exiting: data path already locked by another beat. Please make sure that multiple beats are not sharing the same data path (path.data).

Please let us know if we are missing any scenario here.
Thank You!

@HaniAlshikh

Hi, unfortunately, after updating to 8.5.1 we are still observing the same error but now with a different message.

{"log.level":"error","@timestamp":"2022-11-18T11:36:01.500Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"filebeat stderr: \"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 66\\n\"","agent.console.name":"filebeat","agent.console.type":"stderr","ecs.version":"1.6.0"}

While running 8.5.0 we observed the same error mentioned above and in elastic/elastic-agent#997.

We have a DaemonSet of the agents running on multiple clusters. Here are the logs from one of them:

elastic.log

@cmacknz (Member) commented Nov 18, 2022

Edit: I misread the comment; this is a report against 8.5.1, which should have fixed this.

@fearful-symmetry any ideas on what is going on here?

@fearful-symmetry (Contributor) commented Nov 21, 2022

Sorry about the delay; it looks like I forgot to check my GitHub notifications on Friday.
@HaniAlshikh: if we get the data directory belongs to process with ... error, that means the beat sees a running process that owns the existing [BEATNAME].pid file.

A few things:

  • How reproducible is this? Is it happening 100% of the time? Just occasionally? What OS?
  • What process on the system matches that PID, and what state is that process in? For example, check with ps aux | grep 66 on linux.

@ajoliveira (Contributor)

I was also able to reproduce this running locally (macOS) with a minikube setup. I deployed one of our ECK recipes with Fleet/Elastic Agent, noticed some missing ClusterRoles, made the edit, and reapplied the config. The new Elastic Agent pods on 2 hosts then started logging errors, both for Metricbeat and Filebeat running in the pod, for example:

{"log.level":"error","@timestamp":"2022-11-21T20:38:52.073Z","log.origin":{"file.name":"process/stdlogger.go","file.line":54},"message":"metricbeat stderr: \"Exiting: cannot obtain lockfile: connot start, data directory belongs to process with pid 951\\n\"","agent.console.name":"metricbeat","agent.console.type":"stderr","ecs.version":"1.6.0"}


root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# for file in metricbeat--8.5.1*/*.lock ; do echo $file ; cat $file; echo ; done
metricbeat--8.5.1--36643631373035623733363936343635/metricbeat.lock
{"pid":1007,"write_time":"2022-11-21T20:26:16.647883019Z"}
metricbeat--8.5.1/metricbeat.lock
{"pid":951,"write_time":"2022-11-21T14:38:10.679257079Z"}

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# rm metricbeat--8.5.1/metricbeat.lock

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# for file in metricbeat--8.5.1*/*.lock ; do echo $file ; cat $file; echo ; done
metricbeat--8.5.1--36643631373035623733363936343635/metricbeat.lock
{"pid":1007,"write_time":"2022-11-21T20:26:16.647883019Z"}

root@minikube-m03:/usr/share/elastic-agent/state/data/run/default# uname -a
Linux minikube-m03 5.15.49-linuxkit #1 SMP Tue Sep 13 07:51:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I could not find any process running with the noted PIDs, so it appears the beat is not following through and removing that 'old' lock file. I deleted both manually, and after a couple of minutes the Elastic Agents began to report healthy again.

Also, in case it's not already noted: the error message misspells 'cannot' as 'connot'.

@fearful-symmetry (Contributor)

@ajoliveira can you share the recipe and steps you took? Trying to reproduce this, but I don't have much minikube experience.

This isn't a cleanup issue; the issue is that the beat detects an existing process with that PID, so it assumes another beat is already running and shuts down. I wonder if this involves k8s process namespacing somehow, or perhaps how k8s in particular restarts the process.

@ajoliveira's issue is on darwin, so I'm curious what @HaniAlshikh is running on, since it could be specific to the process API used by beats.

If this turns out to be an issue with how elastic-agent or k8s handles restarts, I wonder if we should just disable lockfiles for beats under elastic-agent and make sure the agent is sufficiently smart about detecting whether its child processes are running.

@fearful-symmetry (Contributor)

Also, @ajoliveira, can I see the environment variables that agent is running with?

@ajoliveira (Contributor)

@fearful-symmetry The recipe I was using is from elastic/cloud-on-k8s, specifically the 'fleet-kubernetes-integration' example.

In trying to reproduce a separate issue, I wanted to add POD_NAME as a variable to make it easier to identify the Elastic Agent holding the leader lease lock (rather than a unique ID that does not correlate to a specific pod), as well as to add more API resources to the ClusterRoles to allow collection of data for PVs, PVCs, etc. I'm assuming any other change could do the same if pods are replaced.

Not sure if you mean env vars that I'm setting; the only one I added was POD_NAME:

        containers:
        - name: agent
          securityContext:
            runAsUser: 0
          env:
          - name: POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name

But I'm also including the output of env | sort from one of the pods: elasti-agent-pod-env.txt

Additional data points just in case: minikube v1.27.0 on Darwin 12.6.1 w/ Kubernetes 1.25.0

@fearful-symmetry (Contributor) commented Nov 21, 2022

Gonna try to reproduce this a bit more. It seems like there are a few possibilities:

  • When a beat restarts, the old beat is still running and hasn't fully shut down.
  • The process has shut down, but the OS hasn't fully cleaned up the resources, so it's reported as still running.
  • In the time between restarts, another process has started and taken the PID the beat previously had. On Docker, PIDs are usually incremented as processes start, so I'm not sure how likely this is.

@HaniAlshikh

@fearful-symmetry regarding your questions:

  • How reproducible is this? Is it happening 100% of the time? Just occasionally? What OS?
    Yes, 100% of the time, but not for all agents at once: out of the ~120 agents we had, ~20-30 were affected, though the agents seemed to alternate between having the problem and not having it (so basically all agents, just not at the same time).
    We are running the official image (docker.elastic.co/beats/elastic-agent-complete) on:
NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
  • What process on the system matches that PID, and what state is that process in? For example, check with ps aux | grep 66 on linux.
    This is old data, as we had to downgrade to fix the problem, but at the time we noticed that once we got the log message it was already too late to investigate the process. We didn't invest much time here, but this is how ps aux looked:
root@es-agent-elastic-agent-64fqx:/usr/share/elastic-agent# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2500   528 ?        Ss   Nov17   0:00 /usr/bin/tini -- /usr/local/bin/docker-entrypoint
root           7  101  1.3 2214468 432280 ?      Sl   Nov17 1343:10 elastic-agent container
root       14953  0.0  0.0   4248  3424 pts/0    Ss   09:18   0:00 bash
root       14961  0.0  0.0   5900  2752 pts/0    R+   09:18   0:00 ps -aux

@joshdover (Contributor)

Reopening while we evaluate the new reports

@cmacknz (Member) commented Nov 28, 2022

Looks like we are still seeing this in some of our own scale tests on VMs: elastic/elastic-agent#1790 (comment)

@AndersonQ (Member)

A bit of a wild thought (I don't have any data to support it), but could it be related to the panic: close of closed channel... we've seen, but so far haven't caught any impact from?

@cmacknz (Member) commented Dec 6, 2022

A bit of a wild thought (I don't have any data to support it), but could it be related to the panic: close of closed channel... we've seen, but so far haven't caught any impact from?

I think this is definitely possible. After talking with @fearful-symmetry, I think we should actually investigate whether we even need the Beats data path locks under agent, since agent cannot install Beats with overlapping data paths. We can hopefully just bypass the file locking when a Beat detects it is run with the management.enabled flag set.

It would be much simpler to just eliminate the possibility of any future file lock problems.
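
A rough sketch of that bypass idea, for illustration only: the Beat skips the data-path lock entirely when it knows it is being managed by agent. The underAgent check and the MANAGED_BY_ELASTIC_AGENT variable below are hypothetical stand-ins for however the Beat actually detects the management.enabled setting; this is not the actual Beats code.

package main

import (
    "fmt"
    "os"
)

// underAgent is a stand-in for the real check; here it is approximated by a
// hypothetical environment flag, not an actual Beats or agent variable.
func underAgent() bool {
    return os.Getenv("MANAGED_BY_ELASTIC_AGENT") == "true"
}

// maybeLockDataPath only takes the data-path lock for standalone Beats.
// Under agent, data paths are assigned per process and never overlap, so
// the lock adds no safety and can be skipped.
func maybeLockDataPath(lock func() error) error {
    if underAgent() {
        return nil
    }
    return lock()
}

func main() {
    err := maybeLockDataPath(func() error {
        fmt.Println("taking data path lock (standalone mode)")
        return nil // the real locking logic would go here
    })
    if err != nil {
        fmt.Fprintln(os.Stderr, "Exiting:", err)
        os.Exit(1)
    }
}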

@cmacknz (Member) commented Dec 13, 2022

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

cmacknz closed this as completed on Dec 13, 2022
@threatangler-jp

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7?

@bm11100 commented May 16, 2023

Closing, we have removed use of these lock files when Beats are run under agent in 8.7.

@cmacknz Is there a workaround for those who cannot yet upgrade to 8.7?

Same question, @cmacknz; it's affecting about 400 of our devices on 8.5.

@cmacknz (Member) commented May 16, 2023

Manually deleting the lock files before Filebeat starts should be the workaround here. Wrapping agent or filebeat in a script that unconditionally deletes them is one way; they aren't actually necessary.

@threatangler-jp

@cmacknz is there a PowerShell command you could provide to do this workaround?
