
procstat can not find pid of process using pattern lookup in dockerized telegraf #6813

Closed
arindamchoudhury opened this issue Dec 18, 2019 · 6 comments · Fixed by #7185
Labels: area/docker, bug (unexpected problem or unintended behavior)

arindamchoudhury commented Dec 18, 2019

Hi,
I am using telegraf 1.13.0.

The telegraf.conf:

[global_tags]

# Configuration for telegraf agent
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  omit_hostname = false


# Send telegraf metrics to file(s)
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["stdout", "/tmp/metrics.out"]


[[inputs.procstat]]
  pattern = "redis-server"
  pid_tag = false
  pid_finder = "pgrep"
  [inputs.procstat.tags]
    service = "redis-server"

I started Telegraf with the command:

docker run -d --name=telegraf -e HOST_PROC=/rootfs/proc -e HOST_MOUNT_PREFIX=/rootfs -v /:/rootfs -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf

It can't find the process. The logs:

# docker logs -f telegraf
2019-12-18T16:54:49Z I! Starting Telegraf 1.13.0
2019-12-18T16:54:49Z I! Using config file: /etc/telegraf/telegraf.conf
2019-12-18T16:54:49Z I! Loaded inputs: procstat
2019-12-18T16:54:49Z I! Loaded aggregators: 
2019-12-18T16:54:49Z I! Loaded processors: 
2019-12-18T16:54:49Z I! Loaded outputs: file
2019-12-18T16:54:49Z I! Tags enabled: host=accffe18c96f
2019-12-18T16:54:49Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"accffe18c96f", Flush Interval:10s
procstat_lookup,host=accffe18c96f,pattern=redis-server,pid_finder=pgrep,result=success,service=redis-server pid_count=0i,running=0i,result_code=0i 1576688090000000000
procstat_lookup,host=accffe18c96f,pattern=redis-server,pid_finder=pgrep,result=success,service=redis-server pid_count=0i,running=0i,result_code=0i 1576688100000000000
procstat_lookup,host=accffe18c96f,pattern=redis-server,pid_finder=pgrep,result=success,service=redis-server pid_count=0i,running=0i,result_code=0i 1576688110000000000

The process exists.
On the host:

# pgrep -i redis-server
2135

Inside the telegraf container:

# ls /rootfs/proc/2135/
attr             cmdline          environ          io               mem              ns               pagemap          schedstat        stat             timers
autogroup        comm             exe              limits           mountinfo        numa_maps        personality      sessionid        statm            timerslack_ns
auxv             coredump_filter  fd               loginuid         mounts           oom_adj          projid_map       setgroups        status           uid_map
cgroup           cpuset           fdinfo           map_files        mountstats       oom_score        root             smaps            syscall          wchan
clear_refs       cwd              gid_map          maps             net              oom_score_adj    sched            stack            task
/ # ls /rootfs/proc/2135/status 
/rootfs/proc/2135/status

If I change to pid_finder = "native", it still cannot find it.

danielnelson added the area/docker, bug, and ready labels Dec 18, 2019
victornet (Contributor) commented Jan 10, 2020

Hi - I'm not sure this is actually a bug in Telegraf.

If you use pgrep as the finder, it doesn't take /proc under /rootfs/proc into account. The native finder seems to support this.

I did some research and got stuck here:
https://github.com/shirou/gopsutil/blob/master/process/process_posix.go#L77

If I read this right, PidExistsWithContext invokes FindProcess and then proc.Signal. But this process doesn't exist in this namespace/container.

After that, the error handling does its best and the lookup fails silently. IMO this is expected behaviour if you run Telegraf in a container.
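A minimal Go sketch of this kind of signal-0 existence check (not gopsutil's exact code; the PID is just the one from this report):

package main

import (
	"fmt"
	"os"
	"syscall"
)

// pidExists illustrates a signal-0 existence check: it can only see PIDs in the
// caller's own PID namespace, regardless of what is mounted under /rootfs/proc.
func pidExists(pid int) bool {
	proc, err := os.FindProcess(pid) // on Unix this never fails, it just wraps the PID
	if err != nil {
		return false
	}
	// Signal 0 checks existence/permission without actually delivering a signal.
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	// 2135 is the host redis-server PID from the report above; inside the
	// container this prints false even though /rootfs/proc/2135 exists.
	fmt.Println(pidExists(2135))
}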

You can even reproduce this with ps and kill from a /bin/bash shell run in a container:

[root@dockerhost ~]# docker run -ti --rm --name=telegraf-test \
> -v /proc:/proc telegraf /bin/bash
root@88169189f8ef:/# ps -opid,tty,time,cmd,pidns
  PID TT           TIME CMD                              PIDNS
15468 pts/0    00:00:00 /bin/bash                   4026532164
15621 pts/0    00:00:00 ps -opid,tty,time,cmd,utsns 4026532164
17011 pts/0    00:00:00 -bash                                -
root@88169189f8ef:/# echo $$
1
root@88169189f8ef:/# kill -0 15468
bash: kill: (15468) - No such process
root@88169189f8ef:/# kill -0 $$
root@88169189f8ef:/# 

i-prudnikov (Contributor) commented Jan 14, 2020

I can confirm that proc.Signal actually breaks the logic, as it tries to check process existence by signalling the process in the current process namespace, not respecting the case where the plugin runs inside a Docker container with its own namespace. If an older version of gopsutil is used, where the same functionality is realised by checking for the existence of /<host_fs>/proc/<PID>, then it works as expected.
Moreover, even if the signalling is fixed, there is still a bug in how the executable name is determined.
In the version of gopsutil in use (2.19.7), the executable name is detected by following the symbolic link /<host_fs>/proc/<PID>/exe, but some processes don't have this link in a correct state (the link leads nowhere, which is quite normal), and such PIDs are not included in the pool.
A more reliable way to get the process name is via /<host_fs>/proc/<PID>/stat, as done here: https://github.com/mitchellh/go-ps/blob/master/process_linux.go. I've opened an issue in the gopsutil project: shirou/gopsutil#816
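A minimal Go sketch of that approach, assuming HOST_PROC points at the bind-mounted host /proc (e.g. /rootfs/proc, as in the docker run command above); the helper name and PID are only for illustration:

package main

import (
	"fmt"
	"os"
	"strings"
)

// procNameFromStat reads the command name from <procRoot>/<pid>/stat instead of
// following the exe symlink, which may be dangling when read through a bind mount.
func procNameFromStat(procRoot string, pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("%s/%d/stat", procRoot, pid))
	if err != nil {
		return "", err
	}
	// The second field of /proc/<pid>/stat is the name in parentheses,
	// e.g. "2135 (redis-server) S ...".
	s := string(data)
	start := strings.IndexByte(s, '(')
	end := strings.LastIndexByte(s, ')')
	if start < 0 || end <= start {
		return "", fmt.Errorf("unexpected stat format for pid %d", pid)
	}
	return s[start+1 : end], nil
}

func main() {
	// With HOST_PROC=/rootfs/proc this reads /rootfs/proc/2135/stat.
	name, err := procNameFromStat(os.Getenv("HOST_PROC"), 2135)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(name)
}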

One more problem is that different versions of gopsutil are stated in Gopkg.lock and Gopkg.toml (a known dep issue); checked on master and the 1.13.0 tag.

In https://github.com/influxdata/telegraf/blob/master/Gopkg.lock:

[[projects]]
  digest = "1:9024df427b3c8a80a0c4b34e535e5e1ae922c7174e3242b6c7f30ffb3b9f715e"
  name = "github.com/shirou/gopsutil"
  packages = [
    "cpu",
    "disk",
    "host",
    "internal/common",
    "load",
    "mem",
    "net",
    "process",
  ]
  pruneopts = ""
  revision = "fc7e5e7af6052e36e83e5539148015ed2c09d8f9"
  version = "v2.19.11"

In https://github.com/influxdata/telegraf/blob/master/Gopkg.toml:

[[constraint]]
  name = "github.com/shirou/gopsutil"
  version = "2.19.7"

danielnelson (Contributor) commented:

@i-prudnikov Thanks for opening the upstream issue; it makes sense that these symlinks won't mesh well with the bind mounts and HOST_PROC.

@arindamchoudhury It seems that, as a workaround, one could bind mount the binaries you are monitoring into the container. Can you try using pid_finder = "native" and also mounting the redis server binary into the container? I'm not sure whether you would need to mount it under /rootfs or /; check the symlink in /rootfs/proc to be sure.
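For example (a hypothetical variant of the docker run command from the original report; /usr/bin/redis-server is an assumed path, check the exe symlink under /rootfs/proc/<PID> for the real one):

docker run -d --name=telegraf -e HOST_PROC=/rootfs/proc -e HOST_MOUNT_PREFIX=/rootfs -v /:/rootfs -v /usr/bin/redis-server:/usr/bin/redis-server:ro -v $PWD/telegraf.conf:/etc/telegraf/telegraf.conf:ro telegraf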

> One more problem is that different versions of gopsutil are stated in Gopkg.lock and Gopkg.toml (a known dep issue); checked on master and the 1.13.0 tag.

This is set up this way on purpose, to provide more flexibility in which versions can be used. Telegraf can use 2.19.7 or later, but we are compiling with 2.19.11 for the official packages. It's a bit academic, but it can be useful in practice when direct dependencies and transitive dependencies both depend on a library.

i-prudnikov (Contributor) commented:

Opened a pull request to fix the problem with the broken symlink /proc/<PID>/exe: shirou/gopsutil#820

ssoroka (Contributor) commented Mar 17, 2020

I managed to replicate the original problem, then tried updating to the latest gopsutil. I can confirm the upstream issue is resolved as of github.com/shirou/[email protected]. The only potential issue is shirou/gopsutil#842, but I think it's worth updating anyway.
