Disk input not reporting metrics for all mounted disk #1544

Closed
nyxcharon opened this issue Jul 25, 2016 · 15 comments · Fixed by #3529
Labels
bug unexpected problem or unintended behavior

@nyxcharon

Bug report

The disk plugin is not reporting metrics for a mounted disk, but the diskio plugin does.

Relevant telegraf.conf:

# Set Tag Configuration
[tags]
# Set Agent Configuration
[agent]
  interval = "10s"
  round_interval = true
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  debug = false
  quiet = true
  flush_buffer_when_full = true
  hostname = "hostname"
# Set output configuration
[[outputs.influxdb]]
  urls = ["http://<ip removed>:8086"]
  database = "telegraf"
  precision = "ns"
  timeout = "5s"

# Set Input Configuration
[[inputs.netstat]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.mem]]
[[inputs.cpu]]
  percpu = true
  totalcpu = true
[[inputs.disk]]
[[inputs.diskio]]
[[inputs.net]]
[[inputs.prometheus]]
  urls = ["<url 1>", "<url2>"]
  insecure_skip_verify = true
  bearer_token = "/var/run/secrets/kubernetes.io/serviceaccount/token"

System info:

Telegraf - version 1.0.0-beta2-18-g755b2ec
CoreOS stable (1010.6.0)
Docker version 1.10.3, build 8acee1b
Telegraf Docker container: quay.io/deis/telegraf:v2.1.0

Output of "mount" from inside the docker container (trimmed down to remove tmpfs mounts)
The disk of interest is /dev/xvdba.

proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,context="system_u:object_r:svirt_lxc_file_t:s0:c576,c784",gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime,seclabel)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime,seclabel)
/dev/xvda9 on /hostfs type ext4 (ro,relatime,seclabel,data=ordered)
/dev/xvda3 on /hostfs/usr type ext4 (ro,relatime,seclabel,block_validity,delalloc,barrier,user_xattr,acl)
/dev/xvda6 on /hostfs/usr/share/oem type ext4 (rw,nodev,relatime,seclabel,commit=600,data=ordered)
/dev/xvda1 on /hostfs/boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)
/dev/xvdba on /hostfs/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-west-2a/vol-bd4c4165 type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvdba on /hostfs/var/lib/kubelet/pods/ac9be55d-4eb4-11e6-8baa-0a836f4f06a7/volumes/kubernetes.io~aws-ebs/grafanadata type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvda9 on /dev/termination-log type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvda9 on /etc/resolv.conf type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvda9 on /etc/hostname type ext4 (rw,relatime,seclabel,data=ordered)
/dev/xvda9 on /etc/hosts type ext4 (rw,relatime,seclabel,data=ordered)

The docker container is run with the following environment variables set (it's launched via kubernetes which is why this is yaml)

          - name: "INFLUXDB_URLS"
            value: http://<ip>:8086
          - name: "INFLUXDB_DATABASE"
            value: "telegraf"
          - name: "HOST_PROC"
            value: "/rootfs/proc"
          - name: "HOST_SYS"
            value: "/rootfs/sys"
          - name: "AGENT_QUIET"
            value: "true"
          - name: "ENABLE_PROMETHEUS"
            value: "true"
          - name: "HOST_MOUNT_PREFIX"
            value: "/hostfs"
          - name: "HOST_ETC"
            value: "/hostfs/etc"

Steps to reproduce:

Run docker container with the above environment variables set

Expected behavior:

To be able to query influxdb and see the disk stats on "/dev/xvdba on /hostfs/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-west-2a/vol-bd4c4165 type ext4 (rw,relatime,seclabel,data=ordered)"

Actual behavior:

No disk stats are available

Additional info:

I can query the diskio stats for this disk in telegraf. The query
select * from diskio where time > now() - 30s and "host" = '<ip>' returns

2016-07-25T17:26:40Z    "<ip>"      116618  "xvdba" 12596224    7390    2365    "unknown"       3202019328  360804  356811
@nyxcharon nyxcharon changed the title Disk plugin not reporting metrics for all mounted disk Disk input not reporting metrics for all mounted disk Jul 25, 2016
@sparrc sparrc added this to the Future Milestone milestone Nov 8, 2016
@sparrc sparrc added the bug unexpected problem or unintended behavior label Nov 8, 2016
@j-vizcaino
Contributor

The problem is that HOST_MOUNT_PREFIX does not work as expected.
According to the code, it is added as a prefix to the mount paths gathered by the ps package, expecting that the mount paths would be relative to the host root (not the container).

This is the problem: if you cat /hostfs/etc/mtab you can see all the mount points of the host, but relative to the container mount point.
Example: if the host mount point is /foo, ps.Partitions() will return /hostfs/foo in the container.
Therefore, there is no need to add the HOST_MOUNT_PREFIX before issuing the os.Stat() call.

@nyxcharon Try removing the HOST_MOUNT_PREFIX and you will see all the mount points appear.

@sparrc
Contributor

sparrc commented Nov 15, 2016

thanks for tracking that down @j-vizcaino, is this just a documentation issue then? care to submit a PR?

@j-vizcaino
Contributor

Problem is in the code: it prepends HOST_MOUNT_PREFIX to the mount points gathered via ps.Partitions() whereas it should strip the prefix from the paths.
I will try to find some time to submit a PR to fix this.

@sparrc sparrc modified the milestones: 1.3.0, Future Milestone Nov 15, 2016
@j-vizcaino
Contributor

Digging more, it seems the problem is a bit more complex.
The package used to query all partitions uses the etc/mtab file. In our case, this is /hostfs/etc/mtab which should be fine because it is the same file as the host.
Things start to get tricky when etc/mtab is a symlink. In CoreOS, /etc/mtab → ../proc/self/mounts. In Debian, /etc/mtab → /proc/mounts. In these cases, the parsed mtab file is the mounts of the container, not the host.

My previous comment can be discarded: mount points should appear in /foo (from the ps.Partitions() point of view) whereas the effective mount point inside the container is /hostfs/foo.

This should be addressed in https://github.com/shirou/gopsutil by opening (/hostfs)/proc/mounts directly (mtab is deprecated anyway) and everything should work.
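
A minimal Python sketch of why the symlink matters, assuming a POSIX system where symlinks can be created. It builds a throwaway directory standing in for /hostfs and shows that the target stored in etc/mtab is an absolute path, so whoever opens the link follows it from its own root rather than from /hostfs. The helper name mtab_link_target and the directory layout are made up for illustration; this covers the absolute-target case (as in Debian), not the relative ../proc/self/mounts case.

```python
import os
import tempfile


def mtab_link_target(hostfs_root):
    """Create <hostfs_root>/etc/mtab -> /proc/self/mounts and return the stored target."""
    etc_dir = os.path.join(hostfs_root, "etc")
    os.makedirs(etc_dir, exist_ok=True)
    mtab = os.path.join(etc_dir, "mtab")
    os.symlink("/proc/self/mounts", mtab)
    return os.readlink(mtab)


with tempfile.TemporaryDirectory() as fake_hostfs:
    target = mtab_link_target(fake_hostfs)
    # The target is absolute and carries no trace of fake_hostfs, so opening
    # the link reads the mounts of the process doing the open.
    print(target)                          # /proc/self/mounts
    print(target.startswith(fake_hostfs))  # False
```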

@j-vizcaino
Contributor

> This should be addressed in https://github.com/shirou/gopsutil by opening (/hostfs)/proc/mounts directly (mtab is deprecated anyway) and everything should work.

I was wrong again.
The following applies: /etc/mtab → /proc/mounts → /proc/self/mounts
Running tests on CoreOS 1068, within a Docker container having / bound to /hostfs, issuing cat /proc/mounts gives the exact same result as cat /hostfs/proc/mounts, that is, mount points appearing prefixed with /hostfs. The only way to effectively get the mount point of the host would be to cat /hostfs/proc/1/mounts but this is ugly.
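
To make the /proc/1/mounts idea concrete, here is a minimal Python sketch; it is not gopsutil's or Telegraf's actual code. It parses text in the /proc/mounts format and maps each host mount point to the path where it would be visible inside a container that has the host's / bound at /hostfs. The names parse_mounts and container_path are hypothetical, and the sample lines are invented host-side entries, not output captured from a real machine.

```python
import posixpath

# Invented sample in /proc/mounts format: device, mount point, fstype, options, dump, pass.
SAMPLE = """\
/dev/xvda9 / ext4 rw,relatime,seclabel,data=ordered 0 0
/dev/xvda3 /usr ext4 ro,relatime,seclabel 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""


def parse_mounts(text):
    """Parse /proc/mounts-style text into (device, mount_point, fstype) tuples.

    Escaped characters such as \\040 (space) are left as-is in this sketch.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fstype = fields[:3]
        entries.append((device, mount_point, fstype))
    return entries


def container_path(mount_point, prefix="/hostfs"):
    """Map a host mount point to where it is visible inside the container."""
    relative = mount_point.lstrip("/")
    if not relative:
        return prefix  # the host root itself is the bind mount
    return posixpath.join(prefix, relative)


for device, mount_point, fstype in parse_mounts(SAMPLE):
    print(device, mount_point, "->", container_path(mount_point))
```

Inside a container set up as described above, the text would presumably come from reading /hostfs/proc/1/mounts instead of the SAMPLE string.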

@m4ce
Contributor

m4ce commented Jan 6, 2017

@j-vizcaino, did you find a solution for this? I tracked it down to the same issue you are describing, the problem is really mtab being a symlink to /proc/self/mounts :(

lrwxrwxrwx    1 root     root          17 Dec  1 04:32 /hostfs/etc/mtab -> /proc/self/mounts

@m4ce
Contributor

m4ce commented Jan 6, 2017

A workaround:

docker run --rm -v /:/hostfs:ro -e HOST_MOUNT_PREFIX=/hostfs -e HOST_ETC=/foo/etc -e HOST_PROC=/hostfs/proc -e HOST_SYS=/hostfs/sys -v /proc/1/mounts:/foo/etc/mtab -it telegraf:1.1.2 ..

HostEtc is only used in a handful of places: https://github.com/shirou/gopsutil/search?utf8=%E2%9C%93&q=HostEtc

@sparrc sparrc modified the milestones: Future Milestone, 1.3.0 Feb 9, 2017
@sparrc sparrc added the help wanted Request for community participation, code, contribution label Feb 9, 2017
@j-vizcaino
Contributor

@m4ce Sorry for the late reply. In our current deployment of telegraf, we

  • bind mount host / → /hostfs in container
  • set env HOST_ETC=/rootfs/etc
  • set env HOST_SYS=/rootfs/sys
  • set env HOST_ETC=/rootfs/proc

HOST_MOUNT_PREFIX is not set in our case: this leads to mount points being prefixed by /rootfs/ but this is the only working solution we could come to.

Hope this helps.

@johnseekins
Contributor

@j-vizcaino Trying to follow your example:

Container:

docker run -it --net=host --privileged=true -v /:/rootfs:ro -e "HOST_SYS=/rootfs/sys" -e "HOST_PROC=/rootfs/proc" -e "HOST_ETC=/rootfs/etc" <telegraf 1.2.1 image> sh

In the container:

/ # env
HOST_PROC=/rootfs/proc
HOSTNAME=...
SHLVL=1
OLDPWD=/config
HOME=/root
TERM=xterm
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOST_SYS=/rootfs/sys
HOST_ETC=/rootfs/etc
PWD=/

Then...trying to run telegraf:

/ # ./telegraf --config /config/telegraf.conf --test
* Plugin: inputs.diskio, Collection 1
...
* Plugin: inputs.kernel_vmstat, Collection 1
...
* Plugin: inputs.disk, Collection 1
2017-03-01T17:55:42Z E! error getting disk usage info: too many levels of symbolic links

What am I doing wrong here?

@johnseekins
Contributor

Ah. Running with "--privileged=true" breaks this behaviour.

@j-vizcaino
Contributor

@johnseekins The "too many levels of symbolic links" problem is usually caused by /proc/sys/fs/binfmt_misc not being mounted when Telegraf is started. Depending on your host OS, you need to make sure this is enabled.

@johnseekins
Contributor

Strange how that was failing, and now that you've mentioned the needed mount, it is working...without my having mounted anything new...

Magic!

Anyway...it looks good to me. And that seems like a reasonable work-around...should the README be updated?

@johnseekins
Contributor

Another, related question...

We now get some stats about /etc/hosts, /etc/resolv.conf, /etc/hostname and such from within the container. These are all duplicates of each other, too. Any magic tricks to get rid of these duplicates?
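
One way to think about these duplicates is that they are bind mounts of the same device. The Python sketch below is a post-processing illustration only, not a Telegraf option: it keeps a single mount point per device, preferring the shortest path. The function name dedupe_by_device is hypothetical, and the sample pairs are taken from the mount output at the top of this issue.

```python
def dedupe_by_device(mounts):
    """Keep one mount point per device: the shortest path, ties broken alphabetically."""
    best = {}
    for device, mount_point in mounts:
        current = best.get(device)
        if current is None or (len(mount_point), mount_point) < (len(current), current):
            best[device] = mount_point
    return best


# (device, mount point) pairs from the "mount" output in the bug report.
MOUNTS = [
    ("/dev/xvda9", "/hostfs"),
    ("/dev/xvda9", "/dev/termination-log"),
    ("/dev/xvda9", "/etc/resolv.conf"),
    ("/dev/xvda9", "/etc/hostname"),
    ("/dev/xvda9", "/etc/hosts"),
    ("/dev/xvda3", "/hostfs/usr"),
]

print(dedupe_by_device(MOUNTS))
# {'/dev/xvda9': '/hostfs', '/dev/xvda3': '/hostfs/usr'}
```

Within Telegraf itself, the mount_points setting of inputs.disk restricts collection to the listed mount points, which may be enough to exclude the per-file bind mounts.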

@danielnelson danielnelson removed the help wanted Request for community participation, code, contribution label Nov 30, 2017
@danielnelson danielnelson added this to the 1.4.5 milestone Nov 30, 2017
@mleonhard

With Telegraf 1.13.4, this is all you need to get inputs.disk to report on host filesystems:

  1. Mount /proc at /hostfs/proc in the container
  2. Set HOST_PROC=/hostfs/proc environment variable. This makes the gopsutil library read the host's proc instead of the container's proc.
  3. Mount each filesystem mount point into the container, under /hostfs/. For example, if you want to monitor /mnt/volume1 then mount it into the container at /hostfs/mnt/volume1.
  4. Set HOST_MOUNT_PREFIX=/hostfs to make Telegraf remove the /hostfs prefix from the value of the path field it reports.
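
The prefix removal described in step 4 can be sketched in Python as follows. This is an illustration of the idea under stated assumptions, not Telegraf's implementation; strip_mount_prefix is a hypothetical name. The sketch only strips the prefix at a path-component boundary, so a path such as /hostfsx/data is left alone.

```python
def strip_mount_prefix(path, prefix):
    """Remove a leading mount prefix such as /hostfs from a reported path."""
    prefix = prefix.rstrip("/")
    if not prefix:
        return path  # empty prefix: nothing to strip
    if path == prefix:
        return "/"  # the prefix itself corresponds to the host root
    if path.startswith(prefix + "/"):
        return path[len(prefix):]
    return path  # path is not under the prefix; leave it unchanged


for p in ["/hostfs", "/hostfs/mnt/staginggrafana", "/hostfsx/data"]:
    print(p, "->", strip_mount_prefix(p, "/hostfs"))
```

With the sample paths, /hostfs becomes /, /hostfs/mnt/staginggrafana becomes /mnt/staginggrafana, and /hostfsx/data is unchanged.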

Working example:

root@staging19:~# df -h |grep /dev
udev            481M     0  481M   0% /dev
/dev/vda1        25G  2.6G   22G  11% /
tmpfs           493M     0  493M   0% /dev/shm
/dev/vda15      105M  3.6M  101M   4% /boot/efi
/dev/sdb        888M   21M  801M   3% /mnt/staginggrafana
/dev/sda        888M   31M  790M   4% /mnt/staginginfluxdb
root@staging19:~# docker run --tty --interactive --rm \
--volume /root/telegraf.conf:/etc/telegraf/telegraf.conf \
--volume /mnt:/hostfs/mnt \
--env HOST_MOUNT_PREFIX=/hostfs \
telegraf@sha256:490e2976a5890ae6474fe36cb44764c81f17215396647c0ddae09b04e47a30b6 2>&1
2020-03-12T21:47:07Z I! Starting Telegraf 1.13.4
2020-03-12T21:47:07Z I! Using config file: /etc/telegraf/telegraf.conf
2020-03-12T21:47:07Z I! Loaded inputs: disk
2020-03-12T21:47:07Z I! Loaded aggregators: 
2020-03-12T21:47:07Z I! Loaded processors: printer
2020-03-12T21:47:07Z I! Loaded outputs: discard
2020-03-12T21:47:07Z I! Tags enabled: host=13257ee4702d
2020-03-12T21:47:07Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"13257ee4702d", Flush Interval:10s
disk,host=13257ee4702d,path=/ used_percent=10.592189563618062 1584049630000000000
disk,host=13257ee4702d,path=/mnt/staginggrafana used_percent=2.49834601782969 1584049630000000000
disk,host=13257ee4702d,path=/mnt/staginginfluxdb used_percent=3.7496608741593245 1584049630000000000
^C2020-03-12T21:47:12Z I! [agent] Hang on, flushing any cached metrics before shutdown
root@staging19:~# cat telegraf.conf
[agent]
  interval = "10s"

[[outputs.discard]]

[[processors.printer]]

[[inputs.disk]]
  # Read metrics about disk usage by mount point
  # Example tags:
  #   disk
  #   device=sda1
  #   fstype=ext4
  #   host=dev
  #   mode=rw
  #   path=/
  # Example fields:
  #   inodes_total=3907584i
  #   inodes_free=3732435i
  #   inodes_used=175149i
  #   total=62725623808i
  #   free=51542761472i
  #   used=7966146560i
  #   used_percent=13.386477459334872
  #
  # By default stats will be gathered for all mount points.
  # Set mount_points will restrict the stats to only the specified mount points.
  # mount_points = ["/"]
  # Ignore mount points by filesystem type.
  # ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
  ignore_fs = ["tmpfs"]
  
  fieldpass = ["used_percent"]
  taginclude = ["host", "path"]
  [inputs.disk.tagpass]
    path = ["/", "/mnt/*"]
root@staging19:~#

@KirannBhavaraju

> @m4ce Sorry for the late reply. In our current deployment of telegraf, we
>
>   • bind mount host / → /hostfs in container
>   • set env HOST_ETC=/rootfs/etc
>   • set env HOST_SYS=/rootfs/sys
>   • set env HOST_ETC=/rootfs/proc
>
> HOST_MOUNT_PREFIX is not set in our case: this leads to mount points being prefixed by /rootfs/ but this is the only working solution we could come to.
>
> Hope this helps.

I believe the last environment variable you are setting is HOST_PROC. For any future readers.
