
Parallelize stat calls for filesystem collector #1760

Closed

mknapphrt opened this issue Jun 22, 2020 · 6 comments

Comments

@mknapphrt
Contributor

Host operating system: output of uname -a

Linux shsys5 4.19.31.hrtdev #1 SMP Mon Mar 25 11:22:57 EDT 2019 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.18.1 (branch: release-0.18.1-hrt, revision: d7b79cb7553db72afa2de320dd0bd7f8d1fa531a)

node_exporter command line flags

ExecStart=/usr/sbin/node_exporter \
    --collector.arp \
    --collector.cpu \
    --collector.diskstats \
    --collector.edac \
    --collector.filefd \
    --collector.filesystem \
    --collector.hwmon \
    --collector.loadavg \
    --collector.mdadm \
    --collector.meminfo \
    --collector.mountstats \
    --collector.netdev \
    --collector.netstat \
    --collector.sockstat \
    --collector.stat \
    --collector.systemd \
    --collector.systemd.enable-restarts-metrics \
    --collector.tcpstat \
    --collector.textfile \
    --collector.uname \
    --collector.vmstat \
    --collector.zfs \
    --no-collector.bcache \
    --no-collector.conntrack \
    --no-collector.infiniband \
    --no-collector.interrupts \
    --no-collector.ipvs \
    --no-collector.wifi \
    --no-collector.xfs \
    --no-collector.nfs \
    --no-collector.nfsd \
    --collector.textfile.directory /var/lib/node_exporter/textfile_collector \
    --collector.systemd.unit-blacklist=".*\\.(device|swap|scope|slice)$" \
    --collector.filesystem.ignored-mount-points="^/(sys|proc|dev|usr/scratch/core)($|/)" \
    --collector.diskstats.ignored-devices="^(sr|ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p|sr)\\d+$" \
    --collector.filesystem.ignored-fs-types="^beegfs_nodev|beegfs|binfmt_misc|cgroup|devpts|fusectl|mqueue|proc|pstore|(auto|debug|devtmp|hugetlb|rpc_pipe|sys|tmp|trace)fs$" \
    --collector.vmstat.fields="^(oom_kill|pgpg|pswp|pg.*fault|pgsteal|pgscan|hugetlb).*"

Are you running node_exporter in Docker?

Nope

This is meant to open a discussion about parallelizing the stat calls done in the filesystem collector. For a couple of our filers we've got several mount points served by the same filer. This means that when the remote filer is overloaded, the stat call for each of those mount points takes a long time, and because the calls run serially, the delays add up. In a few cases this has led to scrape timeouts. It's especially problematic when a mount is just on the edge of becoming stale: it's never actually marked as stale, so subsequent scrapes continue to time out.

Any opinions on doing this? Thanks :)
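A minimal sketch of the idea (assuming Go with golang.org/x/sys/unix; the mount list and output are placeholders, not the collector's real code): launch one goroutine per mount point and wait on a sync.WaitGroup, so scrape time is bounded by the slowest single mount rather than the sum of all of them.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical mount points; the real collector reads them from /proc/mounts.
	mountPoints := []string{"/", "/home", "/mnt/nfs1"}

	type result struct {
		buf unix.Statfs_t
		err error
	}
	results := make([]result, len(mountPoints))

	var wg sync.WaitGroup
	for i, mp := range mountPoints {
		wg.Add(1)
		go func(i int, mp string) {
			defer wg.Done()
			// Each Statfs runs in its own goroutine, so one hung NFS mount
			// no longer delays the stat calls for every other mount point.
			results[i].err = unix.Statfs(mp, &results[i].buf)
		}(i, mp)
	}
	wg.Wait()

	for i, r := range results {
		if r.err != nil {
			fmt.Printf("%s: statfs failed: %v\n", mountPoints[i], r.err)
			continue
		}
		fmt.Printf("%s: %d free blocks\n", mountPoints[i], r.buf.Bfree)
	}
}
```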

@SuperQ
Member

SuperQ commented Jun 22, 2020

Thanks for the idea. In general, I'm open to this. We have done similar things in other collectors where metric retrieval is slow.

Slightly off-topic: it's been long debated whether we should filter network filesystems from the node_exporter by default. In most cases it doesn't make sense to monitor network filesystems from nodes, but it's not always possible to monitor the filesystem on the fileserver as easily, which is why we've left this enabled by default. Unlike the other items in the ignored-fs-types list, network filesystems are "real", even if they are not local to the node_exporter itself.

(Where I work, we drop NFS from our node_exporter)

@discordianfish
Member

Not opposed, but not convinced either. I'd like this to stay as simple as possible. It's probably not an issue, but this introduces some complexity in the code and raises questions like: how many calls do we want to do in parallel? Is it possible that doing a lot in parallel causes contention in the kernel?
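To make the "how many in parallel" question concrete, one common Go pattern is a counting semaphore built from a buffered channel; a sketch follows (the limit of 4 and the name statAllBounded are illustrative assumptions, not anything the collector ships).

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sys/unix"
)

// statAllBounded issues Statfs calls concurrently, but never more than
// limit at a time: the buffered channel acts as a counting semaphore.
func statAllBounded(mountPoints []string, limit int) []error {
	sem := make(chan struct{}, limit)
	errs := make([]error, len(mountPoints))
	var wg sync.WaitGroup
	for i, mp := range mountPoints {
		wg.Add(1)
		go func(i int, mp string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			var buf unix.Statfs_t
			errs[i] = unix.Statfs(mp, &buf)
		}(i, mp)
	}
	wg.Wait()
	return errs
}

func main() {
	for i, err := range statAllBounded([]string{"/", "/home"}, 4) {
		fmt.Printf("mount %d: err=%v\n", i, err)
	}
}
```

Capping parallelism like this limits pressure on the kernel while still keeping slow mounts from serializing the whole scrape.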

@Sircular

@SuperQ Do you have an example of another collector you've done this for? I'm looking at possibly taking on this issue.

@SuperQ
Member

SuperQ commented Jun 29, 2020

The systemd collector (systemd_linux.go) uses a simple wait group (sync.WaitGroup). There is also system_cpu.go in the procfs library, which uses an error group to capture the error results of each goroutine.
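For illustration, here is a minimal errgroup version of the same stat loop (a hypothetical sketch using golang.org/x/sync/errgroup, not the actual node_exporter code); Wait returns the first non-nil error from any goroutine:

```go
package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
	"golang.org/x/sys/unix"
)

func main() {
	mountPoints := []string{"/", "/home"} // placeholder mounts

	var g errgroup.Group
	stats := make([]unix.Statfs_t, len(mountPoints))
	for i, mp := range mountPoints {
		i, mp := i, mp // capture loop variables (needed before Go 1.22)
		g.Go(func() error {
			// The result lands in stats[i]; any error is reported to Wait.
			return unix.Statfs(mp, &stats[i])
		})
	}
	if err := g.Wait(); err != nil {
		fmt.Println("at least one statfs failed:", err)
	}
	fmt.Printf("%s: %d free blocks\n", mountPoints[0], stats[0].Bfree)
}
```

Newer versions of errgroup also provide SetLimit to bound the number of concurrent goroutines, which would address the parallelism question raised above.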

@pmb311

pmb311 commented Jul 10, 2020

@discordianfish Can we get this one into the next release? It's very important for production environments where one might have a couple dozen NFS mounts; missed scrapes add up quickly when, say, two of them are slow.

@nayfield

Looks like this can be closed once #1772, which has been merged to master, goes out in a release.

SuperQ closed this as completed Sep 18, 2023