
Parallelize stat calls for filesystem collector #1760

Closed

mknapphrt opened this issue Jun 22, 2020 · 6 comments

Comments

@mknapphrt
Contributor

Host operating system: output of uname -a

Linux shsys5 4.19.31.hrtdev #1 SMP Mon Mar 25 11:22:57 EDT 2019 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.18.1 (branch: release-0.18.1-hrt, revision: d7b79cb7553db72afa2de320dd0bd7f8d1fa531a)

node_exporter command line flags

ExecStart=/usr/sbin/node_exporter \
    --collector.arp \
    --collector.cpu \
    --collector.diskstats \
    --collector.edac \
    --collector.filefd \
    --collector.filesystem \
    --collector.hwmon \
    --collector.loadavg \
    --collector.mdadm \
    --collector.meminfo \
    --collector.mountstats \
    --collector.netdev \
    --collector.netstat \
    --collector.sockstat \
    --collector.stat \
    --collector.systemd \
    --collector.systemd.enable-restarts-metrics \
    --collector.tcpstat \
    --collector.textfile \
    --collector.uname \
    --collector.vmstat \
    --collector.zfs \
    --no-collector.bcache \
    --no-collector.conntrack \
    --no-collector.infiniband \
    --no-collector.interrupts \
    --no-collector.ipvs \
    --no-collector.wifi \
    --no-collector.xfs \
    --no-collector.nfs \
    --no-collector.nfsd \
    --collector.textfile.directory /var/lib/node_exporter/textfile_collector \
    --collector.systemd.unit-blacklist=".*\\.(device|swap|scope|slice)$" \
    --collector.filesystem.ignored-mount-points="^/(sys|proc|dev|usr/scratch/core)($|/)" \
    --collector.diskstats.ignored-devices="^(sr|ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\\d+n\\d+p|sr)\\d+$" \
    --collector.filesystem.ignored-fs-types="^beegfs_nodev|beegfs|binfmt_misc|cgroup|devpts|fusectl|mqueue|proc|pstore|(auto|debug|devtmp|hugetlb|rpc_pipe|sys|tmp|trace)fs$" \
    --collector.vmstat.fields="^(oom_kill|pgpg|pswp|pg.*fault|pgsteal|pgscan|hugetlb).*"

Are you running node_exporter in Docker?

Nope

This is meant to open a discussion about parallelizing the stat calls done in the filesystem collector. For a couple of our filers we've got several mount points served by the same filer. This means that when the remote filer is overloaded, the stat call for each of those mount points takes a long time, and because the calls run serially, the delays add up. In a few cases this has led to scrape timeouts. It's especially problematic when a mount is just on the edge of becoming stale: it's never actually marked as stale, so subsequent scrapes continue to time out.

Any opinions on doing this? Thanks :)
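A minimal sketch of the idea (assuming Go with golang.org/x/sys/unix; the mount list and output are placeholders, not the collector's real code): launch one goroutine per mount point and wait on a sync.WaitGroup, so scrape time is bounded by the slowest single mount rather than the sum of all of them.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sys/unix"
)

func main() {
	// Hypothetical mount points; the real collector reads them from /proc/mounts.
	mountPoints := []string{"/", "/home", "/mnt/nfs1"}

	type result struct {
		buf unix.Statfs_t
		err error
	}
	results := make([]result, len(mountPoints))

	var wg sync.WaitGroup
	for i, mp := range mountPoints {
		wg.Add(1)
		go func(i int, mp string) {
			defer wg.Done()
			// Each Statfs runs in its own goroutine, so one hung NFS mount
			// no longer delays the stat calls for every other mount point.
			results[i].err = unix.Statfs(mp, &results[i].buf)
		}(i, mp)
	}
	wg.Wait()

	for i, r := range results {
		if r.err != nil {
			fmt.Printf("%s: statfs failed: %v\n", mountPoints[i], r.err)
			continue
		}
		fmt.Printf("%s: %d free blocks\n", mountPoints[i], r.buf.Bfree)
	}
}
```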

@SuperQ
Member

SuperQ commented Jun 22, 2020

Thanks for the idea. In general, I'm open to this. We have done similar things in other collectors where metric retrieval is slow.

Slightly off-topic: it's been long debated whether we should filter network filesystems from the node_exporter by default. In most cases it doesn't make sense to monitor network filesystems from nodes, but it's not always possible to monitor the filesystem on the fileserver as easily, which is why we've left this enabled by default. Unlike the other items in the ignored-fs-types list, network filesystems are "real", even if they are not local to the node_exporter itself.

(Where I work, we drop NFS from our node_exporter)

@discordianfish
Member

Not opposed, but not convinced either. I'd like this to stay as simple as possible. It's probably not an issue, but this introduces some complexity in the code and raises questions like: how many calls do we want to do in parallel? Is it possible that doing a lot in parallel causes contention in the kernel?
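To make the "how many in parallel" question concrete, one common Go pattern is a counting semaphore built from a buffered channel; a sketch follows (the limit of 4 and the name statAllBounded are illustrative assumptions, not anything the collector ships).

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sys/unix"
)

// statAllBounded issues Statfs calls concurrently, but never more than
// limit at a time: the buffered channel acts as a counting semaphore.
func statAllBounded(mountPoints []string, limit int) []error {
	sem := make(chan struct{}, limit)
	errs := make([]error, len(mountPoints))
	var wg sync.WaitGroup
	for i, mp := range mountPoints {
		wg.Add(1)
		go func(i int, mp string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			var buf unix.Statfs_t
			errs[i] = unix.Statfs(mp, &buf)
		}(i, mp)
	}
	wg.Wait()
	return errs
}

func main() {
	for i, err := range statAllBounded([]string{"/", "/home"}, 4) {
		fmt.Printf("mount %d: err=%v\n", i, err)
	}
}
```

Capping parallelism like this limits pressure on the kernel while still keeping slow mounts from serializing the whole scrape.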

@Sircular

@SuperQ Do you have an example of another collector you've done this for? I'm looking at possibly taking on this issue.

@SuperQ
Member

SuperQ commented Jun 29, 2020

The systemd collector (systemd_linux.go) uses a simple wait group (sync.WaitGroup). There is also system_cpu.go in the procfs library, which uses an error group to capture the error results of each goroutine.
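For illustration, here is a minimal errgroup version of the same stat loop (a hypothetical sketch using golang.org/x/sync/errgroup, not the actual node_exporter code); Wait returns the first non-nil error from any goroutine:

```go
package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
	"golang.org/x/sys/unix"
)

func main() {
	mountPoints := []string{"/", "/home"} // placeholder mounts

	var g errgroup.Group
	stats := make([]unix.Statfs_t, len(mountPoints))
	for i, mp := range mountPoints {
		i, mp := i, mp // capture loop variables (needed before Go 1.22)
		g.Go(func() error {
			// The result lands in stats[i]; any error is reported to Wait.
			return unix.Statfs(mp, &stats[i])
		})
	}
	if err := g.Wait(); err != nil {
		fmt.Println("at least one statfs failed:", err)
	}
	fmt.Printf("%s: %d free blocks\n", mountPoints[0], stats[0].Bfree)
}
```

Newer versions of errgroup also provide SetLimit to bound the number of concurrent goroutines, which would address the parallelism question raised above.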

@pmb311

pmb311 commented Jul 10, 2020

@discordianfish Can we get this one into the next release? It's very important for production environments where one might have a couple dozen NFS mounts; missed scrapes add up quickly when, say, two of them are slow.

@nayfield

Looks like this can be closed once #1772, which has been merged to master, goes out in a release.

SuperQ closed this as completed Sep 18, 2023