Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZFS: io stats removed #2068

Open
chrisrd opened this issue Jul 5, 2021 · 21 comments · Fixed by #2451
Open

ZFS: io stats removed #2068

chrisrd opened this issue Jul 5, 2021 · 21 comments · Fixed by #2451
Labels

Comments

@chrisrd
Copy link

chrisrd commented Jul 5, 2021

Host operating system:

OpenZFS as of commit openzfs/zfs@371f88d

node_exporter version

$ /usr/bin/prometheus-node-exporter --version
node_exporter, version 0.17.0+ds (branch: debian/sid, revision: 0.17.0+ds-3+b11)
  build user:       [email protected]
  build date:       20190608-10:00:14
  go version:       go1.11.6

...through to master, current HEAD (3762191)

node_exporter command line flags

--collector.zfs

What did you do that produced an error?

Upgraded to OpenZFS HEAD openzfs/zfs@HEAD

What did you expect to see?

$ wget -qO - http://server:9100/metrics | grep ^node_zfs_zpool || echo "NOT FOUND"
node_zfs_zpool_nread{zpool="pool"} 2.49897949175808e+14
node_zfs_zpool_nwritten{zpool="pool"} 2.65591973699584e+14
node_zfs_zpool_rcnt{zpool="pool"} 0
node_zfs_zpool_reads{zpool="pool"} 2.729451643e+09
node_zfs_zpool_rlentime{zpool="pool"} 4.728605718663304e+16
node_zfs_zpool_rtime{zpool="pool"} 5.775579362149045e+15
node_zfs_zpool_rupdate{zpool="pool"} 1.3886342123071748e+16
node_zfs_zpool_wcnt{zpool="pool"} 0
node_zfs_zpool_wlentime{zpool="pool"} 2.8960276949205395e+17
node_zfs_zpool_writes{zpool="pool"} 3.436014883e+09
node_zfs_zpool_wtime{zpool="pool"} 1.319409654928622e+15
node_zfs_zpool_wupdate{zpool="pool"} 1.3886342121834806e+16

What did you see instead?

$ wget -qO - http://server:9100/metrics | grep ^node_zfs_zpool || echo "NOT FOUND"
NOT FOUND

Node_exporter uses /proc/spl/kstat/zfs/*/io to collect these stats. This proc file has been removed as of OpenZFS openzfs/zfs@371f88d.

See Also: openzfs/zfs#12212

@discordianfish
Copy link
Member

Ok, looks like someone needs to implement the "more efficient ways .. to obtain detailed statistics".

@chrisrd
Copy link
Author

chrisrd commented Jul 7, 2021

The "more efficient ways" are probably as done by zpool iostat. That uses an zfs ioctl to pull in at least the nread, nwritten, reads and writes part of the stats (i.e. the stuff I'm most interested in). E.g.:

# zpool iostat -p
                        capacity                         operations                         bandwidth           
pool                  alloc             free             read            write             read            write
----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------
pool        126300985387008   71611107612672             1646             3625         17295293          8371479
fast          4126006828544    3845452472832               12               92           516551          8286794
----------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------

(And more stats are available, e.g. -q shows queue stats, -v shows per-vdev stats.)

The question would be how to pull this stuff in:

  • run the external command and parse the output
  • use the libzfs2 functions as zpool iostat does, e.g. zpool_refresh_stats()
  • use the ioctl directly
    I'm not sure if libzfs2 is supposed to be internal only or meant to be used by external projects.

The other fields in the now-defunct "io" file are from struct kstat_io described here:

https://github.com/openzfs/zfs/blob/master/lib/libspl/include/sys/kstat.h#L582

I don't know which of these are more or less useful to try to get.

Sadly, my knowledge of Go is practically nonexistent, as is the time I have to put towards this.

@discordianfish
Copy link
Member

We don't allow running external commands. Technically using libzfs2 might be an option, but we'd try to avoid linking to c libraries. So ideally we can use the ioctl directly but dunno how involved that would be.

@chrisrd
Copy link
Author

chrisrd commented Jul 8, 2021

Looks like it's not straight forward (you've gotta pull apart "nvlists" and ...stuff), but someone's started on it: https://github.com/lorenz/go-zfs

@chrisrd
Copy link
Author

chrisrd commented Jul 8, 2021

There's another "go-zfs" project at https://github.com/mistifyio/go-zfs but that one appears to be a wrapper around the command line utilities so not suitable given "We don't allow running external commands".

@chrisrd
Copy link
Author

chrisrd commented Jul 8, 2021

Huh. And another at https://github.com/bicomsystems/go-libzfs which links to the libzfs c library so less suitable.

@discostur
Copy link

@discordianfish @chrisrd anything new on this? Encountered the same bug after upgrading openzfs to the latest release ...

@chrisrd
Copy link
Author

chrisrd commented Aug 5, 2021

@discostur Nothing new from me - I'm not in a position to fix this.

In the meantime I'm just monitoring one of the disks in my pool with:

irate(node_disk_read_bytes_total{device="sdf"}[5m])
irate(node_disk_written_bytes_total{device="sdf"}[5m])

This obviously doesn't provide actual pool-wide read/write stats, but it's enough to track how busy the pool is over time.

@discordianfish
Copy link
Member

@discostur No unfortunately not. Someone should look into using to get these stats from https://github.com/lorenz/go-zfs. I might get to that eventually but can't commit to when.

@lorenz
Copy link

lorenz commented Aug 14, 2021

FYI I (the go-zfs author) am working on getting a ZFS exporter up and running. Once I have something we can work on integrating it into node_exporter.

@lorenz
Copy link

lorenz commented Sep 13, 2021

I invite everyone here to give feedback on https://github.com/lorenz/zfs_exporter. As we get consensus on which metrics to expose, how and under what names we can start merging this back into node_exporter, but I think we can iterate this faster outside.

@discordianfish
Copy link
Member

Awesome! For the node-exporter we're trying to not break existing metrics before the next major release, so ideally in the version for the node-exporter the change shouldn't change the metrics generated for now. Or it could be a new collector that is disabled by default.

Alternatively, I was also thinking about adding versioned metrics endpoint. But I guess that requires some further discussion which I should finally kick off.

@jakubvojacek
Copy link

Hello

was this resolved somehow, are there new metrics/collector available that exposes node_zfs_zpool_*?

Thanks

@retzkek
Copy link

retzkek commented May 24, 2022

Just a note, since I'm not sure it was clear from the conversation so far, this is breaking the collection of all zpool metrics, since if the */io procfiles aren't found the collector doesn't try any of the others. https://github.com/prometheus/node_exporter/blob/master/collector/zfs_linux.go#L77=

@calliecameron
Copy link

As a quick fix, would it be possible to make the exporter skip over the IO metrics if their paths are missing, to make the other metrics work again? 'state' especially would be useful, since zed's state alerts don't fire for some states if you're on zfs < 2.1.3.

@discordianfish
Copy link
Member

@calliecameron That sounds reasonable to me

@siebenmann
Copy link

As a note, Ubuntu 22.04 has a version of ZFS that lacks the old 'io' kstat file for per-pool IO statistics. Anyone upgrading Ubuntu ZFS machines is no longer getting their ZFS stats through the node_exporter, perhaps unlike what they expect (the stats are certainly important to our operations, and we were looking forward to per-filesystem stats).

barrucadu added a commit to barrucadu/nixfiles that referenced this issue Oct 6, 2022
This fixes an issue where the `node_zfs_zpool_*` metrics disappeared due
to an upstream ZFS change[1].  With these metrics back, I can drop the
monitoring scripts and switch to Alert Manager.

[1]: prometheus/node_exporter#2068
@chrisrd
Copy link
Author

chrisrd commented Feb 6, 2023

I don't agree that #2451 has resolved this issue: it fixes the case that the missing zpool IO metrics would cause the rest of the (otherwise available) zpool metrics to be unavailable, but the original issue was specifically that the zpool IO metrics are no longer available (due to the 'io' kstat file being defunct). These metrics ARE available via a system call. The solution should be to use the system call to make the zpool IO metrics available once again.

@siebenmann
Copy link

Unfortunately, the ZFS IO stats that are now available have changed significantly such that some of the metrics simply can't be generated any more; there is no drop-in replacement system call that could be used to get the traditional metrics (and no way to synthetically generate all of them). An illustration of what's available in the 'new' ZFS IO stats (and implicitly what's missing) can be found in eg siebenmann/zfs_exporter (which we currently use in production). There are enough metrics and enough choices involved that I'm far from confident that they should all be shoved into the node_exporter.

(Some of the 'new' ZFS IO stats are actually present fairly far back, because they're the information 'zpool iostat' will present. The old pool-level IO stats were a separate system that OpenZFS eliminated for reasons; see openzfs/zfs#12212 )

@chrisrd
Copy link
Author

chrisrd commented Feb 7, 2023

Thanks @siebenmann, I'll certainly take a look at siebenmann/zfs_exporter.

For my use-case, I don't specifically need "the same as before" IO stats, just some stats that allow monitoring the pool level IO, so your zfs_exporter should certainly do that.

I still think this ticket should remain open until the 'new' IO pool level stats (such as available via zpool iostat -p) are available in node_exporter.

@s-burris
Copy link

@SuperQ / @discordianfish please consider reopening, hiding the errors does not solve the issue described here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants