smartctl_exporter ignores nvme devices by default #210041

Closed
robryk opened this issue Jan 10, 2023 · 4 comments · Fixed by #333961
Labels
0.kind: bug Something is broken

Comments

@robryk
Contributor

robryk commented Jan 10, 2023

Describe the bug

smartctl_exporter is completely silent in its metrics about NVMe devices:

$ curl -s http://127.0.0.1:9633/metrics | grep nvme
$ curl -s http://127.0.0.1:9633/metrics | grep sda | head -1
smartctl_device{ata_additional_product_id="unknown",ata_version="ACS-4 (minor revision not indicated)",device="sda",firmware_version="SN03",form_factor="3.5 inches",interface="sat",model_family="Seagate Exos X16",model_name="ST16000NM001G-2KK103",protocol="ATA",sata_version="SATA 3.3",serial_number="ZL2PVCR3"} 1
$ ls -l /dev/sda /dev/nvme0*
crw------- 1 root root 249, 0 Jan 10 00:35 /dev/nvme0
brw-rw---- 1 root disk 259, 0 Jan 10 00:35 /dev/nvme0n1
brw-rw---- 1 root disk 259, 1 Jan 10 00:35 /dev/nvme0n1p1
brw-rw---- 1 root disk 259, 2 Jan 10 00:35 /dev/nvme0n1p2
brw-rw---- 1 root disk   8, 0 Jan 10 00:35 /dev/sda

When you look at its log, you can see that it complains about being unable to open the device:

# journalctl -u prometheus-smartctl-exporter.service | tail
Jan 10 13:59:05 howl smartctl_exporter[3692782]: ts=2023-01-10T12:59:05.238Z caller=readjson.go:155 level=error msg="Smartctl open device: /dev/nvme0 failed: Permission denied"
Jan 10 13:59:28 howl smartctl_exporter[3692782]: ts=2023-01-10T12:59:28.883Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 2"
Jan 10 13:59:28 howl smartctl_exporter[3692782]: ts=2023-01-10T12:59:28.883Z caller=readjson.go:123 level=error msg="Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode"
Jan 10 13:59:28 howl smartctl_exporter[3692782]: ts=2023-01-10T12:59:28.883Z caller=readjson.go:155 level=error msg="Smartctl open device: /dev/nvme0 failed: Permission denied"
Jan 10 13:59:54 howl smartctl_exporter[3692782]: ts=2023-01-10T12:59:54.804Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 2"
Jan 10 13:59:54 howl smartctl_exporter[3692782]: ts=2023-01-10T12:59:54.804Z caller=readjson.go:123 level=error msg="Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode"
Jan 10 13:59:54 howl smartctl_exporter[3692782]: ts=2023-01-10T12:59:54.804Z caller=readjson.go:155 level=error msg="Smartctl open device: /dev/nvme0 failed: Permission denied"
Jan 10 14:00:54 howl smartctl_exporter[3692782]: ts=2023-01-10T13:00:54.801Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 2"
Jan 10 14:00:54 howl smartctl_exporter[3692782]: ts=2023-01-10T13:00:54.801Z caller=readjson.go:123 level=error msg="Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode"
Jan 10 14:00:54 howl smartctl_exporter[3692782]: ts=2023-01-10T13:00:54.801Z caller=readjson.go:155 level=error msg="Smartctl open device: /dev/nvme0 failed: Permission denied"
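To see at a glance which devices are affected, the repeated log lines can be reduced to a list of device paths. This is only a sketch: `denied_devices` is a hypothetical helper name, and the pattern assumes the exact message wording shown in the log above.

```shell
# Print each device path that smartctl failed to open with "Permission denied".
# Reads exporter log lines on stdin; the pattern matches the message format
# seen in the journal output above and prints each path once.
denied_devices() {
  sed -n 's/.*Smartctl open device: \([^ ]*\) failed: Permission denied.*/\1/p' | sort -u
}

# Usage (needs permission to read the journal):
#   journalctl -u prometheus-smartctl-exporter.service | denied_devices
```

On the machine from this report, this should list only the top-level NVMe device.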

Steps To Reproduce

Steps to reproduce the behavior:

  1. Have an nvme device
  2. Set prometheus.exporters.smartctl.enable = true;
  3. Notice nothing about the NVMe device present in http://127.0.0.1:9633/metrics
  4. Notice errors in journalctl -u prometheus-smartctl-exporter.service

Expected behavior

I would expect the NVMe device's SMART data to be collected. I don't have an opinion on whether it should be collected from the top-level device (e.g. /dev/nvme0) or at the namespace level (e.g. /dev/nvme0n1).

Additional context

smartctl_exporter gets started as a user that doesn't have read access to top-level nvme devices, which causes it to completely ignore them (i.e. behave as if they didn't exist; see #91 on smartctl_exporter for the bug about it being silent). It does have access to devices at namespace level, but its device auto-detection detects the top-level one.

It seems that smartctl claims that the "correct" level to query is /dev/nvme0 (see smartctl --scan), even though smartctl -a /dev/nvme0n1 also works.
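A quick way to compare what auto-detection picks with what the exporter's user can actually open is to pull the device paths out of the scan output. The sketch below assumes that each line of `smartctl --scan` output starts with the device path (for example `/dev/nvme0 -d nvme # ...`); `scanned_devices` is a hypothetical helper name.

```shell
# List the device paths that smartctl auto-detects, one per line.
# Reads `smartctl --scan` output on stdin and prints the first field of every
# non-empty line that is not a comment.
scanned_devices() {
  awk 'NF && $1 !~ /^#/ { print $1 }'
}

# Usage, run as the exporter's user to find nodes it cannot read:
#   smartctl --scan | scanned_devices | while read -r dev; do
#     [ -r "$dev" ] || echo "not readable: $dev"
#   done
```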

The exporter can query all the other devices because it runs with disk in its supplementary groups:

# systemctl cat prometheus-smartctl-exporter.service 
(...)
[Service]
(...)
SupplementaryGroups=disk

As shown above, top-level NVMe devices are owned by root:root. I guess this might be caused by them being char (as opposed to block) devices.
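The char-versus-block guess can be checked per device node with the standard `test` operators. This is a minimal sketch; `device_kind` is a hypothetical helper name, not part of any tool mentioned here.

```shell
# Classify a path as a character device, a block device, or anything else.
device_kind() {
  if [ -c "$1" ]; then
    echo char
  elif [ -b "$1" ]; then
    echo block
  else
    echo other
  fi
}

# Usage (stat -c is the GNU coreutils form):
#   for dev in /dev/nvme0 /dev/nvme0n1 /dev/sda; do
#     printf '%s %s %s\n' "$dev" "$(device_kind "$dev")" "$(stat -c '%U:%G' "$dev")"
#   done
```

Going by the `ls -l` output earlier in this report, /dev/nvme0 should come out as a char device owned by root:root, and the other two as block devices owned by root:disk.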

Notify maintainers

@mweinelt @Frostman

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.80, NixOS, 23.05 (Stoat), 23.05.git.d8b095dabd9`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.12.0`
 - channels(root): `""`
 - channels(robryk): `""`
 - nixpkgs: `/etc/nixpkgs`

My nixpkgs version is c4e1db0 with own patches on top.

@robryk robryk added the 0.kind: bug Something is broken label Jan 10, 2023
@mweinelt
Member

#205165

@NiceGuyIT

Hi @robryk, the latest version of smartctl_exporter identifies NVMe drives.

$ xh --body :19633/metrics | rg nvme
smartctl_device{ata_additional_product_id="unknown",ata_version="",device="nvme0",firmware_version="001C",form_factor="",interface="nvme",model_family="unknown",model_name="INTEL SSDPEKKW512G8",protocol="NVMe",sata_version="",serial_number="BTHH807006AA512D"} 1
smartctl_device_available_spare{device="nvme0"} 100
smartctl_device_available_spare_threshold{device="nvme0"} 10
smartctl_device_block_size{blocks_type="logical",device="nvme0"} 512
smartctl_device_block_size{blocks_type="physical",device="nvme0"} 0
smartctl_device_bytes_read{device="nvme0"} 6.1971011072e+13
smartctl_device_bytes_written{device="nvme0"} 2.9831524352e+13
smartctl_device_capacity_blocks{device="nvme0"} 1.000215216e+09
smartctl_device_capacity_bytes{device="nvme0"} 5.12110190592e+11
smartctl_device_critical_warning{device="nvme0"} 0
smartctl_device_interface_speed{device="nvme0",speed_type="current"} 0
smartctl_device_interface_speed{device="nvme0",speed_type="max"} 0
smartctl_device_media_errors{device="nvme0"} 0
smartctl_device_num_err_log_entries{device="nvme0"} 0
smartctl_device_percentage_used{device="nvme0"} 9
smartctl_device_power_cycle_count{device="nvme0"} 53
smartctl_device_power_on_seconds{device="nvme0"} 1.572588e+08
smartctl_device_smart_status{device="nvme0"} 1
smartctl_device_smartctl_exit_status{device="nvme0"} 0
smartctl_device_temperature{device="nvme0",temperature_type="current"} 41
# HELP smartctl_device_total_capacity_bytes NVMe device capacity bytes (only for devices with multiple NVMe namespaces)
smartctl_device_total_capacity_bytes{device="nvme0"} 0

However, it does need to be run by root. A little Googling led me to the blog post "Why smartctl could not be run without root". Its author concluded that root was required and went so far as to open an issue with smartmontools. After digging through the smartmontools tickets, I found the ticket the blogger opened:

RFE: add O_RDRW mode for sat/scsi/ata devices

According to function blk_verify_command() from current kernel sources (see block/scsi_ioctl.c), O_RDONLY or O_RDWR make no difference if device was opened as root (or with CAP_SYS_RAWIO).

The SCSI commands listed in function blk_set_cmd_filter_defaults() show that some of the smartctl -d scsi functionality might work with O_RDONLY for non-root users. Some more might work with O_RDWR.

But smartctl -d sat (to access SATA devices) won't work at all because the SCSI commands ATA_12 and ATA_16 (see scsi_proto.h) are always blocked for non-root users.

@robryk
Contributor Author

robryk commented Aug 27, 2023

Please note that we have smartctl running as non-root and collecting information from SATA devices. I don't know why your link claims it doesn't work (maybe it used not to?), but it does work now, so I won't try to investigate whether it used not to.

The issue here is that smartctl, when running as non-root and without permission to open top-level NVMe devices, won't collect data from NVMe drives. It lacks that permission because the built-in udev rules of the default setup do not make top-level NVMe devices group-owned by disk.

@mweinelt
Member

mweinelt commented Aug 27, 2023

Let's add a group named rawio and have udev grant it permissions on the NVMe char device, as was proposed in the upstream systemd issue.

I'm opposed to making the exporter run as root, if we don't have to, sorry.

github-actions bot pushed a commit that referenced this issue Sep 19, 2024
smartctl_exporter already runs with SupplementaryGroups "disk", which
gives full access to SATA drives, but NVMe devices are owned by
root:root, resulting in no access:

  [...] msg="Smartctl open device: /dev/nvme0 failed: Permission denied"

This patch introduces a "smartctl-exporter-access" supplementary
group, and a udev rule with setfacl to give the exporter access to NVMe
drives, without changing the base root:root ownership.

Fixes #210041

(cherry picked from commit 86a6ef5)
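
For a one-off manual test of the same idea, an ACL entry can be added by hand while leaving ownership untouched. The sketch below is not the patch itself: it only prints the `setfacl` commands, `nvme_acl_commands` is a hypothetical helper name, and the `rw` permission set is an assumption. A manual ACL would presumably not survive the device node being recreated, which is why the patch uses a udev rule.

```shell
# Print (do not run) one setfacl command per device node.
# $1 is the group to grant access to; the remaining arguments are device paths.
# The rw permission set is an assumption, not taken from the patch.
nvme_acl_commands() {
  group=$1
  shift
  for dev in "$@"; do
    printf 'setfacl -m g:%s:rw %s\n' "$group" "$dev"
  done
}

# Usage: review the output, then run it as root if it looks right:
#   nvme_acl_commands smartctl-exporter-access /dev/nvme0
```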