Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For an nvme device, thinkfan errors about device not having sensors, despite hwmon* directories being present #157

Closed
griwes opened this issue Oct 10, 2021 · 8 comments
Assignees
Labels
bug Behavior not as intended usability Unnecessarily confusing or complicated user interaction
Milestone

Comments

@griwes
Copy link

griwes commented Oct 10, 2021

I have a system with two nvme drives, their temp_input paths are as follows:

/sys/devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0/hwmon0/temp1_input
/sys/devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0/hwmon0/temp2_input
/sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1/hwmon1/temp1_input
/sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1/hwmon1/temp2_input

but the indexing of the hwmon changes on boot in ways that I do not feel like finding about. I was hoping that the following config would catch these:

- hwmon: /sys/devices/pci0000:00/0000:00:1b.0/0000:02:00.0/nvme/nvme0
  incides: [1,2]

- hwmon: /sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1
  indices: [1,2]

but it doesn't work:

Oct 10 14:57:51 <hostname> thinkfan[2336483]: ERROR: Could not find an `hwmon*' directory or `temp*_input' file in /sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1/device.

It appears that the hwmon logic searches for directories that start with both hwmon and device, but this is clearly incorrect in this case. For reference, the contents of /sys/devices/pci0000:00/0000:00:1d.0/0000:55:00.0/nvme/nvme1 are as follows:

-r--r--r--  1 root root 4096 Sep 30 23:13 address
-r--r--r--  1 root root 4096 Sep 30 23:13 cntlid
-r--r--r--  1 root root 4096 Sep 30 23:13 dev
lrwxrwxrwx  1 root root    0 Sep 30 18:51 device -> ../../../0000:55:00.0
-r--r--r--  1 root root 4096 Sep 30 18:51 firmware_rev
drwxr-xr-x  3 root root    0 Sep 30 18:51 hwmon1
-r--r--r--  1 root root 4096 Sep 30 23:13 kato
-r--r--r--  1 root root 4096 Sep 30 18:51 model
drwxr-xr-x  3 root root    0 Sep 30 18:51 ng1n1
-r--r--r--  1 root root 4096 Sep 30 23:13 numa_node
drwxr-xr-x 11 root root    0 Sep 30 18:51 nvme1n1
drwxr-xr-x  2 root root    0 Sep 30 23:13 power
-r--r--r--  1 root root 4096 Sep 30 23:13 queue_count
--w-------  1 root root 4096 Sep 30 23:13 rescan_controller
--w-------  1 root root 4096 Sep 30 23:13 reset_controller
-r--r--r--  1 root root 4096 Sep 30 18:51 serial
-r--r--r--  1 root root 4096 Sep 30 23:13 sqsize
-r--r--r--  1 root root 4096 Sep 30 23:13 state
-r--r--r--  1 root root 4096 Sep 30 23:13 subsysnqn
lrwxrwxrwx  1 root root    0 Sep 30 18:51 subsystem -> ../../../../../../class/nvme
-r--r--r--  1 root root 4096 Sep 30 23:13 transport
-rw-r--r--  1 root root 4096 Sep 30 18:51 uevent

Note hwmon1 is there, but the config parser decides to error out because device is not a hwmon directory, which is a faulty logic somewhere in the config parser.

I can't use a workaround of setting hwmon-name to nvme either, similarly to #156 , because there's symlinks identical to device within ng1n1 and nvme1n1, which makes it error out (despite all of the paths it founds being the same file!).

For the first problem, because I don't understand why thinkfan looks for device, I'd be happy with being able to explicitly tell it to not look at device at all in the config file; I believe that that would solve my original issue.

For the second problem, may I suggest that when you compile a list of hwmons, but before you check for the name being the same in multiple ones, you canonicalize the paths to remove any symlinks in the path elements, and then filter the list so that every canonical path is on the list only once? (This would be separate from #156, but even with the fix for the other issue, I think it'd still be beneficial, because on every poll, thinkfan would have to read less files, by not reading the same file multiple times.)

@griwes
Copy link
Author

griwes commented Oct 10, 2021

I have also tried a workaround of using optional and enumerating both those hwmon ids I've seen for the nvme devices on boot for both, but it appears that that doesn't stop thinkfan from erroring out when the directory itself doesn't exist.

So for now I'm working around this with explicitly listing sensors for /sys/class/hwmon/hwmon0 and /sys/class/hwmon/hwmon1, since empirically those two have always been the nvme hwmons (as far as I can tell from looking at this a number of times), but this is extremely fragile and I would rather not need to depend on the kernel not changing its driver assignments at any time for this to work :)

@griwes griwes changed the title For an nvme device, thinkfan errors about device not having sensors, despite hwmon* directories being present For an nvme device, thinkfan errors about device not having sensors, despite hwmon* directories being present Oct 10, 2021
@vmatare vmatare added enhancement Idea for an improvement or a new feature usability Unnecessarily confusing or complicated user interaction and removed enhancement Idea for an improvement or a new feature usability Unnecessarily confusing or complicated user interaction labels Oct 15, 2021
@vmatare
Copy link
Owner

vmatare commented Oct 15, 2021

Huh. Sorry, just got confused there because I overlooked some details. From looking at your directory listing, the way you configured it should actually work. I'll have to take another good look at the search logic.

@vmatare vmatare added bug Behavior not as intended usability Unnecessarily confusing or complicated user interaction labels Oct 15, 2021
@DanielWeigl
Copy link

DanielWeigl commented Nov 5, 2021

I have the same problem since updating to Ubuntu 21.10 (it worked on previous version)

  - hwmon: /sys/class/block/nvme0n1
    indices: [1,2,3]
    optional: true

leads to ERROR: Could not find an 'hwmon*' directory or 'temp*_input' file in /sys/class/block/nvme0n1/device/device.

Which is a bit confusing, because there is one:

find /sys/class/block/nvme0n1/device/ | grep hwmon
/sys/class/block/nvme0n1/device/hwmon4
/sys/class/block/nvme0n1/device/hwmon4/temp2_min
/sys/class/block/nvme0n1/device/hwmon4/temp2_max
/sys/class/block/nvme0n1/device/hwmon4/temp3_input
...

Also, specifying a name:

  - hwmon: /sys/class/block/nvme0n1
    name: nvme
    indices: [1,2,3]
    optional: true

leads to

ERROR: /etc/thinkfan.yaml:57:
    name: nvme
          ^
Found multiple hwmons with this name:  /sys/class/block/nvme0n1/device/nvme0n1/device/hwmon4 /sys/class/block/nvme0n1/device/hwmon4 /sys/class/block/nvme0n1/device/ng0n1/device/hwmon4.

This works:

  - hwmon: /sys/class/block/nvme0n1/device/hwmon4
    indices: [1,2,3]
    optional: true

But the index for hwmon4 is not stable.

uname -a
Linux t15 5.13.0-20-generic #20-Ubuntu SMP Fri Oct 15 14:21:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

thinkfan       version 1.2.1-3  

@vmatare
Copy link
Owner

vmatare commented Nov 5, 2021

@DanielWeigl if you want to help you can try out the lm-sensors branch, which adds support for libsensors. Then you can specify sensors like this this:

  - chip: CHIP-NAME
    ids:
      - FEATURE-ID0
      - FEATURE-ID1
      - ...

You can find CHIP-NAME and FEATURE-IDs by running sensors. E.g. for my NVME SSD I get:

nvme-pci-2500
Adapter: PCI adapter
Composite:    +35.9°C  (low  = -273.1°C, high = +80.8°C)
                       (crit = +80.8°C)
Sensor 1:     +35.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +44.9°C  (low  = -273.1°C, high = +65261.8°C)

So that would yield the config:

  - chip: nvme-pci-2500
    ids:
      - "Sensor 1"

So far libsensors support is looking good, but we've been observing an issue where it breaks down after resuming from suspend. Would be interesting to know whether you're also seeing this issue or not. On my system it has been working fine for days across multiple suspend/resume cycles as of now.

@koutheir
Copy link
Contributor

koutheir commented Nov 5, 2021

So far libsensors support is looking good, but we've been observing an issue where it breaks down after resuming from suspend. Would be interesting to know whether you're also seeing this issue or not.

@DanielWeigl, the commit fe9d2e8 in the lm-sensors branch introduced retries for some amount of time when reading sensors fails. This fixes the issue for me by retrying 5 to 6 times after resuming from suspend. If you want to test that branch without this fault-tolerant behavior, then you'll need to fetch and build the commit 05f1636 from the lm-sensors branch.

You can also test the fault-tolerant build (by building lm-sensors's head commit) and monitor the thinkfan logs for messages like this:

Retrying reading LM sensor 'CPU' of chip 'thinkpad-isa-0000': Can't read (attempt 6)

@DanielWeigl
Copy link

Hi - thank you for the heads up and for the new feature - looks good.

Im now running the latest lm-sensors branch with this settings:

  - chip: nvme-pci-5500
    ids:
     - "Sensor 1"
     - "Sensor 2"

  - chip: nvme-pci-0200
    ids:
     - "Sensor 1"
     - "Sensor 2"

so far it works, ill install it as default and will report back

@DanielWeigl
Copy link

just fyi, i have it now running with above version/config for a few days and some standbys and reboots - sofar it works and there are no errors logged to journalctl -u thinkfan.service

Only these messages:

Nov 10 17:12:59 d-t15 thinkfan[1435]: Temperatures(bias): 48(0), 45(0), 46(0), 46(0), 45(0), 35(0), 33(0), 34(0), 30(0), 43(0), 45(0) -> Fans: level 1
Nov 10 17:15:24 d-t15 thinkfan[1435]: Going to sleep: Will allow sensor read errors for the next 4 loops.
Nov 11 09:11:04 d-t15 thinkfan[1435]: Received SIGUSR2: Re-initializing fan control.
Nov 11 09:11:04 d-t15 thinkfan[1435]: Temperatures(bias): 53(0), 46(0), 31(0), 53(0), 33(0), 24(0), 23(0), 24(0), 23(0), 23(0), 23(0) -> Fans: level 1

@vmatare vmatare added this to the 2.0 milestone Mar 29, 2022
@vmatare vmatare self-assigned this Mar 29, 2022
@vmatare
Copy link
Owner

vmatare commented Aug 13, 2022

I'm fairly certain that the new lm-sensors-based config syntax is the right solution to this problem. I've been thinking about supporting this case this in the sysfs/hwmon path-based config as well, but I've come to the conclusion that resolving the ambiguity there is not worth the effort. I like the lm-sensors interface (i.e. with the chip: keyword) much better since it eliminates other problems, as well.

@vmatare vmatare closed this as completed Aug 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Behavior not as intended usability Unnecessarily confusing or complicated user interaction
Projects
None yet
Development

No branches or pull requests

4 participants