SMART input works in 1.15.3 and fails in 1.16.0 with exact same config #8313

Closed
chrishoage opened this issue Oct 25, 2020 · 7 comments · Fixed by #8374
Labels
bug unexpected problem or unintended behavior

Comments

@chrishoage

Relevant telegraf.conf:

[[inputs.smart]]
  use_sudo = true
  devices = [
    "/dev/disk/by-id/ata-Crucial_CT525MX300SSD1_1651150FA577",
    "/dev/disk/by-id/ata-Crucial_CT525MX300SSD1_16431465A85A",
    "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK1334PEJLL6NS",
    "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK1334PEK49SBS",
    "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK1334PEKDNZ0S",
    "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK1334PEKDXVTS",
    "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK2334PEJM9B3T",
    "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK2334PEK4AXTT",
    "/dev/disk/by-id/scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4E4FKJ5DV",
    "/dev/disk/by-id/scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4E4FKJH1X",
    "/dev/disk/by-id/scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EECRN58H",
    "/dev/disk/by-id/scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EK8ZSK37",
    "/dev/disk/by-id/scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EM0WN624"
  ]

System info:

› uname -a
Linux cortex 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Docker

Steps to reproduce:

  1. Install 1.15.3
  2. run with config
  3. verify output with telegraf --test
  4. upgrade to 1.16.0
  5. run telegraf --test

Expected behavior:

Config to work after upgrade

Actual behavior:

Config fails.

Additional info:

I initially saw this error. After installing nvme-cli the error went away, but the SMART input would still not output anything

[inputs.smart] nvme not found: verify that nvme is installed and it is in your PATH (or specified in config) to gather vendor specific attributes: provided path does not exist: []
› sudo -u telegraf telegraf --config /etc/telegraf/telegraf.conf  --test | grep smart
2020-10-25T21:32:18Z I! Starting Telegraf 1.15.3
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EM0WN624,host=cortex exit_status=2i 1603661539000000000
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EECRN58H,host=cortex exit_status=2i 1603661539000000000
> smart_device,capacity=525112713216,device=ata-Crucial_CT525MX300SSD1_16431465A85A,enabled=Enabled,host=cortex,model=Crucial_CT525MX300SSD1,serial_no=16431465A85A,wwn=500a07511465a85a exit_status=0i,health_ok=true,read_error_rate=2i,temp_c=37i,udma_crc_errors=0i 1603661539000000000
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4E4FKJH1X,host=cortex exit_status=2i 1603661539000000000
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4E4FKJ5DV,host=cortex exit_status=2i 1603661539000000000
> smart_device,capacity=525112713216,device=ata-Crucial_CT525MX300SSD1_1651150FA577,enabled=Enabled,host=cortex,model=Crucial_CT525MX300SSD1,serial_no=1651150FA577,wwn=500a0751150fa577 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661539000000000
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EK8ZSK37,host=cortex exit_status=2i 1603661539000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEJLL6NS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEJLL6NS,wwn=5000cca250e4a210 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=33i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK2334PEJM9B3T,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK2334PEJM9B3T,wwn=5000cca250e4f530 exit_status=0i,health_ok=true,read_error_rate=2i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEKDXVTS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEKDXVTS,wwn=5000cca250f02751 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=33i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK2334PEK4AXTT,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK2334PEK4AXTT,wwn=5000cca250ec4105 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEKDNZ0S,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEKDNZ0S,wwn=5000cca250f009ad exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=37i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEK49SBS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEK49SBS,wwn=5000cca250ec3c9c exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661540000000000
› sudo cat /etc/sudoers.d/telegraf
Cmnd_Alias SMARTCTL = /usr/sbin/smartctl
telegraf ALL=(ALL) NOPASSWD: SMARTCTL
Defaults!SMARTCTL !logfile, !syslog, !pam_session

Cmnd_Alias NVME = /usr/sbin/nvme
telegraf ALL=(ALL) NOPASSWD: NVME
Defaults!NVME !logfile, !syslog, !pam_session
› sudo -u telegraf bash
telegraf@cortex:~$ which smartctl
/usr/sbin/smartctl
telegraf@cortex:~$ which nvme
/usr/sbin/nvme
telegraf@cortex:~$ sudo smartctl --info --attributes --health -n standby --format=brief /dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK1334PEJLL6NS
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-52-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     HGST Deskstar NAS
Device Model:     HGST HDN724040ALE640
Serial Number:    PK1334PEJLL6NS
LU WWN Device Id: 5 000cca 250e4a210
Firmware Version: MJAOA5E0
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Oct 25 14:29:35 2020 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   016    -    0
  2 Throughput_Performance  P-S---   136   136   054    -    83
  3 Spin_Up_Time            POS---   165   165   024    -    497 (Average 440)
  4 Start_Stop_Count        -O--C-   100   100   000    -    47
  5 Reallocated_Sector_Ct   PO--CK   100   100   005    -    0
  7 Seek_Error_Rate         PO-R--   100   100   067    -    0
  8 Seek_Time_Performance   P-S---   121   121   020    -    34
  9 Power_On_Hours          -O--C-   096   096   000    -    32685
 10 Spin_Retry_Count        PO--C-   100   100   060    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    47
192 Power-Off_Retract_Count -O--CK   100   100   000    -    308
193 Load_Cycle_Count        -O--C-   100   100   000    -    308
194 Temperature_Celsius     -O----   181   181   000    -    33 (Min/Max 23/55)
196 Reallocated_Event_Count -O--CK   100   100   000    -    0
197 Current_Pending_Sector  -O---K   100   100   000    -    0
198 Offline_Uncorrectable   ---R--   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O-R--   200   200   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

Here are sample commands showing that downgrading works:

chris at cortex in ~
› sudo apt-get upgrade telegraf
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Calculating upgrade... Done
The following packages will be upgraded:
  telegraf
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/21.8 MB of archives.
After this operation, 1,598 kB of additional disk space will be used.
Do you want to continue? [Y/n] y
(Reading database ... 113471 files and directories currently installed.)
Preparing to unpack .../telegraf_1.16.0-1_amd64.deb ...
Unpacking telegraf (1.16.0-1) over (1.15.3-1) ...
Setting up telegraf (1.16.0-1) ...
Installing new version of config file /etc/telegraf/telegraf.conf.sample ...
Synchronizing state of telegraf.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable telegraf

chris at cortex in ~
› sudo -u telegraf telegraf --config /etc/telegraf/telegraf.conf  --test | grep smart
2020-10-25T21:39:12Z I! Starting Telegraf 1.16.0

chris at cortex in ~
› sudo dpkg -i ~/downloads/telegraf_1.15.3-1_amd64.deb
dpkg: warning: downgrading telegraf from 1.16.0-1 to 1.15.3-1
(Reading database ... 113471 files and directories currently installed.)
Preparing to unpack .../telegraf_1.15.3-1_amd64.deb ...
Unpacking telegraf (1.15.3-1) over (1.16.0-1) ...
Setting up telegraf (1.15.3-1) ...
Installing new version of config file /etc/telegraf/telegraf.conf.sample ...
Synchronizing state of telegraf.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable telegraf

chris at cortex in ~
› sudo -u telegraf telegraf --config /etc/telegraf/telegraf.conf  --test | grep smart
2020-10-25T21:39:48Z I! Starting Telegraf 1.15.3
> smart_device,capacity=525112713216,device=ata-Crucial_CT525MX300SSD1_1651150FA577,enabled=Enabled,host=cortex,model=Crucial_CT525MX300SSD1,serial_no=1651150FA577,wwn=500a0751150fa577 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661989000000000
> smart_device,capacity=525112713216,device=ata-Crucial_CT525MX300SSD1_16431465A85A,enabled=Enabled,host=cortex,model=Crucial_CT525MX300SSD1,serial_no=16431465A85A,wwn=500a07511465a85a exit_status=0i,health_ok=true,read_error_rate=2i,temp_c=37i,udma_crc_errors=0i 1603661989000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4E4FKJH1X,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4E4FKJH1X,wwn=50014ee20a70d5a0 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=29i,udma_crc_errors=0i 1603661989000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EM0WN624,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4EM0WN624,wwn=50014ee2b51b9d7f exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=30i,udma_crc_errors=0i 1603661989000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EK8ZSK37,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4EK8ZSK37,wwn=50014ee2b51c8ebd exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=30i,udma_crc_errors=0i 1603661989000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EECRN58H,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4EECRN58H,wwn=50014ee20a98bd99 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=33i,udma_crc_errors=0i 1603661989000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4E4FKJ5DV,enabled=Enabled,host=cortex,model=WDC\ WD40EFRX-68WT0N0,serial_no=WD-WCC4E4FKJ5DV,wwn=50014ee25fc65114 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=29i,udma_crc_errors=0i 1603661989000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEK49SBS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEK49SBS,wwn=5000cca250ec3c9c exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661990000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEJLL6NS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEJLL6NS,wwn=5000cca250e4a210 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=34i,udma_crc_errors=0i 1603661990000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEKDXVTS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEKDXVTS,wwn=5000cca250f02751 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=34i,udma_crc_errors=0i 1603661990000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK2334PEJM9B3T,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK2334PEJM9B3T,wwn=5000cca250e4f530 exit_status=0i,health_ok=true,read_error_rate=2i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661990000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK2334PEK4AXTT,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK2334PEK4AXTT,wwn=5000cca250ec4105 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661990000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEKDNZ0S,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEKDNZ0S,wwn=5000cca250f009ad exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661990000000000
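
The `--test` output above is InfluxDB line protocol, so the devices that report a nonzero `exit_status` can be picked out mechanically. The following is a minimal sketch, not part of Telegraf: `parse_smart_line` is a hypothetical helper, and it assumes tag values contain no escaped commas (true for the lines in this report, which only escape spaces).

```python
import re

def parse_smart_line(line):
    """Split one smart_device line into (tags, fields); tolerates backslash-escaped spaces."""
    line = line.lstrip("> ").strip()
    # Split on spaces that are NOT preceded by a backslash: series, fields, timestamp.
    series, fields, _timestamp = re.split(r"(?<!\\) ", line)
    # The first comma-separated item of the series is the measurement name; the rest are tags.
    tags = dict(kv.split("=", 1) for kv in series.split(",")[1:])
    values = dict(kv.split("=", 1) for kv in fields.split(","))
    return tags, values

# Two lines copied from the 1.15.3 output in this report.
sample = [
    r"> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EM0WN624,host=cortex exit_status=2i 1603661539000000000",
    r"> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEJLL6NS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEJLL6NS,wwn=5000cca250e4a210 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=33i,udma_crc_errors=0i 1603661540000000000",
]

# Devices whose smartctl run did not exit with status 0.
nonzero = [
    tags["device"]
    for tags, values in map(parse_smart_line, sample)
    if values["exit_status"] != "0i"
]
print(nonzero)
```

Field values keep their line-protocol suffix (for example `33i`), so strip the trailing `i` before converting to an integer.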
@chrishoage chrishoage added the bug unexpected problem or unintended behavior label Oct 25, 2020
@p-zak
Collaborator

p-zak commented Oct 26, 2020

@chrishoage

Missing NVMe attributes were added to the smart plugin in #8113 (together with optional usage of nvme-cli).

I suspect I know what caused your problem, but let's clarify a few things:

I initially saw this error. After installing nvme-cli the error went away, but the SMART input would still not output anything

[inputs.smart] nvme not found: verify that nvme is installed and it is in your PATH (or specified in config) to gather vendor specific attributes: provided path does not exist: []
› sudo -u telegraf telegraf --config /etc/telegraf/telegraf.conf  --test | grep smart
2020-10-25T21:32:18Z I! Starting Telegraf 1.15.3
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EM0WN624,host=cortex exit_status=2i 1603661539000000000
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EECRN58H,host=cortex exit_status=2i 1603661539000000000
> smart_device,capacity=525112713216,device=ata-Crucial_CT525MX300SSD1_16431465A85A,enabled=Enabled,host=cortex,model=Crucial_CT525MX300SSD1,serial_no=16431465A85A,wwn=500a07511465a85a exit_status=0i,health_ok=true,read_error_rate=2i,temp_c=37i,udma_crc_errors=0i 1603661539000000000
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4E4FKJH1X,host=cortex exit_status=2i 1603661539000000000
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4E4FKJ5DV,host=cortex exit_status=2i 1603661539000000000
> smart_device,capacity=525112713216,device=ata-Crucial_CT525MX300SSD1_1651150FA577,enabled=Enabled,host=cortex,model=Crucial_CT525MX300SSD1,serial_no=1651150FA577,wwn=500a0751150fa577 exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661539000000000
> smart_device,device=scsi-SATA_WDC_WD40EFRX-68W_WD-WCC4EK8ZSK37,host=cortex exit_status=2i 1603661539000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEJLL6NS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEJLL6NS,wwn=5000cca250e4a210 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=33i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK2334PEJM9B3T,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK2334PEJM9B3T,wwn=5000cca250e4f530 exit_status=0i,health_ok=true,read_error_rate=2i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEKDXVTS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEKDXVTS,wwn=5000cca250f02751 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=33i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK2334PEK4AXTT,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK2334PEK4AXTT,wwn=5000cca250ec4105 exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEKDNZ0S,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEKDNZ0S,wwn=5000cca250f009ad exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=37i,udma_crc_errors=0i 1603661540000000000
> smart_device,capacity=4000787030016,device=scsi-SATA_HGST_HDN724040AL_PK1334PEK49SBS,enabled=Enabled,host=cortex,model=HGST\ HDN724040ALE640,serial_no=PK1334PEK49SBS,wwn=5000cca250ec3c9c exit_status=0i,health_ok=true,read_error_rate=0i,seek_error_rate=0i,temp_c=36i,udma_crc_errors=0i 1603661540000000000
  1. nvme-cli is an optional tool (only for gathering additional metrics from NVMe devices), and as far as I can see it is not needed for your drives. The message is logged at warning level, not error level.
  2. You wrote "but the SMART input would still not output anything" about Telegraf 1.16.0, yet right after that I see a log from Telegraf 1.15.3. Can you explain that inconsistency?

Moreover, can you run smartctl --scan and provide output from this command?

@chrishoage
Author

@p-zak Thank you for your reply!

nvme-cli is an optional tool (only for gathering additional metrics from NVMe devices), and as far as I can see it is not needed for your drives. The message is logged at warning level, not error level.

Got it! I was just trying anything I could think of to get it to work. Finally I tried downgrading which worked.

You wrote "but the SMART input would still not output anything" about Telegraf 1.16.0, yet right after that I see a log from Telegraf 1.15.3. Can you explain that inconsistency?

No inconsistency; that log is there to show that it worked with 1.15.3. If you look at the very last code block in the issue body, you can see I upgraded, ran the failing test (which shows the 1.16.0 version), then downgraded with dpkg to 1.15.3 and ran the test again, which worked. All with no configuration change.

Moreover, can you run smartctl --scan and provide output from this command?

chris at cortex in ~
› sudo -u telegraf bash
telegraf@cortex:/home/chris$ sudo smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device

I will note, however, that my configuration specifies the devices array with by-id paths, since the letter identifiers are not stable. It is my understanding from the documentation that when devices is specified, smartctl --scan is not used.

@p-zak
Collaborator

p-zak commented Oct 26, 2020

@chrishoage

No inconsistency; that log is there to show that it worked with 1.15.3. If you look at the very last code block in the issue body, you can see I upgraded, ran the failing test (which shows the 1.16.0 version), then downgraded with dpkg to 1.15.3 and ran the test again, which worked. All with no configuration change.

I thought so; I just wanted to have everything clear :)

We know what the problem is (the _ character is accidentally treated as invalid in a device path). We just need some time to prepare a proper fix and upstream it.
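
To illustrate the kind of bug described here, the following is a minimal sketch of a character whitelist that forgets `_`. Both patterns and the `accepted` helper are hypothetical and are not Telegraf's actual validation code; they only show why every by-id path in the reported config would be dropped while a plain `/dev/sda -d scsi` entry passes.

```python
import re

# Hypothetical whitelist that omits "_" (illustrative only, not Telegraf's pattern).
STRICT = re.compile(r"[A-Za-z0-9/\-., ]+")
# The same whitelist with "_" allowed.
RELAXED = re.compile(r"[A-Za-z0-9/\-., _]+")

def accepted(devices, pattern):
    """Return the device entries whose every character is allowed by the pattern."""
    return [d for d in devices if pattern.fullmatch(d)]

devices = [
    "/dev/sda -d scsi",
    "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK1334PEJLL6NS",
]

print(accepted(devices, STRICT))   # only the entry without "_" survives
print(accepted(devices, RELAXED))  # both entries survive
```

A whitelist like this fails silently: rejected entries are simply skipped, which matches the symptom of `--test` printing no smart metrics at all.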

Until then, I see these options:

  • Use Telegraf 1.15.3
  • Put the real paths to your devices in the devices parameter (may not be an option if your letter identifiers are not stable):
devices = [
"/dev/sda -d scsi",
"/dev/sdb -d scsi",
"/dev/sdc -d scsi",
"/dev/sdd -d scsi",
"/dev/sde -d scsi",
"/dev/sdf -d scsi",
"/dev/sdg -d scsi",
"/dev/sdh -d scsi",
"/dev/sdi -d scsi",
"/dev/sdj -d scsi",
"/dev/sdk -d scsi",
"/dev/sdl -d scsi",
"/dev/sdm -d scsi"
]
  • Remove devices from configuration to gather metrics from every device and use excludes to exclude unwanted devices (may not be an option if your letter identifiers are not stable).
  • Use a different by-xxx directory under /dev/disk/ if possible
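
For the second workaround, the real device nodes can be derived from the stable by-id symlinks rather than typed by hand. This is a minimal sketch under stated assumptions: `resolve_devices` is a hypothetical helper, and the default `-d scsi` suffix simply mirrors the `smartctl --scan` output shown earlier in this thread.

```python
import os

def resolve_devices(by_id_paths, device_type="scsi"):
    """Map by-id symlinks to the device nodes they currently point at."""
    entries = []
    for path in by_id_paths:
        real = os.path.realpath(path)  # follows the symlink, e.g. to /dev/sda
        entries.append(f"{real} -d {device_type}")
    return entries

if __name__ == "__main__":
    by_id = [
        "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK1334PEJLL6NS",
        "/dev/disk/by-id/scsi-SATA_HGST_HDN724040AL_PK1334PEK49SBS",
    ]
    # Print entries ready to paste into the devices array.
    for entry in resolve_devices(by_id):
        print(f'"{entry}",')
```

Because the letter identifiers are not stable, the generated list is only valid until the next reboot or device re-enumeration and would need to be regenerated after that.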

@chrishoage
Author

@p-zak Thank you very much for your help in this issue!

Since there are no security fixes (that I could see from the release notes anyway) in 1.16.0 I think I will stay on 1.15.3 until a patch can be released.

I am more than happy to run a dev build / RC on my host to verify the fix when the time comes! Please reply here with the build to use and I will test 🙂

If I do encounter a need to upgrade before the patch can be released I will use one of the workarounds you suggested.

Again, I appreciate your swift help in finding the issue!

@Feliksas
Contributor

Feliksas commented Nov 4, 2020

This is also an issue when '+' is present in the device string (such lines seem to be ignored by Telegraf altogether):

  devices = [
    "/dev/disk/by-id/wwn-0x6782bcb0444512002720be7b182ead72 -d sat+megaraid,0",
    "/dev/disk/by-id/wwn-0x6782bcb0444512002720be7b182ead72 -d sat+megaraid,1",
    "/dev/disk/by-id/wwn-0x6782bcb0444512002720be7b182ead72 -d sat+megaraid,2",
    "/dev/disk/by-id/wwn-0x6782bcb0444512002720be7b182ead72 -d sat+megaraid,3",
    "/dev/disk/by-id/wwn-0x6782bcb0444512002720bec51c8f901d -d sat+megaraid,4",
    "/dev/disk/by-id/wwn-0x6782bcb0444512002720bee01e293bad -d sat+megaraid,5"
  ]
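
Entries like these carry both a device path and extra smartctl arguments, so characters such as `+` and `,` have to survive whatever parsing is applied. As a minimal sketch, assuming the entry is split on whitespace and appended to the same smartctl flags used in the manual run earlier in this thread (`smartctl_argv` is a hypothetical helper, not Telegraf's implementation):

```python
import shlex

def smartctl_argv(entry, use_sudo=True):
    """Build a smartctl command line from one devices entry."""
    # Split "path -d type" into separate arguments; "+" and "," are ordinary characters here.
    parts = shlex.split(entry)
    argv = [
        "smartctl", "--info", "--attributes", "--health",
        "-n", "standby", "--format=brief",
    ] + parts
    return ["sudo"] + argv if use_sudo else argv

entry = "/dev/disk/by-id/wwn-0x6782bcb0444512002720be7b182ead72 -d sat+megaraid,0"
print(smartctl_argv(entry))
```

Passing the arguments as a list, rather than as one shell string, means no character in the device string needs shell quoting.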

@zak-pawel
Collaborator

@chrishoage @Feliksas This should work now. Can you check using a nightly build (https://github.com/influxdata/telegraf#nightly-builds)?

@chrishoage
Author

I can confirm the nightly fixes the issue.

4 participants