smart: Gather S.M.A.R.T. information from storage devices #2449

Merged 18 commits on Oct 4, 2017
1 change: 1 addition & 0 deletions plugins/inputs/all/all.go
@@ -73,6 +73,7 @@ import (
_ "github.com/influxdata/telegraf/plugins/inputs/rethinkdb"
_ "github.com/influxdata/telegraf/plugins/inputs/riak"
_ "github.com/influxdata/telegraf/plugins/inputs/sensors"
_ "github.com/influxdata/telegraf/plugins/inputs/smart"
_ "github.com/influxdata/telegraf/plugins/inputs/snmp"
_ "github.com/influxdata/telegraf/plugins/inputs/snmp_legacy"
_ "github.com/influxdata/telegraf/plugins/inputs/socket_listener"
135 changes: 135 additions & 0 deletions plugins/inputs/smart/README.md
@@ -0,0 +1,135 @@
# Telegraf S.M.A.R.T. plugin

Get metrics using the command line utility `smartctl` for S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) storage devices. SMART is a monitoring system included in computer hard disk drives (HDDs) and solid-state drives (SSDs) that detects and reports on various indicators of drive reliability, with the intent of enabling the anticipation of hardware failures.
See smartmontools (https://www.smartmontools.org/).

If no devices are specified, the plugin will scan for SMART devices via the following command:

```
smartctl --scan
```

Metrics will be reported from the following `smartctl` command:

```
smartctl --info --attributes --health -n <nocheck> --format=brief <device>
Contributor

This is actually not generating any output for the attribute checker on RHEL7 using smartctl 6.2. See below...

[telegraf@carf-metrics-influx02 ~]$ ./tester --info --attributes --health --format=brief /dev/sdcp
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.6.1.el7.jump1.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              PX02SMF040
Revision:             A3B3
User Capacity:        400,088,457,216 bytes [400 GB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x500003965c89f4a0
Serial number:        65J0A025T0QB
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Apr  6 17:36:40 2017 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

SS Media used endurance indicator: 0%
Current Drive Temperature:     38 C
Drive Trip Temperature:        60 C

Manufactured in week 25 of year 2015
Elements in grown defect list: 0

[telegraf@carf-metrics-influx02 ~]$ ./tester --info -x --attributes --health --format=brief /dev/sdcp
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.6.1.el7.jump1.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               TOSHIBA
Product:              PX02SMF040
Revision:             A3B3
User Capacity:        400,088,457,216 bytes [400 GB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate:        Solid State Device
Form Factor:          2.5 inches
Logical Unit id:      0x500003965c89f4a0
Serial number:        65J0A025T0QB
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Apr  6 17:37:21 2017 CDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

SS Media used endurance indicator: 0%
Current Drive Temperature:     38 C
Drive Trip Temperature:        60 C

Manufactured in week 25 of year 2015
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      33119.546           0
write:         0        0         0         0          0      74858.031           0
verify:        0        0         0         0          0          2.622           0

Non-medium error count:       32

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  64       3                 - [-   -    -]
# 2  Background long   Completed                  64       2                 - [-   -    -]
# 3  Background short  Completed                  64       2                 - [-   -    -]
Long (extended) Self Test duration: 1800 seconds [30.0 minutes]

Device does not support Background scan results logging
Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 4
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: reserved [11]
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x500003965c89f4a2
    attached SAS address = 0x5f8db882fbf5737f
    attached phy identifier = 6
    Invalid DWORD count = 4
    Running disparity error count = 3
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
relative target port id = 2
  generation code = 4
  number of phys = 1
  phy identifier = 1
    attached device type: expander device
    attached reason: SMP phy control function
    reason: loss of dword synchronization
    negotiated logical link rate: reserved [11]
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x500003965c89f4a3
    attached SAS address = 0x5f8db882fbf573ff
    attached phy identifier = 6
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
Only support protocol specific log page on SAS devices

Contributor

This is probably the most important thing to figure out. Did the format change or does this drive just not have any attributes?

```
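
As a rough illustration of how the command line above fits together with the `nocheck` and `devices` settings, the sketch below builds the argument list in Go. `smartctlArgs` is a hypothetical helper written for this example and is not taken from the plugin source; it assumes that a `devices` entry such as `/dev/ada0 -d atacam` is split on whitespace before being appended.

```go
package main

import (
	"fmt"
	"strings"
)

// smartctlArgs is a hypothetical helper (not the plugin's code) that
// assembles the arguments for the smartctl invocation shown above.
func smartctlArgs(nocheck, device string) []string {
	args := []string{"--info", "--attributes", "--health", "-n", nocheck, "--format=brief"}
	// Assumption: a devices entry may carry a device type,
	// e.g. "/dev/ada0 -d atacam", so it is split into separate arguments.
	return append(args, strings.Fields(device)...)
}

func main() {
	fmt.Println("smartctl " + strings.Join(smartctlArgs("standby", "/dev/ada0 -d atacam"), " "))
}
```

Passing the arguments as a slice rather than one string avoids shell quoting issues when the command is executed.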

This plugin supports _smartmontools_ version 5.41 and above, but versions 5.41 and 5.42
might require setting `nocheck`; see the comment in the sample configuration.

To enable SMART on a storage device run:

```
smartctl -s on <device>
```

## Measurements

- smart_device:

* Tags:
- `capacity`
- `device`
- `device_model`
Contributor

I think just model, since the measurement name is smart_device.

- `enabled`
- `health`
- `serial_no`
- `wwn`
* Fields:
- `exit_status`
- `health_ok`
- `read_error_rate`
- `seek_error`
- `temp_c`
- `udma_crc_errors`

- smart_attribute:

* Tags:
- `device`
- `fail`
- `flags`
- `id`
- `name`
- `serial_no`
- `wwn`
* Fields:
- `exit_status`
- `raw_value`
- `threshold`
- `value`
- `worst`

### Flags

The interpretation of the tag `flags` is:
- *K* auto-keep
- *C* event count
- *R* error rate
- *S* speed/performance
- *O* updated online
- *P* prefailure warning
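
A consumer of the `flags` tag might expand it into readable labels. The following is a minimal sketch, assuming the tag is the six-character string from the brief `smartctl` format (for example `-O-RC-`), where `-` means the flag is not set. `decodeFlags` and `flagMeanings` are hypothetical names used only for this example.

```go
package main

import "fmt"

// flagMeanings maps each flag letter to the description listed above.
var flagMeanings = map[rune]string{
	'K': "auto-keep",
	'C': "event count",
	'R': "error rate",
	'S': "speed/performance",
	'O': "updated online",
	'P': "prefailure warning",
}

// decodeFlags returns the meaning of every flag letter present in flags,
// in the order the letters appear. Unknown characters such as '-' are skipped.
func decodeFlags(flags string) []string {
	var meanings []string
	for _, c := range flags {
		if meaning, ok := flagMeanings[c]; ok {
			meanings = append(meanings, meaning)
		}
	}
	return meanings
}

func main() {
	for _, meaning := range decodeFlags("-O-RC-") {
		fmt.Println(meaning)
	}
}
```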

### Exit Status

The `exit_status` field captures the exit status of the smartctl command which
is defined by a bitmask. For the interpretation of the bitmask see the man page for
smartctl.
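
To show how the bitmask can be read, here is a small sketch that expands an `exit_status` value into the conditions it encodes. The bit descriptions are paraphrased from the smartctl man page, which remains the authoritative reference; `decodeExitStatus` and `exitStatusBits` are hypothetical names for this example and are not part of the plugin.

```go
package main

import "fmt"

// exitStatusBits holds a short description for bits 0 to 7 of the
// smartctl exit status, paraphrased from the smartctl man page.
var exitStatusBits = []string{
	"command line did not parse",
	"device open failed, or device is in a low-power mode",
	"a SMART or other ATA command failed, or a checksum error was found",
	"SMART status check returned DISK FAILING",
	"prefail attributes at or below threshold",
	"attributes were at or below threshold at some time in the past",
	"device error log contains error records",
	"device self-test log contains error records",
}

// decodeExitStatus returns the description of every bit set in status.
func decodeExitStatus(status int) []string {
	var set []string
	for bit, desc := range exitStatusBits {
		if status&(1<<uint(bit)) != 0 {
			set = append(set, desc)
		}
	}
	return set
}

func main() {
	// 68 = 4 + 64, i.e. bits 2 and 6 are set.
	for _, desc := range decodeExitStatus(68) {
		fmt.Println(desc)
	}
}
```

An `exit_status` of 1 (bit 0 only) therefore points at a problem with the arguments passed to `smartctl` rather than with the drive itself.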

### Device Names

Device names, e.g., `/dev/sda`, are *not persistent*, and may be
subject to change across reboots or system changes. Instead, you can use the
*World Wide Name* (WWN) or serial number to identify devices. On Linux, block
devices can be referenced by the WWN in the following location:
`/dev/disk/by-id/`.
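
For illustration, the sketch below lists the WWN-based links in that directory and the device nodes they currently point to. It assumes the common udev naming scheme on Linux, where the links are named `wwn-0x...` (with `-partN` suffixes for partitions); `resolveByID` is a hypothetical helper and not part of the plugin.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveByID maps each wwn-* symlink in dir to the device node it
// currently points to. Links that cannot be resolved are skipped.
func resolveByID(dir string) (map[string]string, error) {
	links, err := filepath.Glob(filepath.Join(dir, "wwn-*"))
	if err != nil {
		return nil, err
	}
	resolved := make(map[string]string, len(links))
	for _, link := range links {
		target, err := filepath.EvalSymlinks(link)
		if err != nil {
			continue // dangling or unreadable link
		}
		resolved[link] = target
	}
	return resolved, nil
}

func main() {
	devices, err := resolveByID("/dev/disk/by-id")
	if err != nil {
		fmt.Println("glob failed:", err)
		return
	}
	for link, target := range devices {
		fmt.Printf("%s -> %s\n", link, target)
	}
}
```

The persistent link path, rather than the resolved `/dev/sdX` name, is the value that stays stable across reboots.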

## Configuration

```toml
# Read metrics from storage devices supporting S.M.A.R.T.
[[inputs.smart]]
## Optionally specify the path to the smartctl executable
# path = "/usr/bin/smartctl"
#
## On most platforms smartctl requires root access.
## Setting 'use_sudo' to true will make use of sudo to run smartctl.
## Sudo must be configured to allow the telegraf user to run smartctl
## without a password.
# use_sudo = false
#
## Skip checking disks in this power mode. Defaults to
## "standby" to not wake up disks that have stopped rotating.
## See --nocheck in the man pages for smartctl.
## smartctl versions 5.41 and 5.42 have faulty detection of
## power mode and might require changing this value to
## "never" depending on your storage device.
# nocheck = "standby"
#
## Gather detailed metrics for each SMART Attribute.
## Defaults to "false"
##
# attributes = false
#
## Optionally specify devices to exclude from reporting.
# excludes = [ "/dev/pass6" ]
#
## Optionally specify devices and device type. If unset,
## a scan (smartctl --scan) for S.M.A.R.T. devices will be
## done and all devices found will be included, except for
## those listed in excludes.
# devices = [ "/dev/ada0 -d atacam" ]
```

To run `smartctl` with `sudo` create a wrapper script and use `path` in
Contributor

Can't we make this into a config file bool? Here's my wrapper so far, yields exit_status = 1...

[telegraf@carf-metrics-influx02 ~]$ cat tester
#!/bin/bash

sudo /usr/sbin/smartctl $1

Contributor Author

Your wrapper has to pass all arguments, so:

#!/usr/bin/env bash

sudo /usr/sbin/smartctl "$@"

Exit code 1 means that command-line parsing failed for smartctl.

> Can't we make this into a config file bool?

If the maintainers would like that, I can add it, but IMHO it's unnecessary when you have `path`.

Contributor

I think it would be a nice touch to have sudo support, it's just a little more convenient.

You can add a `use_sudo` field like we did in [fail2ban](https://github.com/influxdata/telegraf/blob/ca9cec2c84e7c8796c2e8a747d17d1ad86ce1ae6/plugins/inputs/fail2ban/README.md#configuration), or it might be more readable and extensible to have something like Ansible: `become_method = "sudo"`

Contributor Author

Added use_sudo

the configuration to execute that.

## Output

Example output from an _Apple SSD_:
```
> smart_attribute,serial_no=S1K5NYCD964433,wwn=5002538655584d30,id=199,name=UDMA_CRC_Error_Count,flags=-O-RC-,fail=-,host=mbpro.local,device=/dev/rdisk0 threshold=0i,raw_value=0i,exit_status=0i,value=200i,worst=200i 1502536854000000000
> smart_attribute,device=/dev/rdisk0,serial_no=S1K5NYCD964433,wwn=5002538655584d30,id=240,name=Unknown_SSD_Attribute,flags=-O---K,fail=-,host=mbpro.local exit_status=0i,value=100i,worst=100i,threshold=0i,raw_value=0i 1502536854000000000
> smart_device,enabled=Enabled,host=mbpro.local,device=/dev/rdisk0,model=APPLE\ SSD\ SM0512F,serial_no=S1K5NYCD964433,wwn=5002538655584d30,capacity=500277790720 udma_crc_errors=0i,exit_status=0i,health_ok=true,read_error_rate=0i,temp_c=40i 1502536854000000000
```