smart: Gather S.M.A.R.T. information from storage devices #2449
TODO:
|
I have updated this with more documentation, concurrent metrics gathering, and a different metrics structure, and verified it against versions 5.41, 5.42, 5.43, 6.0, 6.1, 6.2, 6.3, 6.4, and 6.5. @sebito91 It would be awesome if you could help test this by measuring the performance on your 96-disk system and verifying that it also covers your use case.
Alternative to #2319
Have not had a chance to look at this yet, but happy to merge the two threads into one. Will test out on the mega machine tomorrow morning (EST).
this is the one thing missing from telegraf that would make my life complete.
@evanrich It would be very valuable if you could test this PR and provide feedback.
@rickard-von-essen I won't be able to pull and test till sometime next week, on vacation this weekend, but I'll give it a go as soon as I can.
Looking forward to this addition. Thanks for the work, @rickard-von-essen. I'll do some testing of my own on this PR.
I'm happy to drop my commit in favor of this one given this is more flexible and handles a variety of OS setups. That being said, I'm still not seeing any of the attribute information on these hosts...
Running with the sudo wrapper you mention in the help text (which should be an option in the config IMHO), we get the output listed below with exit_status = 1. When running as root, we get exit_status = 0 but no smart_attribute information whatsoever.
[telegraf@carf-metrics-influx02 ~]$ telegraf --test --config /etc/telegraf/telegraf.conf --input-filter smart
* Plugin: inputs.smart, Collection 1
> smart_device,dc=carf,host=carf-metrics-influx02,bu=linux,device=/dev/sdh,env=production,cls=server,trd=false,sr=metrics exit_status=1i 1491517478000000000
> smart_device,device=/dev/sdbu,bu=linux,env=production,cls=server,trd=false,sr=metrics,dc=carf,host=carf-metrics-influx02 exit_status=1i 1491517478000000000
> smart_device,dc=carf,device=/dev/sdl,host=carf-metrics-influx02,bu=linux,env=production,cls=server,trd=false,sr=metrics exit_status=1i 1491517478000000000
... (repeats ~90 more times)
# devices = [ "/dev/ada0 -d atacam" ]
```

To run `smartctl` with `sudo` create a wrapper script and use `path` in
Can't we make this into a config file bool? Here's my wrapper so far, yields exit_status = 1
...
[telegraf@carf-metrics-influx02 ~]$ cat tester
#!/bin/bash
sudo /usr/sbin/smartctl $1
Your wrapper has to pass all arguments, so:
#!/usr/bin/env bash
sudo /usr/sbin/smartctl "$@"
Exit code 1 means that command-line parsing failed for smartctl.
> Can't we make this into a config file bool?
If the maintainers would like to have that I can add it, but IMHO it's unnecessary when you have path.
Metrics will be reported from the following `smartctl` command:

```
smartctl --info --attributes --health -n <nocheck> --format=brief <device>
This is actually not generating any output for the attribute checker on RHEL7 using smartctl 6.2. See below...
[telegraf@carf-metrics-influx02 ~]$ ./tester --info --attributes --health --format=brief /dev/sdcp
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.6.1.el7.jump1.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: TOSHIBA
Product: PX02SMF040
Revision: A3B3
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x500003965c89f4a0
Serial number: 65J0A025T0QB
Device type: disk
Transport protocol: SAS
Local Time is: Thu Apr 6 17:36:40 2017 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
SS Media used endurance indicator: 0%
Current Drive Temperature: 38 C
Drive Trip Temperature: 60 C
Manufactured in week 25 of year 2015
Elements in grown defect list: 0
[telegraf@carf-metrics-influx02 ~]$ ./tester --info -x --attributes --health --format=brief /dev/sdcp
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.6.1.el7.jump1.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: TOSHIBA
Product: PX02SMF040
Revision: A3B3
User Capacity: 400,088,457,216 bytes [400 GB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=1
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Logical Unit id: 0x500003965c89f4a0
Serial number: 65J0A025T0QB
Device type: disk
Transport protocol: SAS
Local Time is: Thu Apr 6 17:37:21 2017 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
Read Cache is: Enabled
Writeback Cache is: Disabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
SS Media used endurance indicator: 0%
Current Drive Temperature: 38 C
Drive Trip Temperature: 60 C
Manufactured in week 25 of year 2015
Elements in grown defect list: 0
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0        0         0         0          0      33119.546           0
write:         0        0         0         0          0      74858.031           0
verify:        0        0         0         0          0          2.622           0
Non-medium error count: 32
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   64         3              - [-   -    -]
# 2  Background long   Completed                   64         2              - [-   -    -]
# 3  Background short  Completed                   64         2              - [-   -    -]
Long (extended) Self Test duration: 1800 seconds [30.0 minutes]
Device does not support Background scan results logging
Protocol Specific port log page for SAS SSP
relative target port id = 1
generation code = 4
number of phys = 1
phy identifier = 0
attached device type: expander device
attached reason: SMP phy control function
reason: unknown
negotiated logical link rate: reserved [11]
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x500003965c89f4a2
attached SAS address = 0x5f8db882fbf5737f
attached phy identifier = 6
Invalid DWORD count = 4
Running disparity error count = 3
Loss of DWORD synchronization = 0
Phy reset problem = 0
relative target port id = 2
generation code = 4
number of phys = 1
phy identifier = 1
attached device type: expander device
attached reason: SMP phy control function
reason: loss of dword synchronization
negotiated logical link rate: reserved [11]
attached initiator port: ssp=0 stp=0 smp=0
attached target port: ssp=0 stp=0 smp=1
SAS address = 0x500003965c89f4a3
attached SAS address = 0x5f8db882fbf573ff
attached phy identifier = 6
Invalid DWORD count = 0
Running disparity error count = 0
Loss of DWORD synchronization = 0
Phy reset problem = 0
Only support protocol specific log page on SAS devices
This is probably the most important thing to figure out. Did the format change or does this drive just not have any attributes?
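To make the question concrete: the SAS output pasted above contains no attribute table, only `Key: value` lines such as `Current Drive Temperature: 38 C`. A minimal sketch of how such lines could be collected is shown below. `parseKeyValues` is a hypothetical helper written for this discussion, not the parser in this PR, and it assumes every interesting line has the form `Key: value`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseKeyValues collects "Key: value" lines from smartctl output into a map.
// Hypothetical helper for illustration; not the plugin's actual parser.
// Repeated keys (e.g. "SMART support is") overwrite earlier values.
func parseKeyValues(output string) map[string]string {
	fields := make(map[string]string)
	scanner := bufio.NewScanner(strings.NewReader(output))
	for scanner.Scan() {
		// Split on the first colon only, so values containing colons
		// (such as "Local Time is: ... 17:36:40 ...") stay intact.
		key, value, found := strings.Cut(scanner.Text(), ":")
		if !found {
			continue // section banners and table rows without a colon
		}
		fields[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return fields
}

func main() {
	// Sample lines copied from the SAS drive output in this thread.
	sample := "SMART Health Status: OK\n" +
		"Current Drive Temperature:     38 C\n" +
		"Drive Trip Temperature:        60 C\n" +
		"Elements in grown defect list: 0\n"
	fields := parseKeyValues(sample)
	fmt.Println(fields["Current Drive Temperature"])
}
```

Whether the plugin should emit these values as `smart_attribute` points or as fields on `smart_device` is a separate design decision that this sketch does not settle.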
I can't quite comment on the timing to run this engine as the attributes are not parsing. For now, the simple |
I did update to $@, should have changed that in the review.
Unless I pass in -x we don't get anything in terms of attributes. You tested as far back as 6.2, right?
On Thu, Apr 6, 2017 at 8:36 PM Rickard von Essen ***@***.***> wrote:
> @sebito91 Running `<wrapper> --info --attributes --health --format=brief <device>` should output something including the attributes (example here <https://github.com/rickard-von-essen/telegraf/blob/a788a253e4ee9c538101f2f2839f748324b285a7/plugins/inputs/smart/smart_test.go#L41-L61>).
Sebastian Borza
PGP: EDC2 BF61 4B91 14F2 AAB4 06C9 3744 7F3F E411 0D3E
|
@sebito91 Yes I tested all of 5.41, 5.42, 5.43, 6.[0-6]. |
@sebito91 What does |
|
After merging the branch @rickard-von-essen created into v1.3.1 and building from source, this is what I receive from a test run on one disk. Is this the expected results?
|
@stemwinder Looks correct to me. |
Any idea on when this will get merged? I'd love me some HDD temp stats on my NAS :-) |
Also see #2319 (review)
Makefile
Outdated
@@ -24,7 +25,7 @@ build-windows:
 	./cmd/telegraf/telegraf.go

 build-for-docker:
-	CGO_ENABLED=0 GOOS=linux go build -installsuffix cgo -o telegraf -ldflags \
+	CGO_ENABLED=0 GOOS=$(GOOS) go build -installsuffix cgo -o telegraf -ldflags \
Remove this code from the Makefile for this pull request
* Tags:
- `capacity`
- `device`
- `device_model`
I think just `model`, since the measurement name is `smart_device`.
plugins/inputs/smart/README.md
Outdated
- `id`
- `name`
* Fields:
- `exit_status`
Would rather leave exit status to the internal plugin and the logging output.
I kept this and added some info in the README. It consists of a bit pattern that can be useful to find drives that are in some way failing or starting to fail.
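For readers who want to interpret the bit pattern, here is a minimal sketch of decoding it. The bit meanings are paraphrased from the smartctl man page; `decodeExitStatus` and `exitStatusBits` are hypothetical names used only for this illustration and are not part of the plugin.

```go
package main

import "fmt"

// exitStatusBits lists the meaning of each bit in smartctl's exit status,
// paraphrased from the smartctl man page (index 0 is the least significant bit).
var exitStatusBits = []string{
	"command line did not parse",
	"device open failed or device is in a low-power mode",
	"a SMART or other ATA command failed, or a checksum error occurred",
	"SMART status check returned DISK FAILING",
	"prefail attributes at or below threshold",
	"attributes were at or below threshold at some time in the past",
	"device error log contains error records",
	"device self-test log contains error records",
}

// decodeExitStatus returns the descriptions of all bits set in status.
// Hypothetical helper for illustration only.
func decodeExitStatus(status int) []string {
	var reasons []string
	for bit, description := range exitStatusBits {
		if status&(1<<uint(bit)) != 0 {
			reasons = append(reasons, description)
		}
	}
	return reasons
}

func main() {
	// exit_status=1, as seen in the test run earlier in this thread.
	for _, reason := range decodeExitStatus(1) {
		fmt.Println(reason)
	}
}
```

Keeping the raw integer as a field and decoding it at query or alerting time avoids adding eight boolean fields per device.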
# devices = [ "/dev/ada0 -d atacam" ]
```

To run `smartctl` with `sudo` create a wrapper script and use `path` in
I think it would be a nice touch to have sudo support, it's just a little more convenient.
You can add a use_sudo field like we did in [fail2ban](https://github.com/influxdata/telegraf/blob/ca9cec2c84e7c8796c2e8a747d17d1ad86ce1ae6/plugins/inputs/fail2ban/README.md#configuration), or it might be more readable and extensible to have something like Ansible: become_method = "sudo"
Added use_sudo
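A rough sketch of what a `use_sudo` switch could look like when assembling the command line is shown below. It follows the `smartctl` invocation documented in the README excerpt and splits device entries such as `"/dev/ada0 -d atacam"` on whitespace. `buildCommand` is a hypothetical helper, not necessarily how the PR implements it, and passing `-n` to `sudo` (fail rather than prompt for a password) is my own assumption about what is sensible for a daemon.

```go
package main

import (
	"fmt"
	"strings"
)

// buildCommand returns the executable and argument list for one device.
// useSudo mirrors the use_sudo option discussed above; the function itself
// is a hypothetical helper, not the plugin's code.
func buildCommand(path string, useSudo bool, nocheck, device string) (string, []string) {
	args := []string{"--info", "--attributes", "--health", "-n", nocheck, "--format=brief"}
	// A device entry may carry extra options, e.g. "/dev/ada0 -d atacam".
	args = append(args, strings.Fields(device)...)
	if useSudo {
		// Assumption: "sudo -n" so that sudo fails instead of prompting.
		return "sudo", append([]string{"-n", path}, args...)
	}
	return path, args
}

func main() {
	name, args := buildCommand("/usr/sbin/smartctl", true, "standby", "/dev/ada0 -d atacam")
	fmt.Println(name, strings.Join(args, " "))
}
```

The returned pair can be handed directly to `exec.Command(name, args...)` from `os/exec`, which avoids going through a shell and therefore sidesteps the quoting problem of the wrapper script.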
plugins/inputs/smart/smart.go
Outdated
#
## Skip checking disks in this power mode. Defaults to
## "standby" to not wake up disks that have stoped rotating.
## See --nockeck in the man pages for smartctl.
nockeck (sic)
plugins/inputs/smart/smart.go
Outdated
// Get info and attributes for each S.M.A.R.T. device
func (m *Smart) getAttributes(acc telegraf.Accumulator, devices []string) []error {

errchan := make(chan error)
Now that we have Accumulator.AddError you should use this to report errors.
I assume then I don't have to return those errors from Input.Gather() (this isn't obvious from the documentation)?
That's right, only use AddError or return. This way Telegraf counts the correct number of errors in the internal plugin, and the logging looks right.
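As a sketch of the pattern being suggested: each per-device goroutine reports its failure through `AddError`, and the gather function no longer needs an error channel or an error return. The `errorAccumulator` interface and `collector` type below are hypothetical stand-ins for `telegraf.Accumulator` so that the example compiles on its own; only `AddError` is modelled, and the mutex exists because this toy type has to be safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errorAccumulator is a hypothetical stand-in for the part of
// telegraf.Accumulator used here; only AddError is modelled.
type errorAccumulator interface {
	AddError(err error)
}

// collector is a toy accumulator that stores errors for inspection.
type collector struct {
	mu   sync.Mutex
	errs []error
}

func (c *collector) AddError(err error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.errs = append(c.errs, err)
}

// errDeviceFailed is a sample error used by the demonstration below.
var errDeviceFailed = errors.New("exit status 1")

// gatherAll runs gather for every device concurrently and reports failures
// through acc.AddError instead of sending them over an error channel.
func gatherAll(acc errorAccumulator, devices []string, gather func(device string) error) {
	var wg sync.WaitGroup
	for _, device := range devices {
		wg.Add(1)
		go func(device string) {
			defer wg.Done()
			if err := gather(device); err != nil {
				acc.AddError(fmt.Errorf("smart: %s: %w", device, err))
			}
		}(device)
	}
	wg.Wait() // all goroutines are done before Gather would return
}

func main() {
	acc := &collector{}
	gatherAll(acc, []string{"/dev/sda", "/dev/sdb"}, func(device string) error {
		if device == "/dev/sdb" {
			return errDeviceFailed
		}
		return nil
	})
	fmt.Println(len(acc.errs))
}
```

With this shape, `Gather()` would return `nil` after `wg.Wait()`, and each error is counted once, which matches the "only use AddError or return" rule stated above.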
plugins/inputs/smart/smart.go
Outdated

func gatherDisk(acc telegraf.Accumulator, path, nockeck, device string, err chan error) {

// smartctl 5.41 & 5.42 have are broken regarding handling of --nocheck/-n
Do you know which distros contain these versions?
Debian oldstable (7) is the only one that I'm aware of that is still supported.
This adds a new input plugin which uses the `smartctl` utility from the smartmontools package to gather metrics from S.M.A.R.T. storage devices. Signed-off-by: Rickard von Essen <[email protected]>
5.41 and 5.42 have problems determining the current power mode and don't recognise the --nocheck argument even though it's in the docs.
I'm working on addressing the review comments and some improvements. Stay tuned. |
a788a25 to 3d44bad
FYI, I've been running this plugin for two months now with zero issues. Looking forward to it finding its way in to the official release. I would recommend updating the documentation to suggest the user make use of |
@danielnelson This is ready for re-review. |
@stemwinder Good suggestion, can you suggest something for that? |
I thought you were going to add support for the error counters log, but I don't see metrics for these:
|
Using |
@rickard-von-essen I would recommend tucking in something like the following on line 6 of the README:
And then follow that up by using something like @danielnelson Hopefully this is a decent starting point for other plugins or general documentation. The use of non-persistent device paths really does muck things up unless you're dealing with a completely closed system. |
@stemwinder I agree in general, but have two comments: 1) it's Linux-specific; I'll add something about that to the text. 2) If you have the problem of disks moving around, you probably have lots of disks and/or hosts, and then you would most likely want to use the autodetect feature ( |
@danielnelson Yes, but I need some help from @sebito91. (Depending on how quickly we can sort it out, I'll get it into this or a new PR.) None of my drives have the Error counter log. Can you somehow verify whether this is a SAS disk feature or a Toshiba-specific feature? What is the minimal argument you need to pass to |
@rickard-von-essen I completely agree that users with large amounts of systems and disks would probably just rather use the scan option and only pay attention to current state, disregarding state history. But this caveat does need to be explained I think, so that when people see turnovers from hundreds of thousands of start/stops to < 20 on the same device name, for example, they know why. If it were me in their position, I would fork the plugin and change it to derive the WWID from the scan output. |
@rickard-von-essen will get you some more HDD/SSD data later this weekend. |
Added WWN to both smart_device and smart_attribute measurements. And added serial_no also to smart_attribute.
@stemwinder Added WWN and some info in da606ac, please review. |
@rickard-von-essen Can you swap out |
Done |
Is there any windows support for this plugin? I have a fairly large system I would love to test this on |
@vlambaard Yes I guess it should work as long as you have the |
This adds a new input plugin which uses the smartctl utility from the
smartmontools package to gather metrics from S.M.A.R.T. storage devices.
Signed-off-by: Rickard von Essen [email protected]
Supersedes #2402
Closes #1880
Required for all PRs: