-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
initial commit smartctl, telegraf #2319
Conversation
I think #2402 has some extra bits we could incorporate here, good looking out @rickard-von-essen, @sparrc. Happy to combine efforts to make this plugin more complete/usable/amazing. |
Cool! I started to work on this just a few days before you opened this. I'll review and test this. |
Some initial comments:
|
Agreed that this collector should not be limited to Linux, but my experience with OSes only extends about that far, at least as far as any Linux-variant that's not FreeBSD :D. Also, how do you propose collecting counters if sudo is optional? Are you recommending that the user run these collectors as root? If so, that's a danger on its own that should be avoided, and to collect data proper in Linux we'll need root-like perms. Long story short, I was inspired by #1880 and selfishly coded up a solution for my known environment. There is some good discussion in that thread, absolutely happy to open this discussion up further! |
At least on Darwin/macOS there is no need for |
Nobody runs telegraf in production on macOS, and from what I can tell macOS is the only OS that can expose these metrics in user-space without significant permissions fiddling. as to supporting other OSes in general, do we know if the output of smartmontools is consistent? |
Seems consistent, since the code is basically the same and it doesn't present OS dependent information. I think the concerns are for different versions of smartctl and vendors of disks.
Not a real argument, and at least I have mac mini's in a "production like" setup which I would like to monitor for hw failures.
Agree, but OpenBSD for example don't have Add a config option for the full path of #!/bin/bash
sudo /usr/bin/smartctl $@ [[inputs.smartctl]]
path = "/etc/telegraf/wrappers/smartctl" |
fair enough, I wouldn't mind if we added a "path" argument to the plugin, many other plugins that run external processes expose that as well |
@sebito91 there is some issues with parsing. I removed sudo and added some debug printouts. Scan works nicely it finds my 7 ata disks and the CAM passthrough device (which should fail $ ./telegraf_bsd --config smartctl.conf --test
* Plugin: inputs.smartctl, Collection 1
DEBUG: disks: /dev/ada0 -d atacam|/dev/ada1 -d atacam|/dev/ada2 -d atacam|/dev/ada3 -d atacam|/dev/ada4 -d atacam|/dev/ada5 -d atacam|/dev/pass6 -d atacam|/dev/ada6 -d atacam
> smartctl,host=freenas.local,vendor=none,product=none,block_size=none,rpm=5400,transport=none,read_cache=none,writeback=none,name=/dev/ada0_atacam,serial=none health=0,current_temp=0,max_temp=0 1487091972000000000
> smartctl,vendor=none,block_size=none,serial=none,transport=none,host=freenas.local,name=/dev/ada6_atacam,product=none,rpm=none,read_cache=none,writeback=none health=0,current_temp=0,max_temp=0 1487091972000000000
> smartctl,block_size=none,read_cache=none,writeback=none,host=freenas.local,name=/dev/ada3_atacam,vendor=none,rpm=5400,transport=none,product=none,serial=none current_temp=0,max_temp=0,health=0 1487091972000000000
$ smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build) Sample output of |
"Serial": regexp.MustCompile(`Serial number:\s+(\w+)`), | ||
"Rotation": regexp.MustCompile(`Rotation Rate:\s+(\w+)`), | ||
"Transport": regexp.MustCompile(`Transport protocol:\s+(\w+)`), | ||
"Health": regexp.MustCompile(`SMART Health Status:\s+(\w+)`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 6.5: SMART overall-health self-assessment test result: PASSED
// tagFinders is a global map of Disk struct elements to corresponding regexp | ||
var tagFinders = map[string]*regexp.Regexp{ | ||
"Vendor": regexp.MustCompile(`Vendor:\s+(\w+)`), | ||
"Product": regexp.MustCompile(`Product:\s+(\w+)`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 6.5: Model Family: Western Digital Red
and Device Model: WDC WD30EFRX-68EUZN0
"Vendor": regexp.MustCompile(`Vendor:\s+(\w+)`), | ||
"Product": regexp.MustCompile(`Product:\s+(\w+)`), | ||
"Block": regexp.MustCompile(`Logical block size:\s+(\w+)`), | ||
"Serial": regexp.MustCompile(`Serial number:\s+(\w+)`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 6.5 Serial Number: WD-WMC4N0905071
|
||
// fieldFinders is a global map of DiskStats struct elements | ||
var fieldFinders = map[string]*regexp.Regexp{ | ||
"CurrentTemp": regexp.MustCompile(`Current Drive Temperature:\s+([0-9]+)`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In 6.5 Current Temperature: 37 Celsius
So it looks like there are some oddities across the versions. Thanks for the gist input, I'll drop that into the smartctl_test.go file and see what comes out from it. I'd be surprised that one 3 of the 7 drives return data, hopefully it's not a concurrency issue :/ |
Yesterday I started to dig into this code and see if could get it to work on Darwin and FreeBSD with smartmontools 6.5. You can see my changes on @smartctl-2 mainly concerning Go style. I have some comments on the code:
*) My guess is that these causes my issue with only 3 out of 7 disks showing up.
It's better to do:
Since it makes it easy to parse all attributes, even for ones the we haven't seen. I it would be good with a list of the features that we think is missing in #2402. @sebito91 & @sparrc thoughts, comments? |
Can you explain why you would want to do this? This doesn't seem like good schema design to me. Give some examples of queries and/or GROUP BYs that this would help you to perform.
What do you mean by this? you can still add arbitrary fields to the plugin without having the |
On the contrary this is good be both because of storage model *) and in a UX perspective. Having lots of fields (many of them empty) are inefficient. On the UX side it's tedious to work with measurements with lots of fields (50+). You are probably interest in all attributes for one device or one or a few attributes for all/a group of devices.
Yes, but it rather inconvenient to handle unknown (by smartmontool) attributes such as: UPDATE: *) this might be less of an issue since InfluxDB switched from LevelDB. |
Replying to your earlier comments, I think that we're on different pages here with the idea for this plugin
|
Running a oneliner on my 7 disks: $ time bash -c 'smartctl --scan | cut -d\ -f1 | grep -v pass6 | xargs -I= smartctl -x ='`
[...]
0.209u 0.057s 0:01.27 19.6% 786+302k 0+0io 0pf+0w Is it go that is so slow? Have you benchmarked the code and what do you get if you run the above oneliner? |
Sure, here's my execution on the 96 HDD machine: $ time bash -c "smartctl --scan | cut -d\ -f1,2,3 | xargs -I= bash -c 'smartctl -x '="
[...]
real 5m15.047s
user 0m1.832s
sys 0m36.006s |
OK, interesting, that is about 20 times slower than I expected. |
That's why golang concurrency rocks! I'm going to re-submit with some of the changes you requested in the next couple of days. This is good collaboration especially when comparing workloads in our respective environments. This will ultimately make telegraf strong should @sparrc et al accept this PR. |
How slow is I think it's reasonable to scan on every gather and have a flag for disabling this and just run it on startup if you have slow setup like yours. Most people probably have less than 10 disks. I'll see if there are any differences FreeBSD <-> Linux in term of speed. |
Feature wise I would like to see the following enhancements:
|
OK, will look into that. FYI, the RHEL7 kernel only supports smartctl 6.2 at the moment. I know more bleeding-edge kernels (4.x series) support higher revs so I'll see if I can find a host to test the data on. The smartctl --scan is fine, it executes quickly. The return of data when parsing each individual disk is what's time consuming so should be fine to add the scan. |
nevermind, if there's essentially no overhead to the scan then it could be run per-gather |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sebito91 @rickard-von-essen I went through this pull request as well as #2449 and wrote some suggestions and questions. This will hopefully help us get an idea of how much work is left and which pull requests to move forward with.
As I understand it, the two plugins gather different types of data, with this plugin gathering from the error counter log and #2449 gathering from the attribute table. So maybe it even makes sense to merge the plugins at least from a data model perspective.
Let's discuss what our next steps are before fixing the comments.
|
||
3. A corollary to the `include` set of disks is the `exclude` set. In this case, | ||
the scan will still be run unless `include` is defined but will limit those to | ||
query for based on the set in the `exclude`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file will need updated to follow the example readme a bit more closely, especially with respect to the a description of the output.
## NOTE: If you specify an include list, this will skip the smartctl --scan function | ||
## and only collect for those you've requested (minus any exclusions). | ||
include = ["/dev/bus/0 -d megaraid,24"] | ||
exclude = ["/dev/sda -d scsi"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default should be include = []
, you can comment out these lines with the default values and move examples into your comment above.
cmd = append(cmd, disk...) | ||
|
||
data = Disk{Name: "empty"} | ||
if out, err = exec.Command(s.SudoPath, cmd...).CombinedOutput(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think sudo
use needs to be optional, since there are many way to handle permissions. You can add a use_sudo field like we did in [fail2ban(https://github.com/influxdata/telegraf/blob/ca9cec2c84e7c8796c2e8a747d17d1ad86ce1ae6/plugins/inputs/fail2ban/README.md#configuration), or it might be more readable and extensible to have something like ansible: become_method = "sudo"
Should probably run sudo with -n
to avoid interactive prompt if not configured correctly.
@@ -0,0 +1,447 @@ | |||
// +build linux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't bother to build only on linux, it is probably possible to run this or something similar on another platform.
From my experience with build systems, the good ones set build flags based on available/desired features and not by platform, and unfortunately we don't have a good method to provide this today.
s.DiskFailed[err.Name] = err.Error | ||
a++ | ||
default: | ||
time.Sleep(50 * time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for?
return fmt.Errorf("could not parse all the disks in our list: %v\n", err) | ||
} | ||
|
||
for _, each := range s.DiskOutput { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename each
-> disk
fields["max_temp"] = each.Stats.MaxTemp | ||
|
||
// add the read error row | ||
if len(each.Stats.ReadError) == 7 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the different Stats, it seems to me that if the length is wrong we should report an error.
@@ -0,0 +1,375 @@ | |||
// +build linux |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the tests that run as part of -short
, they should not require sudo or smartctl to pass. For the full tests, it would be okay but they should pass on any system with these utilities, so they can't really depend on output. Probably best to make them all not require sudo/smartctl.
|
||
// ParseStringSlice is a generic function that takes a given regexp and applies to a buf, placing | ||
// output into the dataVar | ||
func (s *SmartCtl) ParseStringSlice(regexp *regexp.Regexp, buf *bytes.Buffer, dataVar *[]string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use return values for output on the Parse* functions
|
||
# Sample Config | ||
``` | ||
## smartctl requires installation of the smartmontools for your distro (linux only) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is optional but it would be a nice touch to add the ability to override the smartctl location. smartctl_path = /usr/local/bin/smartctl
.
@danielnelson, @rickard-von-essen do you want to just merge these into #2449 and close out? Happy to just go with one solution, but leave it to you to determine which one. |
That sounds great by me, thanks @rickard-von-essen. Since you've got the thumbs-up from active use in the community, would be best to merge that way. Let me know if you have any questions, otherwise I'll close this one for now. |
Required for all PRs:
This plugin attempts to query basic S.M.A.R.T attributes from HDDs and SSDs on linux (only for now) using smartmontools. The README gives further explanation of usage + requirements, but this plugin attempts to satisfy the basic elements from #1880.
Please let me know any feedback you may have on how to improve/amend the plugin.