initial commit smartctl, telegraf #2319

sebito91 · 2017-01-25T17:03:35Z

Required for all PRs:

CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
Sign CLA (if not already signed)
README.md updated (if adding a new plugin)

This plugin attempts to query basic S.M.A.R.T attributes from HDDs and SSDs on linux (only for now) using smartmontools. The README gives further explanation of usage + requirements, but this plugin attempts to satisfy the basic elements from #1880.

Please let me know any feedback you may have on how to improve/amend the plugin.

NOTE: data tested with include set to "/dev/bus/0 -d megaraid,8"
[root@localhost sebtest]# ./telegraf --test --config /etc/telegraf/telegraf_seb.conf --input-filter smartctl
* Plugin: inputs.smartctl, Collection 1
> smartctl,name=/dev/bus/0_megaraid_8,transport=SAS,host=localhost,vendor=HGST,product=HUC101212CSS600,block_size=512,writeback=Disabled,env=production,read_cache=Enabled,sr=metrics,dc=carf,bu=linux,serial=L0G77T9G,rpm=10000,cls=server,trd=false ecc_corr_delay_write=0,total_err_corr_write=0,data_write=9372.49,ecc_corr_delay_verify=2,ecc_reverify=0,current_temp=26,ecc_reread=0,corr_algo_verify=412069,uncorr_err_verify=0,total_err_corr_read=0,data_read=41322.953,ecc_corr_fast_read=0,corr_algo_read=573068,uncorr_err_read=0,ecc_corr_fast_write=0,ecc_corr_fast_verify=0,data_verify=42.444,health=1,max_temp=85,corr_algo_write=10933,uncorr_err_write=0,total_err_corr_verify=2,ecc_corr_delay_read=0,ecc_rewrite=0 1485206631000000000

sebito91 · 2017-02-13T22:42:20Z

I think #2402 has some extra bits we could incorporate here, good looking out @rickard-von-essen, @sparrc. Happy to combine efforts to make this plugin more complete/usable/amazing.

rickard-von-essen · 2017-02-14T07:49:02Z

Cool! I started to work on this just a few days before you opened this. I'll review and test this.

rickard-von-essen · 2017-02-14T08:04:53Z

Some initial comments:

This should not be limited to Linux. According to the docs for smartmontools it works on:
- Linux
- FreeBSD
- Darwin
- NetBSD
- Solaris
- Cygwin
- Windows
- OS/2, eComStation
- OpenBSD
  I have tested my code mostly on Darwin and FreeBSD.
sudo should be optional.
I'm a bit worried that there are some version differences. I run 6.5 on all my machines. I'll test and see how it works out.

sebito91 · 2017-02-14T12:56:49Z

Agreed that this collector should not be limited to Linux, but my experience with OSes only extends about that far, at least as far as any Linux-variant that's not FreeBSD :D.

Also, how do you propose collecting counters if sudo is optional? Are you recommending that the user run these collectors as root? If so, that's a danger on its own that should be avoided, and to collect data proper in Linux we'll need root-like perms.

Long story short, I was inspired by #1880 and selfishly coded up a solution for my known environment. There is some good discussion in that thread, absolutely happy to open this discussion up further!

rickard-von-essen · 2017-02-14T14:09:51Z

Also, how do you propose collecting counters if sudo is optional?

At least on Darwin/macOS there is no need for sudo.

sparrc · 2017-02-14T14:51:01Z

At least on Darwin/macOS there is no need for sudo.

Nobody runs telegraf in production on macOS, and from what I can tell macOS is the only OS that can expose these metrics in user-space without significant permissions fiddling.

as to supporting other OSes in general, do we know if the output of smartmontools is consistent?

rickard-von-essen · 2017-02-14T16:53:27Z

as to supporting other OSes in general, do we know if the output of smartmontools is consistent?

Seems consistent, since the code is basically the same and it doesn't present OS dependent information. I think the concerns are for different versions of smartctl and vendors of disks.

Nobody runs telegraf in production on macOS

Not a real argument, and at least I have mac mini's in a "production like" setup which I would like to monitor for hw failures.

macOS is the only OS that can expose these metrics in user-space without significant permissions fiddling

Agree, but OpenBSD for example don't have sudo in the base, they switched to doas. There is a much better pattern:

Add a config option for the full path of smartctl. Then if you need to run it under sudo you just make a wrapper:

#!/bin/bash
sudo /usr/bin/smartctl $@

[[inputs.smartctl]]
  path = "/etc/telegraf/wrappers/smartctl"

sparrc · 2017-02-14T16:59:17Z

fair enough, I wouldn't mind if we added a "path" argument to the plugin, many other plugins that run external processes expose that as well

rickard-von-essen · 2017-02-14T17:24:24Z

@sebito91 there is some issues with parsing.

I removed sudo and added some debug printouts. Scan works nicely it finds my 7 ata disks and the CAM passthrough device (which should fail smartctl) but only three devices get parsed, and many values are not correctly parsed.

$ ./telegraf_bsd --config smartctl.conf --test
* Plugin: inputs.smartctl, Collection 1
DEBUG: disks: /dev/ada0 -d atacam|/dev/ada1 -d atacam|/dev/ada2 -d atacam|/dev/ada3 -d atacam|/dev/ada4 -d atacam|/dev/ada5 -d atacam|/dev/pass6 -d atacam|/dev/ada6 -d atacam
> smartctl,host=freenas.local,vendor=none,product=none,block_size=none,rpm=5400,transport=none,read_cache=none,writeback=none,name=/dev/ada0_atacam,serial=none health=0,current_temp=0,max_temp=0 1487091972000000000
> smartctl,vendor=none,block_size=none,serial=none,transport=none,host=freenas.local,name=/dev/ada6_atacam,product=none,rpm=none,read_cache=none,writeback=none health=0,current_temp=0,max_temp=0 1487091972000000000
> smartctl,block_size=none,read_cache=none,writeback=none,host=freenas.local,name=/dev/ada3_atacam,vendor=none,rpm=5400,transport=none,product=none,serial=none current_temp=0,max_temp=0,health=0 1487091972000000000

$ smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)

Sample output of smartctl --scan | cut -d# -f1 | xargs -I= bash -c 'smartctl -x = >> smartctl_freebsd.txt' see gist

rickard-von-essen · 2017-02-14T20:53:06Z

plugins/inputs/smartctl/smartctl.go

+	"Serial":    regexp.MustCompile(`Serial number:\s+(\w+)`),
+	"Rotation":  regexp.MustCompile(`Rotation Rate:\s+(\w+)`),
+	"Transport": regexp.MustCompile(`Transport protocol:\s+(\w+)`),
+	"Health":    regexp.MustCompile(`SMART Health Status:\s+(\w+)`),


In 6.5: SMART overall-health self-assessment test result: PASSED

rickard-von-essen · 2017-02-14T20:54:45Z

plugins/inputs/smartctl/smartctl.go

+// tagFinders is a global map of Disk struct elements to corresponding regexp
+var tagFinders = map[string]*regexp.Regexp{
+	"Vendor":    regexp.MustCompile(`Vendor:\s+(\w+)`),
+	"Product":   regexp.MustCompile(`Product:\s+(\w+)`),


In 6.5: Model Family: Western Digital Red and Device Model: WDC WD30EFRX-68EUZN0

rickard-von-essen · 2017-02-14T20:55:16Z

plugins/inputs/smartctl/smartctl.go

+	"Vendor":    regexp.MustCompile(`Vendor:\s+(\w+)`),
+	"Product":   regexp.MustCompile(`Product:\s+(\w+)`),
+	"Block":     regexp.MustCompile(`Logical block size:\s+(\w+)`),
+	"Serial":    regexp.MustCompile(`Serial number:\s+(\w+)`),


In 6.5 Serial Number: WD-WMC4N0905071

rickard-von-essen · 2017-02-14T20:56:04Z

plugins/inputs/smartctl/smartctl.go

+
+// fieldFinders is a global map of DiskStats struct elements
+var fieldFinders = map[string]*regexp.Regexp{
+	"CurrentTemp": regexp.MustCompile(`Current Drive Temperature:\s+([0-9]+)`),


In 6.5 Current Temperature: 37 Celsius

sebito91 · 2017-02-14T21:18:12Z

So it looks like there are some oddities across the versions. Thanks for the gist input, I'll drop that into the smartctl_test.go file and see what comes out from it.

I'd be surprised that one 3 of the 7 drives return data, hopefully it's not a concurrency issue :/

rickard-von-essen · 2017-02-15T11:37:54Z

Yesterday I started to dig into this code and see if could get it to work on Darwin and FreeBSD with smartmontools 6.5. You can see my changes on @smartctl-2 mainly concerning Go style.

I have some comments on the code:

--scan has to be done on every Gather(), so hot-swapped devices are handled.
Don't use the Smartctl struct to pass around state, use proper functions. *)
Unnecessary use of go routines and channels. *)

*) My guess is that these causes my issue with only 3 out of 7 disks showing up.

Also I think handling (smart) attributes in a more generic way is more future proof. I.e.
Instead of:

smartctl,<tags> field0=a,...fieldN=b 1485206631000000000

It's better to do:

smartctl,<tags>,name="field0" field0=a 1485206631000000000

Since it makes it easy to parse all attributes, even for ones the we haven't seen.

I it would be good with a list of the features that we think is missing in #2402. @sebito91 & @sparrc thoughts, comments?

sparrc · 2017-02-15T12:36:07Z

smartctl,,name="field0" field0=a 1485206631000000000

Can you explain why you would want to do this? This doesn't seem like good schema design to me. Give some examples of queries and/or GROUP BYs that this would help you to perform.

Since it makes it easy to parse all attributes, even for ones the we haven't seen.

What do you mean by this? you can still add arbitrary fields to the plugin without having the name tag.

rickard-von-essen · 2017-02-15T19:51:22Z

smartctl,,name="field0" field0=a 1485206631000000000

This doesn't seem like good schema design to me.

On the contrary this is good be both because of storage model *) and in a UX perspective. Having lots of fields (many of them empty) are inefficient. On the UX side it's tedious to work with measurements with lots of fields (50+). You are probably interest in all attributes for one device or one or a few attributes for all/a group of devices.

Since it makes it easy to parse all attributes, even for ones the we haven't seen.

What do you mean by this? you can still add arbitrary fields to the plugin without having the name tag.

Yes, but it rather inconvenient to handle unknown (by smartmontool) attributes such as:
169 Unknown_Attribute.

UPDATE: *) this might be less of an issue since InfluxDB switched from LevelDB.

sebito91 · 2017-02-15T20:17:48Z

Replying to your earlier comments, I think that we're on different pages here with the idea for this plugin

Although I understand the (potential) benefit of detecting hot-swappable HDDs the real concern is constantly running expensive --scan operations across all HDDs. The intent is collect once, continuously parse. Of course this is still using a binary to query the data and not actually reading from sysfs which would be better, but at least we can limit some of the impact.
Not sure there's much of a penalty in making functions methods instead of distinct functions. If this is something that @sparrc agrees with then I'll absolutely change, but I think it's fine as a method.
I have machines in our environment with >90 drives, there is positively no way you can query all drives in an interval without using concurrency.
And lastly, the tags/fields seem like they're going to be a contentious issue based on the version of the software installed. I'd like to settle on a defined set that would prove beneficial and see those implemented. I like the idea of some of your deeper queries, I'll see about adding them and update the thread.

rickard-von-essen · 2017-02-15T20:51:53Z

--scan takes ~0.02 seconds for me and -x for one drive is ~0.07 seconds.

Running a oneliner on my 7 disks:

$ time bash -c 'smartctl --scan | cut -d\  -f1 | grep -v pass6 | xargs -I= smartctl -x ='` 
[...]
0.209u 0.057s 0:01.27 19.6%     786+302k 0+0io 0pf+0w

Is it go that is so slow? Have you benchmarked the code and what do you get if you run the above oneliner?

sebito91 · 2017-02-15T22:27:22Z

Sure, here's my execution on the 96 HDD machine:

$ time bash -c "smartctl --scan | cut -d\  -f1,2,3 | xargs -I= bash -c 'smartctl -x '="
[...]
real      5m15.047s
user      0m1.832s
sys       0m36.006s

rickard-von-essen · 2017-02-16T00:13:50Z

real 5m15.047s

OK, interesting, that is about 20 times slower than I expected.

sebito91 · 2017-02-16T01:33:46Z

That's why golang concurrency rocks!

I'm going to re-submit with some of the changes you requested in the next couple of days. This is good collaboration especially when comparing workloads in our respective environments. This will ultimately make telegraf strong should @sparrc et al accept this PR.

rickard-von-essen · 2017-02-16T07:48:21Z

How slow is time smartctl --scan?

I think it's reasonable to scan on every gather and have a flag for disabling this and just run it on startup if you have slow setup like yours. Most people probably have less than 10 disks. I'll see if there are any differences FreeBSD <-> Linux in term of speed.

rickard-von-essen · 2017-02-16T07:53:45Z

Feature wise I would like to see the following enhancements:

smartctl version detection and handling of differences in outputs for versions: 5.42, 6.[0-5].
Scan on every gather with a flag for opt out.
Improved logging of parse errors
A clear naming schema and some docs so it's obvious where each value comes from.

sebito91 · 2017-02-16T15:49:56Z

OK, will look into that.

FYI, the RHEL7 kernel only supports smartctl 6.2 at the moment. I know more bleeding-edge kernels (4.x series) support higher revs so I'll see if I can find a host to test the data on.

The smartctl --scan is fine, it executes quickly. The return of data when parsing each individual disk is what's time consuming so should be fine to add the scan.

sparrc · 2017-02-16T16:43:45Z

~~Might be worth having a config entry for scanning on each gather.~~

~~I would imagine that most people are not going to be hot-swapping drives and would want that turned off, but obviously there is a use-case to turn it on as well.~~

nevermind, if there's essentially no overhead to the scan then it could be run per-gather

danielnelson

@sebito91 @rickard-von-essen I went through this pull request as well as #2449 and wrote some suggestions and questions. This will hopefully help us get an idea of how much work is left and which pull requests to move forward with.

As I understand it, the two plugins gather different types of data, with this plugin gathering from the error counter log and #2449 gathering from the attribute table. So maybe it even makes sense to merge the plugins at least from a data model perspective.

Let's discuss what our next steps are before fixing the comments.

danielnelson · 2017-08-02T21:52:36Z

plugins/inputs/smartctl/README.md

+
+3. A corollary to the `include` set of disks is the `exclude` set. In this case,
+the scan will still be run unless `include` is defined but will limit those to 
+query for based on the set in the `exclude`.


This file will need updated to follow the example readme a bit more closely, especially with respect to the a description of the output.

danielnelson · 2017-08-02T21:55:25Z

plugins/inputs/smartctl/smartctl.go

+  ## NOTE: If you specify an include list, this will skip the smartctl --scan function
+  ## and only collect for those you've requested (minus any exclusions).
+  include = ["/dev/bus/0 -d megaraid,24"]
+  exclude = ["/dev/sda -d scsi"]


Default should be include = [], you can comment out these lines with the default values and move examples into your comment above.

danielnelson · 2017-08-02T22:00:42Z

plugins/inputs/smartctl/smartctl.go

+	cmd = append(cmd, disk...)
+
+	data = Disk{Name: "empty"}
+	if out, err = exec.Command(s.SudoPath, cmd...).CombinedOutput(); err != nil {


I think sudo use needs to be optional, since there are many way to handle permissions. You can add a use_sudo field like we did in [fail2ban(https://github.com/influxdata/telegraf/blob/ca9cec2c84e7c8796c2e8a747d17d1ad86ce1ae6/plugins/inputs/fail2ban/README.md#configuration), or it might be more readable and extensible to have something like ansible: become_method = "sudo"

Should probably run sudo with -n to avoid interactive prompt if not configured correctly.

danielnelson · 2017-08-02T22:05:13Z

plugins/inputs/smartctl/smartctl.go

@@ -0,0 +1,447 @@
+// +build linux


Don't bother to build only on linux, it is probably possible to run this or something similar on another platform.

From my experience with build systems, the good ones set build flags based on available/desired features and not by platform, and unfortunately we don't have a good method to provide this today.

danielnelson · 2017-08-02T22:08:12Z

plugins/inputs/smartctl/smartctl.go

+			s.DiskFailed[err.Name] = err.Error
+			a++
+		default:
+			time.Sleep(50 * time.Millisecond)


What's this for?

danielnelson · 2017-08-02T22:26:55Z

plugins/inputs/smartctl/smartctl.go

+		return fmt.Errorf("could not parse all the disks in our list: %v\n", err)
+	}
+
+	for _, each := range s.DiskOutput {


Rename each -> disk

danielnelson · 2017-08-02T22:28:07Z

plugins/inputs/smartctl/smartctl.go

+		fields["max_temp"] = each.Stats.MaxTemp
+
+		// add the read error row
+		if len(each.Stats.ReadError) == 7 {


On the different Stats, it seems to me that if the length is wrong we should report an error.

danielnelson · 2017-08-02T22:35:46Z

plugins/inputs/smartctl/smartctl_test.go

@@ -0,0 +1,375 @@
+// +build linux


For the tests that run as part of -short, they should not require sudo or smartctl to pass. For the full tests, it would be okay but they should pass on any system with these utilities, so they can't really depend on output. Probably best to make them all not require sudo/smartctl.

danielnelson · 2017-08-02T22:41:20Z

plugins/inputs/smartctl/smartctl.go

+
+// ParseStringSlice is a generic function that takes a given regexp and applies to a buf, placing
+// output into the dataVar
+func (s *SmartCtl) ParseStringSlice(regexp *regexp.Regexp, buf *bytes.Buffer, dataVar *[]string) {


Use return values for output on the Parse* functions

danielnelson · 2017-08-02T22:47:29Z

plugins/inputs/smartctl/README.md

+
+# Sample Config 
+```
+  ## smartctl requires installation of the smartmontools for your distro (linux only)


This is optional but it would be a nice touch to add the ability to override the smartctl location. smartctl_path = /usr/local/bin/smartctl.

sebito91 · 2017-08-07T19:30:49Z

@danielnelson, @rickard-von-essen do you want to just merge these into #2449 and close out? Happy to just go with one solution, but leave it to you to determine which one.

rickard-von-essen · 2017-08-07T19:53:25Z

@sebito91 I can try to bring in stuff from this into #2449 the coming days. (I'm planning on adding similar data as this and make reporting of attributes optional). Does that sound good to you?

sebito91 · 2017-08-08T17:00:24Z

That sounds great by me, thanks @rickard-von-essen. Since you've got the thumbs-up from active use in the community, would be best to merge that way.

Let me know if you have any questions, otherwise I'll close this one for now.

initial commit smartctl, telegraf

3e27944

sebito91 force-pushed the smartctl branch from 7daddd7 to 3e27944 Compare January 25, 2017 17:36

sparrc added this to the 1.3.0 milestone Jan 25, 2017

sparrc mentioned this pull request Feb 13, 2017

smart: Gather S.M.A.R.T. information from storage devices #2402

Closed

3 tasks

rickard-von-essen reviewed Feb 14, 2017

View reviewed changes

rickard-von-essen mentioned this pull request Feb 24, 2017

smart: Gather S.M.A.R.T. information from storage devices #2449

Merged

3 tasks

danielnelson modified the milestones: 1.3.0, 1.4.0 Apr 20, 2017

danielnelson modified the milestones: 1.4.0, 1.3.0 Apr 20, 2017

danielnelson reviewed Aug 2, 2017

View reviewed changes

sebito91 closed this Aug 8, 2017

initial commit smartctl, telegraf #2319

initial commit smartctl, telegraf #2319

Conversation

sebito91 commented Jan 25, 2017 • edited Loading

Required for all PRs:

sebito91 commented Feb 13, 2017

rickard-von-essen commented Feb 14, 2017

rickard-von-essen commented Feb 14, 2017

sebito91 commented Feb 14, 2017

rickard-von-essen commented Feb 14, 2017

sparrc commented Feb 14, 2017

rickard-von-essen commented Feb 14, 2017

sparrc commented Feb 14, 2017

rickard-von-essen commented Feb 14, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sebito91 commented Feb 14, 2017

rickard-von-essen commented Feb 15, 2017

sparrc commented Feb 15, 2017

rickard-von-essen commented Feb 15, 2017 • edited Loading

sebito91 commented Feb 15, 2017

rickard-von-essen commented Feb 15, 2017

sebito91 commented Feb 15, 2017

rickard-von-essen commented Feb 16, 2017

sebito91 commented Feb 16, 2017

rickard-von-essen commented Feb 16, 2017

rickard-von-essen commented Feb 16, 2017 • edited Loading

sebito91 commented Feb 16, 2017

sparrc commented Feb 16, 2017 • edited Loading

danielnelson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sebito91 commented Aug 7, 2017

rickard-von-essen commented Aug 7, 2017

sebito91 commented Aug 8, 2017

sebito91 commented Jan 25, 2017 •

edited

Loading

rickard-von-essen commented Feb 14, 2017 •

edited

Loading

rickard-von-essen commented Feb 15, 2017 •

edited

Loading

rickard-von-essen commented Feb 16, 2017 •

edited

Loading

sparrc commented Feb 16, 2017 •

edited

Loading