Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial commit smartctl, telegraf #2319

Closed
wants to merge 1 commit into from

Conversation

sebito91
Copy link
Contributor

@sebito91 sebito91 commented Jan 25, 2017

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

This plugin attempts to query basic S.M.A.R.T attributes from HDDs and SSDs on linux (only for now) using smartmontools. The README gives further explanation of usage + requirements, but this plugin attempts to satisfy the basic elements from #1880.

Please let me know any feedback you may have on how to improve/amend the plugin.

NOTE: data tested with include set to "/dev/bus/0 -d megaraid,8"
[root@localhost sebtest]# ./telegraf --test --config /etc/telegraf/telegraf_seb.conf --input-filter smartctl
* Plugin: inputs.smartctl, Collection 1
> smartctl,name=/dev/bus/0_megaraid_8,transport=SAS,host=localhost,vendor=HGST,product=HUC101212CSS600,block_size=512,writeback=Disabled,env=production,read_cache=Enabled,sr=metrics,dc=carf,bu=linux,serial=L0G77T9G,rpm=10000,cls=server,trd=false ecc_corr_delay_write=0,total_err_corr_write=0,data_write=9372.49,ecc_corr_delay_verify=2,ecc_reverify=0,current_temp=26,ecc_reread=0,corr_algo_verify=412069,uncorr_err_verify=0,total_err_corr_read=0,data_read=41322.953,ecc_corr_fast_read=0,corr_algo_read=573068,uncorr_err_read=0,ecc_corr_fast_write=0,ecc_corr_fast_verify=0,data_verify=42.444,health=1,max_temp=85,corr_algo_write=10933,uncorr_err_write=0,total_err_corr_verify=2,ecc_corr_delay_read=0,ecc_rewrite=0 1485206631000000000

@sebito91
Copy link
Contributor Author

I think #2402 has some extra bits we could incorporate here, good looking out @rickard-von-essen, @sparrc. Happy to combine efforts to make this plugin more complete/usable/amazing.

@rickard-von-essen
Copy link
Contributor

Cool! I started to work on this just a few days before you opened this. I'll review and test this.

@rickard-von-essen
Copy link
Contributor

Some initial comments:

  • This should not be limited to Linux. According to the docs for smartmontools it works on:
    • Linux
    • FreeBSD
    • Darwin
    • NetBSD
    • Solaris
    • Cygwin
    • Windows
    • OS/2, eComStation
    • OpenBSD
      I have tested my code mostly on Darwin and FreeBSD.
  • sudo should be optional.
  • I'm a bit worried that there are some version differences. I run 6.5 on all my machines. I'll test and see how it works out.

@sebito91
Copy link
Contributor Author

Agreed that this collector should not be limited to Linux, but my experience with OSes only extends about that far, at least as far as any Linux-variant that's not FreeBSD :D.

Also, how do you propose collecting counters if sudo is optional? Are you recommending that the user run these collectors as root? If so, that's a danger on its own that should be avoided, and to collect data proper in Linux we'll need root-like perms.

Long story short, I was inspired by #1880 and selfishly coded up a solution for my known environment. There is some good discussion in that thread, absolutely happy to open this discussion up further!

@rickard-von-essen
Copy link
Contributor

Also, how do you propose collecting counters if sudo is optional?

At least on Darwin/macOS there is no need for sudo.

@sparrc
Copy link
Contributor

sparrc commented Feb 14, 2017

At least on Darwin/macOS there is no need for sudo.

Nobody runs telegraf in production on macOS, and from what I can tell macOS is the only OS that can expose these metrics in user-space without significant permissions fiddling.

as to supporting other OSes in general, do we know if the output of smartmontools is consistent?

@rickard-von-essen
Copy link
Contributor

as to supporting other OSes in general, do we know if the output of smartmontools is consistent?

Seems consistent, since the code is basically the same and it doesn't present OS dependent information. I think the concerns are for different versions of smartctl and vendors of disks.

Nobody runs telegraf in production on macOS

Not a real argument, and at least I have mac mini's in a "production like" setup which I would like to monitor for hw failures.

macOS is the only OS that can expose these metrics in user-space without significant permissions fiddling

Agree, but OpenBSD for example don't have sudo in the base, they switched to doas. There is a much better pattern:

Add a config option for the full path of smartctl. Then if you need to run it under sudo you just make a wrapper:

#!/bin/bash
sudo /usr/bin/smartctl $@
[[inputs.smartctl]]
  path = "/etc/telegraf/wrappers/smartctl"

@sparrc
Copy link
Contributor

sparrc commented Feb 14, 2017

fair enough, I wouldn't mind if we added a "path" argument to the plugin, many other plugins that run external processes expose that as well

@rickard-von-essen
Copy link
Contributor

rickard-von-essen commented Feb 14, 2017

@sebito91 there is some issues with parsing.

I removed sudo and added some debug printouts. Scan works nicely it finds my 7 ata disks and the CAM passthrough device (which should fail smartctl) but only three devices get parsed, and many values are not correctly parsed.

$ ./telegraf_bsd --config smartctl.conf --test
* Plugin: inputs.smartctl, Collection 1
DEBUG: disks: /dev/ada0 -d atacam|/dev/ada1 -d atacam|/dev/ada2 -d atacam|/dev/ada3 -d atacam|/dev/ada4 -d atacam|/dev/ada5 -d atacam|/dev/pass6 -d atacam|/dev/ada6 -d atacam
> smartctl,host=freenas.local,vendor=none,product=none,block_size=none,rpm=5400,transport=none,read_cache=none,writeback=none,name=/dev/ada0_atacam,serial=none health=0,current_temp=0,max_temp=0 1487091972000000000
> smartctl,vendor=none,block_size=none,serial=none,transport=none,host=freenas.local,name=/dev/ada6_atacam,product=none,rpm=none,read_cache=none,writeback=none health=0,current_temp=0,max_temp=0 1487091972000000000
> smartctl,block_size=none,read_cache=none,writeback=none,host=freenas.local,name=/dev/ada3_atacam,vendor=none,rpm=5400,transport=none,product=none,serial=none current_temp=0,max_temp=0,health=0 1487091972000000000

$ smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)

Sample output of smartctl --scan | cut -d# -f1 | xargs -I= bash -c 'smartctl -x = >> smartctl_freebsd.txt' see gist

"Serial": regexp.MustCompile(`Serial number:\s+(\w+)`),
"Rotation": regexp.MustCompile(`Rotation Rate:\s+(\w+)`),
"Transport": regexp.MustCompile(`Transport protocol:\s+(\w+)`),
"Health": regexp.MustCompile(`SMART Health Status:\s+(\w+)`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 6.5: SMART overall-health self-assessment test result: PASSED

// tagFinders is a global map of Disk struct elements to corresponding regexp
var tagFinders = map[string]*regexp.Regexp{
"Vendor": regexp.MustCompile(`Vendor:\s+(\w+)`),
"Product": regexp.MustCompile(`Product:\s+(\w+)`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 6.5: Model Family: Western Digital Red and Device Model: WDC WD30EFRX-68EUZN0

"Vendor": regexp.MustCompile(`Vendor:\s+(\w+)`),
"Product": regexp.MustCompile(`Product:\s+(\w+)`),
"Block": regexp.MustCompile(`Logical block size:\s+(\w+)`),
"Serial": regexp.MustCompile(`Serial number:\s+(\w+)`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 6.5 Serial Number: WD-WMC4N0905071


// fieldFinders is a global map of DiskStats struct elements
var fieldFinders = map[string]*regexp.Regexp{
"CurrentTemp": regexp.MustCompile(`Current Drive Temperature:\s+([0-9]+)`),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In 6.5 Current Temperature: 37 Celsius

@sebito91
Copy link
Contributor Author

So it looks like there are some oddities across the versions. Thanks for the gist input, I'll drop that into the smartctl_test.go file and see what comes out from it.

I'd be surprised that one 3 of the 7 drives return data, hopefully it's not a concurrency issue :/

@rickard-von-essen
Copy link
Contributor

Yesterday I started to dig into this code and see if could get it to work on Darwin and FreeBSD with smartmontools 6.5. You can see my changes on @smartctl-2 mainly concerning Go style.

I have some comments on the code:

  • --scan has to be done on every Gather(), so hot-swapped devices are handled.
  • Don't use the Smartctl struct to pass around state, use proper functions. *)
  • Unnecessary use of go routines and channels. *)

*) My guess is that these causes my issue with only 3 out of 7 disks showing up.

  • Also I think handling (smart) attributes in a more generic way is more future proof. I.e.
    Instead of:
smartctl,<tags> field0=a,...fieldN=b 1485206631000000000

It's better to do:

smartctl,<tags>,name="field0" field0=a 1485206631000000000

Since it makes it easy to parse all attributes, even for ones the we haven't seen.

I it would be good with a list of the features that we think is missing in #2402. @sebito91 & @sparrc thoughts, comments?

@sparrc
Copy link
Contributor

sparrc commented Feb 15, 2017

smartctl,,name="field0" field0=a 1485206631000000000

Can you explain why you would want to do this? This doesn't seem like good schema design to me. Give some examples of queries and/or GROUP BYs that this would help you to perform.

Since it makes it easy to parse all attributes, even for ones the we haven't seen.

What do you mean by this? you can still add arbitrary fields to the plugin without having the name tag.

@rickard-von-essen
Copy link
Contributor

rickard-von-essen commented Feb 15, 2017

smartctl,,name="field0" field0=a 1485206631000000000

This doesn't seem like good schema design to me.

On the contrary this is good be both because of storage model *) and in a UX perspective. Having lots of fields (many of them empty) are inefficient. On the UX side it's tedious to work with measurements with lots of fields (50+). You are probably interest in all attributes for one device or one or a few attributes for all/a group of devices.

Since it makes it easy to parse all attributes, even for ones the we haven't seen.

What do you mean by this? you can still add arbitrary fields to the plugin without having the name tag.

Yes, but it rather inconvenient to handle unknown (by smartmontool) attributes such as:
169 Unknown_Attribute.

UPDATE: *) this might be less of an issue since InfluxDB switched from LevelDB.

@sebito91
Copy link
Contributor Author

Replying to your earlier comments, I think that we're on different pages here with the idea for this plugin

  1. Although I understand the (potential) benefit of detecting hot-swappable HDDs the real concern is constantly running expensive --scan operations across all HDDs. The intent is collect once, continuously parse. Of course this is still using a binary to query the data and not actually reading from sysfs which would be better, but at least we can limit some of the impact.

  2. Not sure there's much of a penalty in making functions methods instead of distinct functions. If this is something that @sparrc agrees with then I'll absolutely change, but I think it's fine as a method.

  3. I have machines in our environment with >90 drives, there is positively no way you can query all drives in an interval without using concurrency.

  4. And lastly, the tags/fields seem like they're going to be a contentious issue based on the version of the software installed. I'd like to settle on a defined set that would prove beneficial and see those implemented. I like the idea of some of your deeper queries, I'll see about adding them and update the thread.

@rickard-von-essen
Copy link
Contributor

--scan takes ~0.02 seconds for me and -x for one drive is ~0.07 seconds.

Running a oneliner on my 7 disks:

$ time bash -c 'smartctl --scan | cut -d\  -f1 | grep -v pass6 | xargs -I= smartctl -x ='` 
[...]
0.209u 0.057s 0:01.27 19.6%     786+302k 0+0io 0pf+0w

Is it go that is so slow? Have you benchmarked the code and what do you get if you run the above oneliner?

@sebito91
Copy link
Contributor Author

Sure, here's my execution on the 96 HDD machine:

$ time bash -c "smartctl --scan | cut -d\  -f1,2,3 | xargs -I= bash -c 'smartctl -x '="
[...]
real      5m15.047s
user      0m1.832s
sys       0m36.006s

@rickard-von-essen
Copy link
Contributor

real 5m15.047s

OK, interesting, that is about 20 times slower than I expected.

@sebito91
Copy link
Contributor Author

That's why golang concurrency rocks!

I'm going to re-submit with some of the changes you requested in the next couple of days. This is good collaboration especially when comparing workloads in our respective environments. This will ultimately make telegraf strong should @sparrc et al accept this PR.

@rickard-von-essen
Copy link
Contributor

How slow is time smartctl --scan?

I think it's reasonable to scan on every gather and have a flag for disabling this and just run it on startup if you have slow setup like yours. Most people probably have less than 10 disks. I'll see if there are any differences FreeBSD <-> Linux in term of speed.

@rickard-von-essen
Copy link
Contributor

rickard-von-essen commented Feb 16, 2017

Feature wise I would like to see the following enhancements:

  • smartctl version detection and handling of differences in outputs for versions: 5.42, 6.[0-5].
  • Scan on every gather with a flag for opt out.
  • Improved logging of parse errors
  • A clear naming schema and some docs so it's obvious where each value comes from.

@sebito91
Copy link
Contributor Author

OK, will look into that.

FYI, the RHEL7 kernel only supports smartctl 6.2 at the moment. I know more bleeding-edge kernels (4.x series) support higher revs so I'll see if I can find a host to test the data on.

The smartctl --scan is fine, it executes quickly. The return of data when parsing each individual disk is what's time consuming so should be fine to add the scan.

@sparrc
Copy link
Contributor

sparrc commented Feb 16, 2017

Might be worth having a config entry for scanning on each gather.

I would imagine that most people are not going to be hot-swapping drives and would want that turned off, but obviously there is a use-case to turn it on as well.

nevermind, if there's essentially no overhead to the scan then it could be run per-gather

@danielnelson danielnelson modified the milestones: 1.4.0, 1.3.0 Apr 20, 2017
Copy link
Contributor

@danielnelson danielnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sebito91 @rickard-von-essen I went through this pull request as well as #2449 and wrote some suggestions and questions. This will hopefully help us get an idea of how much work is left and which pull requests to move forward with.

As I understand it, the two plugins gather different types of data, with this plugin gathering from the error counter log and #2449 gathering from the attribute table. So maybe it even makes sense to merge the plugins at least from a data model perspective.

Let's discuss what our next steps are before fixing the comments.


3. A corollary to the `include` set of disks is the `exclude` set. In this case,
the scan will still be run unless `include` is defined but will limit those to
query for based on the set in the `exclude`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file will need updated to follow the example readme a bit more closely, especially with respect to the a description of the output.

## NOTE: If you specify an include list, this will skip the smartctl --scan function
## and only collect for those you've requested (minus any exclusions).
include = ["/dev/bus/0 -d megaraid,24"]
exclude = ["/dev/sda -d scsi"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default should be include = [], you can comment out these lines with the default values and move examples into your comment above.

cmd = append(cmd, disk...)

data = Disk{Name: "empty"}
if out, err = exec.Command(s.SudoPath, cmd...).CombinedOutput(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think sudo use needs to be optional, since there are many way to handle permissions. You can add a use_sudo field like we did in [fail2ban(https://github.com/influxdata/telegraf/blob/ca9cec2c84e7c8796c2e8a747d17d1ad86ce1ae6/plugins/inputs/fail2ban/README.md#configuration), or it might be more readable and extensible to have something like ansible: become_method = "sudo"

Should probably run sudo with -n to avoid interactive prompt if not configured correctly.

@@ -0,0 +1,447 @@
// +build linux
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't bother to build only on linux, it is probably possible to run this or something similar on another platform.

From my experience with build systems, the good ones set build flags based on available/desired features and not by platform, and unfortunately we don't have a good method to provide this today.

s.DiskFailed[err.Name] = err.Error
a++
default:
time.Sleep(50 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

return fmt.Errorf("could not parse all the disks in our list: %v\n", err)
}

for _, each := range s.DiskOutput {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename each -> disk

fields["max_temp"] = each.Stats.MaxTemp

// add the read error row
if len(each.Stats.ReadError) == 7 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the different Stats, it seems to me that if the length is wrong we should report an error.

@@ -0,0 +1,375 @@
// +build linux
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the tests that run as part of -short, they should not require sudo or smartctl to pass. For the full tests, it would be okay but they should pass on any system with these utilities, so they can't really depend on output. Probably best to make them all not require sudo/smartctl.


// ParseStringSlice is a generic function that takes a given regexp and applies to a buf, placing
// output into the dataVar
func (s *SmartCtl) ParseStringSlice(regexp *regexp.Regexp, buf *bytes.Buffer, dataVar *[]string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use return values for output on the Parse* functions


# Sample Config
```
## smartctl requires installation of the smartmontools for your distro (linux only)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is optional but it would be a nice touch to add the ability to override the smartctl location. smartctl_path = /usr/local/bin/smartctl.

@sebito91
Copy link
Contributor Author

sebito91 commented Aug 7, 2017

@danielnelson, @rickard-von-essen do you want to just merge these into #2449 and close out? Happy to just go with one solution, but leave it to you to determine which one.

@rickard-von-essen
Copy link
Contributor

@sebito91 I can try to bring in stuff from this into #2449 the coming days. (I'm planning on adding similar data as this and make reporting of attributes optional). Does that sound good to you?

@sebito91
Copy link
Contributor Author

sebito91 commented Aug 8, 2017

That sounds great by me, thanks @rickard-von-essen. Since you've got the thumbs-up from active use in the community, would be best to merge that way.

Let me know if you have any questions, otherwise I'll close this one for now.

@sebito91 sebito91 closed this Aug 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants