add indicesstats and shardstats to ES metrics #2518

MatthewOHaraTR · 2017-03-08T20:54:57Z

Required for all PRs:

CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
[X ] Sign CLA (if not already signed)
README.md updated (if adding a new plugin)

@sparrc
Here's the (hopefully) rebased version of my ES plugin changes

fixes #1956

Typo

primarily for a fix in Windows network counter getting code closes #1949

I think this is a copy paste bug? ;-)

closes #1564 also add unit and benchmark tests

Also don't use named returns in fetchNamespaceMetrics since it's non-standard for the rest of the codebase.

- fully document aggregator and processor plugins - improve readme.md closes #1989

closes #2023

* patching udp_listener for fun updating with errcode adding debug flags to temp msgs moving from debug to info * updating PR 1883 based on feedback

* Cache and expire metrics for prometheus output * Fix test * Use interval.Duration * Default prometheus expiration interval to 60s * Update changelog

) * added connection Timeout parameter, basic HTTP authentication and HTTP support with Sslskipverify option * updated README.md * added optional SSL config, changed timeout name and type, and other minor fixes * added some code style improvements * Update README.md

* Trim null characters in Value data format Some producers (such as the paho embedded c mqtt client) add a null character "\x00" to the end of a message. The Value parser would fail on any message from such a producer. * Trim whitespace and null in all Value data formats * No unnecessary reassignments in Value data format parser * Update change log for Value data format fix

* Export IopsInProgress * Export IopsInProgress * Export IopsInProgress

#2029)

* [plugins] rabbitmq input plugin: add non default http timeouts * update CHANGELOG.md

The old gonuts fork has no License and has not seen any commits differing from the original project, while the original has seen some activity, even if low. Having no license is a problem for distributors, as by default, such projects are undistributable.

MatthewOHaraTR · 2017-03-27T17:37:39Z

@danielnelson - I suspect you'll need to catch up on this, but there are a couple (possibly 3 now) updates to the elasticsearch plugin for consideration. Please let me know whether you need changes to mine before acceptance of the PR.

MatthewOHaraTR · 2017-04-03T17:54:16Z

@danielnelson - Do you know when you'll have a chance to look at this?

danielnelson · 2017-04-03T18:05:48Z

@mhohara Hoping to take a look at several ES tasks soon, but don't want to lock myself in with a date :)

vbeskrovnov · 2017-04-12T17:37:09Z

plugins/inputs/elasticsearch/elasticsearch.go

-			e.isMaster = (id == e.catMasterResponseTokens[0])
-		}
+		// check for master
+		e.localNodeIsMaster = (id == e.masterNodeId)


Hi, this part of the code is executed in a loop for every node in the cluster, so only the information about the last node is saved in e.localNodeIsMaster. Because of this, the counts of master and data nodes are not always written to elasticsearch_clusterstats_nodes. I described this in #2650. Please tell me if I'm wrong.
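To make the overwrite concrete, here is a minimal Go sketch of the pattern described in this comment. The function names (`lastNodeWins`, `localIsMaster`) and node IDs are hypothetical and are not taken from the plugin; the sketch only assumes that a single flag is reassigned once per node inside the loop.

```go
package main

import "fmt"

// lastNodeWins mimics the pattern under review: one flag is reassigned on
// every pass of the per-node loop, so only the final node decides its value.
// (Hypothetical helper, not the plugin's code.)
func lastNodeWins(nodeIDs []string, masterID string) bool {
	isMaster := false
	for _, id := range nodeIDs {
		isMaster = (id == masterID)
	}
	return isMaster
}

// localIsMaster compares only the local node's ID with the master ID, which
// is the comparison the flag is meant to express.
func localIsMaster(localID, masterID string) bool {
	return localID == masterID
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	// node-a is the master, but the loop ends on node-c.
	fmt.Println(lastNodeWins(nodes, "node-a"))     // false
	fmt.Println(localIsMaster("node-a", "node-a")) // true
}
```

With node stats restricted to the local node, the loop would see a single entry and the two functions would agree.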

Yes, we need to fix this issue before we can merge this PR.

Sorry for the delay in response. My thought is that the master-only statistics related to this flag should be restricted to only work when local=true. Then only one node would be reported in the node-stats, and this flag would make sense in that context. This is actually the way we're configuring local here, which seems to work fine. If memory serves, I removed that restriction due to a comment against an earlier version. I'm thinking I should put that restriction back in. Would this approach work for you?

I think the real problem here is that since we run the gather functions concurrently on all servers there is a race condition as to which server sets the value on the struct. This is a current bug and should be resolved on a pull request separate from this one.
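One possible way to avoid the shared-field race described here is sketched below. It assumes one goroutine per configured server, as the comment describes; the `server` type, the `gatherAll` function, and the URLs are hypothetical. Each goroutine keeps its result local and publishes it under a mutex, keyed by server, rather than writing to a single field on the plugin struct.

```go
package main

import (
	"fmt"
	"sync"
)

// server is a hypothetical stand-in for one configured Elasticsearch URL
// together with the node ID that answered on it.
type server struct {
	url    string
	nodeID string
}

// gatherAll checks every server concurrently. Each goroutine computes its
// answer in a local variable and stores it under a mutex, keyed by URL,
// so no goroutine can overwrite another server's result.
func gatherAll(servers []server, masterID string) map[string]bool {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]bool, len(servers))
	)
	for _, s := range servers {
		wg.Add(1)
		go func(s server) {
			defer wg.Done()
			isMaster := s.nodeID == masterID // local to this goroutine
			mu.Lock()
			results[s.url] = isMaster
			mu.Unlock()
		}(s)
	}
	wg.Wait()
	return results
}

func main() {
	servers := []server{
		{url: "http://es1:9200", nodeID: "n1"},
		{url: "http://es2:9200", nodeID: "n2"},
	}
	res := gatherAll(servers, "n2")
	fmt.Println(res["http://es1:9200"], res["http://es2:9200"])
}
```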

Perhaps, although presumably that shouldn't block this PR. In some regards though, as these master-only requests should only be coming from the master node, it's really not necessary to run with local=false. There's no need to have all nodes get data from all other nodes - there's no control over who's master, so why get all the redundant information? Or maybe I just don't understand the use case for local=false.

danielnelson

I haven't reviewed the entire PR. I have some concerns around maintaining backwards compatibility; we should only break it if we have a very good reason. Can you update the README with details about the new measurements and how they will look?

Also, #2650 looks like an important issue that we should fix before merging this change.

danielnelson · 2017-04-18T00:37:07Z

plugins/inputs/elasticsearch/README.md

+
+  ## Set shards_stats to true when you want to obtain shards stats from the Master node.
+  ## If set, then indices_stats is considered true as they are also provided with shard stats.
+  shards_stats = false


This is named shards_stats here and in telegraf.conf but in elasticsearch.go it is indices_shards_stats.

Agreed. I changed the elasticsearch.go to correspond with the others.

danielnelson · 2017-04-18T00:38:54Z

plugins/parsers/json/parser.go

@@ -156,7 +156,6 @@ func (f *JSONFlattener) FullFlattenJSON(
 		}
 	case nil:
 		// ignored types
-		fmt.Println("json parser ignoring " + fieldname)


This debug statement needs to be removed

danielnelson · 2017-04-18T00:40:27Z

plugins/inputs/elasticsearch/elasticsearch.go

@@ -17,71 +16,15 @@ import (
 	"strings"
 )

-// mask for masking username/password from error messages
-var mask = regexp.MustCompile(`https?:\/\/\S+:\S+@`)


Why is this removed?

Searching under Telegraf, I did not see anything using it, nor did I see something similar in other .go files. Did I miss something? I can certainly restore it if you'd like - let me know.

Ah - Cameron put in a change the same day which added this and its functionality. Presumably this would be fixed with a merge.

It might be best to rebase, we have also removed the errchan bits in favor of AddError.

Right, I saw that when I merged. There were only a few lines that were different so it's not a big deal. I can certainly submit these changes with my next commit.

I noticed those changes when I did the merge. It was only a couple of lines, so I'll just include them in my next commit.

danielnelson · 2017-04-18T01:23:52Z

plugins/inputs/elasticsearch/elasticsearch.go

 		if err != nil {
 			return err
 		}
-		acc.AddFields("elasticsearch_clusterstats_"+p, f.Fields, tags, now)
+		acc.AddFields("elasticsearch_clusterstats_"+name, f.Fields, map[string]string{"name": ""}, now)


Why are we adding a tag with no value?

Good question. This took me a while to figure out, and I'm not sure that my answer is correct. The old code had a structure to pull out the tags (node_name, cluster_name and status). However, I ran into a problem trying to represent these as Tags on a Grafana dashboard, but when I stopped making them tags, then they represented just fine (as Strings using single-stats). It's been too long now and I don't recall the specifics and the timing of this work. If you think it should be possible to represent them, as tags, using SingleStat - then I can certainly put them back as Tags and give it a try that way if you'd like. Also note that node_name went away between 2.3.3 and 2.4.2, so that part of the tags/structure would no longer be possible to define. Please let me know your thoughts.

danielnelson · 2017-04-18T01:34:23Z

plugins/inputs/elasticsearch/elasticsearch.go

-	NodeIP   string `json:"ip"`
-	NodeName string `json:"node"`
-}
-


Is there a reason these structs need to move? If not, please move them back so it is easier to see the differences and track changes.

The catMaster struct was not used in the previous version, so it could be removed. I moved the other structs down into the functions which referenced them to make their structure more visible in the context in which they're being used (why jump up and down in the text to see what you're working on?). I can put them back if you'd prefer and it's the normal coding standard, but in my mind this is cleaner and provides better structure.

Let's leave them up top, if only to reduce the cognitive load of reviewing the diffs.

danielnelson · 2017-04-18T02:02:31Z

plugins/inputs/elasticsearch/elasticsearch.go

@@ -309,46 +316,197 @@ func (e *Elasticsearch) gatherClusterHealth(url string, acc telegraf.Accumulator
 			"unassigned_shards":     health.UnassignedShards,
 		}
 		acc.AddFields(
-			"elasticsearch_indices",
+			"elasticsearch_cluster_health_indices",


This breaks compatibility, is it a required change?

I did consider this. However, this is the set of metrics for 'clusterhealth' - calling it 'indices' was misleading and interfered with the naming for the new indices-related information (elasticsearch_indicesstats_shards & elasticsearch_indicesstats_, elasticsearch_indicesstats_shards_primary, elasticsearch_indicesstats_shards_replica). It seemed better to rename it to indicate its true meaning more accurately and avoid confusion with all the other indices types of metrics.

danielnelson · 2017-04-18T02:04:29Z

plugins/inputs/elasticsearch/elasticsearch.go

-			e.isMaster = (id == e.catMasterResponseTokens[0])
-		}
+		// check for master
+		e.localNodeIsMaster = (id == e.masterNodeId)


Yes, we need to fix this issue before we can merge this PR.

MatthewOHaraTR · 2017-05-03T16:10:11Z

@danielnelson Let me know what you think of my responses. I'm starting to work now on updating my grafana dashboard (and possibly this Telegraf plugin) to work with ES5.1.1. I may have more changes to submit when I'm done, or possibly questions on how to best address versioning changes in this environment.

Not required

danielnelson · 2017-05-03T22:10:03Z

Could you update the README with the new measurements? The details of the code can be hammered out later, but let's first try to come to agreement on the measurements, tags, and fields.

danielnelson · 2017-05-03T23:08:24Z

plugins/inputs/elasticsearch/elasticsearch.go

+			}
+
+			if e.IndicesStats && !e.ShardsStats && e.localNodeIsMaster {
+				e.gatherIndicesStats(s+"_all/_stats", acc, false)


How does /_all/_stats compare to /_stats/ and can you also link to some docs?

Here are the docs I'm looking at for /_stats/:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html

It seems reasonable to think that /_stats is the same as /_all/_stats. The _all part just specifies all indices for the index-name portion so it returns all index level stats (which is what I wanted). However, /_stats indicates that it returns high level aggregation as well, so I'm not sure whether that might be different (and hence not part of the metrics that the plugin is now gathering).

I think it is safe to assume that both endpoints return the same data, according to the description here: https://github.com/elastic/elasticsearch/blob/master/rest-api-spec/src/main/resources/rest-api-spec/api/indices.stats.json

@lpic10 Thanks for the reference. It certainly clearly shows the '_all' does give all indices. But whether the underlying code does something different between _all/_stats and _stats isn't really clear in this. Like I said though, it seems likely. And the worst that could happen would be that extra metrics might be received - which wouldn't be a bad thing. Other than that, is there a reason to prefer just _stats over _all/_stats? I can certainly make the change if desired.

@danielnelson I forgot to mention, the indices-stats was already linked in the Readme.
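The endpoint choice discussed in this thread can be sketched as a small helper. This is an illustration under stated assumptions: the `indicesStatsURL` function is hypothetical, the base URL is assumed to end in a slash (as `s+"_all/_stats"` in the diff suggests), and the `?level=shards` query used for the shard case is an assumption based on the Elasticsearch indices stats API rather than something shown in the PR. Per the README text quoted earlier, enabling shard stats is treated as implying index stats.

```go
package main

import "fmt"

// indicesStatsURL picks the index-stats endpoint for the given options and
// reports whether any request should be made. Only the master node gathers
// these stats, mirroring the localNodeIsMaster check in the diff.
// The "?level=shards" form is an assumption made for this sketch.
func indicesStatsURL(base string, indicesStats, shardsStats, localNodeIsMaster bool) (string, bool) {
	if !localNodeIsMaster {
		return "", false
	}
	switch {
	case shardsStats:
		// Shard-level stats also carry the index-level stats.
		return base + "_all/_stats?level=shards", true
	case indicesStats:
		return base + "_all/_stats", true
	}
	return "", false
}

func main() {
	url, ok := indicesStatsURL("http://localhost:9200/", true, false, true)
	fmt.Println(url, ok)
}
```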

MatthewOHaraTR · 2017-05-04T14:46:37Z

Two issues with updating the README with all of the metrics. One - there are a ton of metrics already listed and a ton more with the new additions, making for a very lengthy read at best. Two - the metric specifics will vary based on the ES version. The current README originated 1.5 years ago before the ES2.2 release. It likely doesn't truly reflect what's in ES 2.3 exactly, much less what's in 2.4 or 5.1.

Pleasantly, Telegraf doesn't rely on the specific details, just some high level structure formats. So when ES changes things under the covers, whatever metrics are provided will simply and easily come out of the telegraf parser.

Keeping up with metric changes across versions is a bit tougher in the Grafana dashboard though, since it relies on the specific names.

To me, these issues would make the README too long, highly maintenance prone and inherently inaccurate as it would essentially be (re)documenting ES metrics here in Telegraf. Unfortunately, ES itself doesn't seem to document their metrics in this detail either on the pages I've seen (those linked in the README), other than the Cluster Health response.

I hesitate to put some sort of detailed metric summary in this document for these reasons.

In some regards, it seems better to remove the low level metric names, perhaps keeping the somewhat higher-level details of the data coming out for each option selected? I'd be happy to write something up along this line and send it to you if that seems feasible to you.

danielnelson · 2017-05-06T00:08:23Z

In general I don't like passthrough collection. The maintenance advantage is also a problem, because without a human to ensure they are correct the output quality suffers. Once a field has been written, the name or type cannot be changed without causing breaking changes. That said, sometimes it is appropriate and we should stick to the plugin's existing structure.

For documentation, at a bare minimum we need to document the measurements and tags. Don't remove the existing fields; I feel like they are still useful even if not exactly correct. Each measurement should have a description of what it represents, perhaps with a link to its upstream docs for fields. We could also add a note that indicates that the available fields depend on the ES version.

MatthewOHaraTR · 2017-05-08T17:05:33Z

@danielnelson. Yes, agreed - it's a tough balance between documentation and maintenance. Is this README heading in the right direction for you (other than being renamed to .txt to allow being sent this way)?
README.txt

danielnelson · 2017-05-09T21:09:30Z

Add in the exact measurement names and tags. It might help to explicitly name the source url for each measurement.

MatthewOHaraTR · 2017-05-11T13:03:05Z

@danielnelson The exact endpoints are mentioned near the beginning of the README. As far as the exact measurement names, please see the attached files containing information about shard stats request, and the node stats. Some of it is repetitive, but there are tons of metrics here.
nodestatslocal-51.txt
shardstats-51.txt

Are you sure you want all this information in there?

danielnelson · 2017-05-11T17:56:11Z

The measurement just refers to the name of the telegraf.Metric. You don't need to list all the fields.

MatthewOHaraTR · 2017-05-17T13:49:08Z

@danielnelson

Sorry for the delay. Here's the next attempt at the README.
README.txt

danielnelson · 2017-05-20T01:43:57Z

Looks good, can you add it to the PR? Remove any whitespace changes so it's easier to see the changes.

I would like both #2650 and #2711 to be completed as separate pull requests before we continue here.

MatthewOHaraTR · 2017-05-22T15:58:28Z

@danielnelson Apparently I just blew my branch up. Tons of git merge issues with files unrelated to my changes, and I'm unsure what triggered all of these. I don't have a lot of experience with GitHub - I'm wondering whether it might not be better to just start over with a new fork/branch, pull the 4-6 files that I've changed into it, and try another PR?

danielnelson · 2017-05-22T18:02:24Z

You can start over pretty easily with a force push and you won't have to reopen a PR. Goes something like this:

# rename this branch for backup:
git branch -m add_indices_and_shardstats_to_elasticsearch_metrics add_indices_and_shardstats_to_elasticsearch_metrics_backup
# start over:
git checkout -b add_indices_and_shardstats_to_elasticsearch_metrics master
# make changes
# force push to replace branch on remote:
git push origin add_indices_and_shardstats_to_elasticsearch_metrics -f

MatthewOHaraTR · 2017-05-23T19:41:01Z

@danielnelson This is what I tried - it's much like what you had. I still seem to have a problem with the checkin though.

# rename this branch for backup:
git branch -m add_indices_and_shardstats_to_elasticsearch_metrics add_indices_and_shardstats_to_elasticsearch_metrics_backup
# start over:
git checkout -B add_indices_and_shardstats_to_elasticsearch_metrics master
# make changes
# Do the commit:
git commit -a -m "add Indices and Shard Stats to ES input plugin"
# force push to replace branch on remote:
git push myFork add_indices_and_shardstats_to_elasticsearch_metrics -f

MatthewOHaraTR · 2017-05-30T21:39:35Z

@danielnelson This fork seems pretty hosed. I've tried a few more things - but it appears I have unresolved merge issues from 27 days ago that I can't untangle. Unless you have other suggestions/shortcuts, I don't see any way around deleting this fork and making a fresh new one - and then copying my changes into it, redoing the PR.

danielnelson · 2017-05-30T21:55:33Z

Okay

danielnelson · 2017-05-31T20:40:20Z

Continued on #2872

sparrc and others added 30 commits March 8, 2017 14:39

nats_consumer: buffer incoming messages

cdad241

fixes #1956

Update README.md (#1963)

d85e98e

Typo

fix leap_status value in chrony input plugin (#1983)

83e4831

Update etc/telegraf.conf

20611eb

Add release 1.2 section to changelog

741e25d

Use short commit in Makefile build

6e609a0

CircleCI script, do not explicitly set version tag

d4c46a4

Update gopsutil dependency

71fb9e7

primarily for a fix in Windows network counter getting code closes #1949

Update README.md (#1868)

929cc1c

I think this is a copy paste bug? ;-)

Use rfc3339 timestamps in telegraf log output

4bae29f

closes #1564 also add unit and benchmark tests

Fix up AWS plugin docs so they don't use single quotes. (#1991)

9e668a6

Also don't use named returns in fetchNamespaceMetrics since it's non-standard for the rest of the codebase.

Update etc/telegraf.conf

7ad07f1

Update docs on Cloudwatch. Set default period to 5m. (#2000)

3fde534

Documentation improvements

d086e05

- fully document aggregator and processor plugins - improve readme.md closes #1989

Fix single quote parsing of TOML durations

c83fdd6

closes #2023

Add udp_buffer_size option to udp_listener (#1883)

5a4e27e

* patching udp_listener for fun updating with errcode adding debug flags to temp msgs moving from debug to info * updating PR 1883 based on feedback

Cache and expire metrics for prometheus output (#2016)

88f3e8d

* Cache and expire metrics for prometheus output * Fix test * Use interval.Duration * Default prometheus expiration interval to 60s * Update changelog

changelog update

58e60c8

Add support to parse JSON array. (#1965)

8d79c7b

Added IopsInProgress to diskio stats (#2037)

0696085

* Export IopsInProgress * Export IopsInProgress * Export IopsInProgress

Update win_pref_counter to include Processor Queue Length in examples. (#2029)

35b68c3

Configurable RabbitMQ HTTP timeouts #1997 (#1998)

12cc43b

* [plugins] rabbitmq input plugin: add non default http timeouts * update CHANGELOG.md

Add Copy() function to Metric interface

78a9bf6

'discard' output plugin

a4d5557

Add benchmarks for metric parsing and creating

5c50afe

amqp precision is not used anymore

1b88698

Fix changelog for json parser. (#2100)

84055f6

vbeskrovnov reviewed Apr 13, 2017

danielnelson suggested changes Apr 18, 2017

Merge branch 'master' of https://github.com/influxdata/telegraf

1bbeee2

danielnelson mentioned this pull request May 3, 2017

Add missing Elasticsearch cluster & index health stats #2456

Closed

3 tasks

danielnelson added this to the 1.4.0 milestone May 3, 2017

danielnelson reviewed May 3, 2017

danielnelson added the area/elasticsearch label May 3, 2017

danielnelson mentioned this pull request May 8, 2017

[elastic plugin] cluster health index-level stats + index node stats are merged into one measurement #2711

Closed

add Indices and Shard Stats to ES input plugin

ebad558

danielnelson closed this May 31, 2017
