Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add indicesstats and shardstats to ES metrics #2518

Closed
wants to merge 197 commits into from
Closed

add indicesstats and shardstats to ES metrics #2518

wants to merge 197 commits into from

Conversation

MatthewOHaraTR
Copy link
Contributor

Required for all PRs:

  • CHANGELOG.md updated (we recommend not updating this until the PR has been approved by a maintainer)
  • [X ] Sign CLA (if not already signed)
  • README.md updated (if adding a new plugin)

@sparrc
Here's the (hopefully) rebased version of my ES plugin changes

sparrc and others added 30 commits March 8, 2017 14:39
primarily for a fix in Windows network counter getting code

closes #1949
I think this is a copy paste bug? ;-)
closes #1564

also add unit and benchmark tests
Also don't use named returns in fetchNamespaceMetrics since it's
non-standard for the rest of the codebase.
- fully document aggregator and processor plugins
- improve readme.md

closes #1989
* patching udp_listener for fun

updating with errcode

adding debug flags to temp msgs

moving from debug to info

* updating PR 1883 based on feedback
* Cache and expire metrics for prometheus output

* Fix test

* Use interval.Duration

* Default prometheus expiration interval to 60s

* Update changelog
)

* added connection Timeout parámeter, basic HTTP autentication and HTTP support with Sslskipverify option

* updated README.md

* added optional SSL config , changed timeout name and type , and other minor fixes

* added some code style improvements

* Update README.md
* Trim null characters in Value data format

Some producers (such as the paho embedded c mqtt client) add a null
character "\x00" to the end of a message.  The Value parser would fail on
any message from such a producer.

* Trim whitespace and null in all Value data formats

* No unnecessary reassignments in Value data format parser

* Update change log for Value data format fix
* Export IopsInProgress

* Export IopsInProgress

* Export IopsInProgress
* [plugins] rabbitmq input plugin: add non default http timeouts

* update CHANGELOG.md
The old gonuts fork has no License and has not seen any commits
differing from the original project, while the original has seen some
activity, even if low.

Having no license is a problem for distributors, as by default, such
projects are undistributable.
@MatthewOHaraTR
Copy link
Contributor Author

@danielnelson - I suspect you'll need to catch up on this, but there are a couple (possibly 3 now) updates to the elasticsearch plugin for consideration. Please let me know whether you need changes to mine before acceptance of the PR.

@MatthewOHaraTR
Copy link
Contributor Author

@danielnelson - Do you know when you'll have a chance to look at this?

@danielnelson
Copy link
Contributor

@mhohara Hoping to take a look at several ES tasks soon, but don't want to lock myself in with a date :)

e.isMaster = (id == e.catMasterResponseTokens[0])
}
// check for master
e.localNodeIsMaster = (id == e.masterNodeId)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this part of code executed in cycle for every node in cluster, so information only about last node saved in e.localNodeIsMaster. Because of this, information about count of master and data nodes not always writing in elasticsearch_clusterstats_nodes. I describe it in #2650. Please tell me, if I'm wrong.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to fix this issue before we can merge this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in response. My thought is that the master-only statistics related to this flag should be restricted to only work when local=true. Then only one node would be reported in the node-stats, and this flag would make sense in that context. This is actually the way we're configuring local here, which seems to work fine. If memory serves, I removed that restriction due to a comment against an earlier version. I'm thinking I should put that restriction back in. Would this approach work for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the real problem here is that since we run the gather functions concurrently on all servers there is a race condition as to which server sets the value on the struct. This is a current bug and should be resolved on a pull request separate from this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, although presumably that shouldn't block this PR. In some regards though, as these master-only requests should only be coming from the master node, it's really not necessary to run with local=false. There's no need to have all nodes get data from all other nodes - there's no control over who's master, and why get all the redundant information. Or maybe I just don't understand the use case for local=false.

Copy link
Contributor

@danielnelson danielnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't reviewed the entire PR. I have some concerns around maintaining backwards compatibility, we should only break it if we have a very good reason. Can you update the README with details about the new measurements and how they will look?

Also, #2650 looks like an important issue that we should fix before merging this change.


## Set shards_stats to true when you want to obtain shards stats from the Master node.
## If set, then indices_stats is considered true as they are also provided with shard stats.
shards_stats = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is named shards_stats here and in telegraf.conf but in elasticsearch.go it is indices_shards_stats.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I changed the elasticsearch.go to correspond with the others.

@@ -156,7 +156,6 @@ func (f *JSONFlattener) FullFlattenJSON(
}
case nil:
// ignored types
fmt.Println("json parser ignoring " + fieldname)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debug statement needs removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

@@ -17,71 +16,15 @@ import (
"strings"
)

// mask for masking username/password from error messages
var mask = regexp.MustCompile(`https?:\/\/\S+:\S+@`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Searching under Telegraf, I did not see anything using it, nor did I see something similar in other .go files. Did I miss something? I can certainly restore it if you'd like - let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - Cameron put in a change the same day which added this and its functionality. Presumably this would be fixed with a merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be best to rebase, we have also removed the errchan bits in favor of AddError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I saw that when I merged. There were only a few lines that were different so its not big deal. I can certainly submit these changes with my next commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed those changes when I did the merge. It was only a couple of lines, so I'll just include them in my next commit.

if err != nil {
return err
}
acc.AddFields("elasticsearch_clusterstats_"+p, f.Fields, tags, now)
acc.AddFields("elasticsearch_clusterstats_"+name, f.Fields, map[string]string{"name": ""}, now)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we adding a tag with no value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. This took me a while to figure out, and I'm not sure that my answer is correct. The old code had a structure to pull out the tags (node_name, cluster_name and status). However, I ran into a problem trying to represent these as Tags on a Grafana dashboard, but when I stopped making them tags, then they represented just fine (as Strings using single-stats). Its been too long now and I don't recall the specifics and the timing of this work. If you think it should be possible to represent them, as tags, using SingleStat - then I can certainly put them back as Tags and give it a try that way if you'd like. Also note that node_name went away between 2.3.3 and 2.4.2, so that part of the tags/structure would no longer be possible to define. Please let me know your thoughts.

NodeIP string `json:"ip"`
NodeName string `json:"node"`
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason these structs need to move? If not, please move them back so it is easier to see the differences and track changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catMaster struct was not used in the previous version, so it could be removed. I moved the other structs down into the functions which referenced them to make their structure more visible in the context in which they're being used (why jump up and down in the text to see what you're working on). I can put them back if you'd prefer and its the normal coding standard, but in my mind this is cleaner and provides better structure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets leave them up top if only to reduce the cognitive load reviewing the diffs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly.

@@ -309,46 +316,197 @@ func (e *Elasticsearch) gatherClusterHealth(url string, acc telegraf.Accumulator
"unassigned_shards": health.UnassignedShards,
}
acc.AddFields(
"elasticsearch_indices",
"elasticsearch_cluster_health_indices",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks compatibility, is it a required change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider this. However, this is the set of metrics for'clusterhealth' - calling it 'indices' was misleading and interfered with the naming for the new indices related information (elasticsearch_indicesstats_shards & elasticsearch_indicesstats_, elasticsearch_indicesstats_shards_primary, elasticsearch_indicesstats_shards_replica). It seemed better to rename it to indicate its true meaning more accurately and avoid the confusion of all the other indices types of metrics.

e.isMaster = (id == e.catMasterResponseTokens[0])
}
// check for master
e.localNodeIsMaster = (id == e.masterNodeId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to fix this issue before we can merge this PR.

@MatthewOHaraTR
Copy link
Contributor Author

@danielnelson Let me know what you think of my responses. I'm starting to work now on updating my grafana dashboard (and possibly this Telegraf plugin) to work with ES5.1.1. I may have more changes to submit when I'm done, or possibly questions on how to best address versioning changes in this environment.

@danielnelson danielnelson dismissed sparrc’s stale review May 3, 2017 19:04

Not required

@danielnelson
Copy link
Contributor

Could you update the README with the new measurements? The details of the code can be hammered out later but lets first try to come to agreement on the measurements, tags, fields.

@danielnelson danielnelson added this to the 1.4.0 milestone May 3, 2017
}

if e.IndicesStats && !e.ShardsStats && e.localNodeIsMaster {
e.gatherIndicesStats(s+"_all/_stats", acc, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does /_all/_stats compare to /_stats/ and can you also link to some docs?

Here are the docs I'm looking at for /_stats/:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable to think that /_stats is the same as /_all/_stats. The _all part just specifies all indices for the index-name portion so it returns all index level stats (which is what I wanted). However, /_stats indicates that it returns high level aggregation as well, so I'm not sure whether that might be different (and hence not part of the metrics that the plugin is now gathering).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is safe to assume that both endpoints return the same data, according to the description here: https://github.com/elastic/elasticsearch/blob/master/rest-api-spec/src/main/resources/rest-api-spec/api/indices.stats.json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lpic10 Thanks for the reference. It certainly clearly shows the '_all' does give all indices. But whether the underlying code does something different between _all/_stats and _stats isn't really clear in this. Like I said though, it seems likely. And the worst that could happen would be that extra metrics might be received - which wouldn't be a bad thing. Other than that, is there a reason to prefer just _stats over _all/_stats? I can certainly make the change if desired.

@danielnelson I forgot to mention, the indices-stats was already linked in the Readme.

@MatthewOHaraTR
Copy link
Contributor Author

Two issues with updating the README with all of the metrics. One - there are a ton of metrics already listed and a ton more with the new additions, making for a very lengthy read at best. Two - the metric specifics will vary based on the ES version. The current README originated 1.5 years ago before the ES2.2 release. It likely doesn't truly reflect whats in ES 2.3 exactly, much less whats in 2.4 or 5.1.

Pleasantly, Telegraf doesn't rely on the specific details, just some high level structure formats. So when ES changes things under the covers, whatever metrics are provided will simply and easily come out of the telegraf parser.

Keeping up with the changes with different versions metrics is a bit tougher in the Grafana dashboard seems a bit tougher though, since it relies on the specific names.

To me, these issues would make the README too long, highly maintenance prone and inherently inaccurate as it would essentially be (re)documenting ES metrics here in Telegraf. Unfortunately, ES itself doesn't seem to document their metrics in this detail either on the pages I've seen (those linked in the README), other than the Cluster Heath response.

I hesitate to put some sort of detailed metric summary in this document for these reasons.

In some regards, it seems better to remove the low level metric names, perhaps keeping the somewhat higher-level details of the data coming out for each option selected? I'd be happy to write something up along this line and send it to you if that seems feasible to you.

@danielnelson
Copy link
Contributor

In general I don't like passthrough collection, the maintenance advantage is also a problem because without a human to ensure they are correct the output quality suffers. Once a field has been written the name or type cannot be changed without causing breaking changes. That said, sometimes it is appropriate and we should stick to the plugins existing structure.

For documentation at a bare minimum we need to document the measurements and tags. Don't remove the existing fields, I feel like they are still useful even if not exactly correct. Each measurement should have a description of what it represents, perhaps a link to it's upstream docs for fields. We could also add a note that indicates that the available fields depends on the ES version.

@MatthewOHaraTR
Copy link
Contributor Author

@danielnelson. Yes, agreed - its a tough balance between documentation and maintenance. Is this README heading in the right direction for you (other than being renamed to .txt to allow being sent this way)?
README.txt

@danielnelson
Copy link
Contributor

Add in the exact measurement names and tags. It might help to explicitly name the source url for each measurement.

@MatthewOHaraTR
Copy link
Contributor Author

@danielnelson The exact endpoints are mentioned near the beginning of the README. As far as the exact measurement names, please see the attached files containing information about shard stats request, and the node stats. Some of it is repetitive, but there are tons of metrics here.
nodestatslocal-51.txt
shardstats-51.txt

Are you sure you want all this information in there?

@danielnelson
Copy link
Contributor

The measurement just refers to the name of the telegraf.Metric. You don't need to list all the fields.

@MatthewOHaraTR
Copy link
Contributor Author

MatthewOHaraTR commented May 17, 2017

@danielnelson

Sorry for the delay. Here's the next attempt at the README.
README.txt

@danielnelson
Copy link
Contributor

Looks good, can you add it to the PR? Remove any whitespace changes so its easier to see the changes.

I would like both #2650 and #2711 to be completed as separate pull requests before we continue here.

@MatthewOHaraTR
Copy link
Contributor Author

@danielnelson Apparently I just blew my branch up. Tons of git merge issues with files unrelated to my changes, I'm unsure what triggered all of these. I don't have alot of experience with github - I'm wondering whether it might not be better to just start start over with a new fork/branch and pull the 4-6 files that I've changed into it & try another PR?

@danielnelson
Copy link
Contributor

You can start over pretty easily with a force push and you won't have to reopen a PR. Goes something like this:

# rename this branch for backup:
git branch -m add_indices_and_shardstats_to_elasticsearch_metrics add_indices_and_shardstats_to_elasticsearch_metrics_backup
# start over:
git checkout add_indices_and_shardstats_to_elasticsearch_metrics master
# make changes
# force push to replace branch on remote:
git push origin add_indices_and_shardstats_to_elasticsearch_metrics -f

@MatthewOHaraTR
Copy link
Contributor Author

@danielnelson This is what I tried - its much like what you had. I still seem to have a problem with the checkin though.

rename this branch for backup:

git branch -m add_indices_and_shardstats_to_elasticsearch_metrics add_indices_and_shardstats_to_elasticsearch_metrics_backup

start over:

git checkout –B add_indices_and_shardstats_to_elasticsearch_metrics master

make changes

Do the commit:

git commit –a –m “add Indices and Shard Stats to ES input plugin”

force push to replace branch on remote:

git push myFork add_indices_and_shardstats_to_elasticsearch_metrics -f

@MatthewOHaraTR
Copy link
Contributor Author

@danielnelson This fork seems pretty hosed. I've tried a few more things - but it appears I have unresolved merge issues from 27 days ago that I can't untangle. Unless you have other suggestions/shortcuts, I don't see any way around deleting this fork and making a fresh new one - and then copying my changes into it, redoing the PR.

@danielnelson
Copy link
Contributor

Okay

@danielnelson
Copy link
Contributor

Continued on #2872

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.