-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add indicesstats and shardstats to ES metrics #2518
add indicesstats and shardstats to ES metrics #2518
Conversation
primarily for a fix in Windows network counter getting code closes #1949
I think this is a copy paste bug? ;-)
closes #1564 also add unit and benchmark tests
Also don't use named returns in fetchNamespaceMetrics since it's non-standard for the rest of the codebase.
- fully document aggregator and processor plugins - improve readme.md closes #1989
* patching udp_listener for fun updating with errcode adding debug flags to temp msgs moving from debug to info * updating PR 1883 based on feedback
* Cache and expire metrics for prometheus output * Fix test * Use interval.Duration * Default prometheus expiration interval to 60s * Update changelog
* Trim null characters in Value data format Some producers (such as the paho embedded c mqtt client) add a null character "\x00" to the end of a message. The Value parser would fail on any message from such a producer. * Trim whitespace and null in all Value data formats * No unnecessary reassignments in Value data format parser * Update change log for Value data format fix
* Export IopsInProgress * Export IopsInProgress * Export IopsInProgress
The old gonuts fork has no License and has not seen any commits differing from the original project, while the original has seen some activity, even if low. Having no license is a problem for distributors, as by default, such projects are undistributable.
@danielnelson - I suspect you'll need to catch up on this, but there are a couple (possibly 3 now) updates to the elasticsearch plugin for consideration. Please let me know whether you need changes to mine before acceptance of the PR. |
@danielnelson - Do you know when you'll have a chance to look at this? |
@mhohara Hoping to take a look at several ES tasks soon, but don't want to lock myself in with a date :) |
e.isMaster = (id == e.catMasterResponseTokens[0]) | ||
} | ||
// check for master | ||
e.localNodeIsMaster = (id == e.masterNodeId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, this part of code executed in cycle for every node in cluster, so information only about last node saved in e.localNodeIsMaster
. Because of this, information about count of master and data nodes not always writing in elasticsearch_clusterstats_nodes
. I describe it in #2650. Please tell me, if I'm wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need to fix this issue before we can merge this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay in response. My thought is that the master-only statistics related to this flag should be restricted to only work when local=true. Then only one node would be reported in the node-stats, and this flag would make sense in that context. This is actually the way we're configuring local here, which seems to work fine. If memory serves, I removed that restriction due to a comment against an earlier version. I'm thinking I should put that restriction back in. Would this approach work for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the real problem here is that since we run the gather functions concurrently on all servers there is a race condition as to which server sets the value on the struct. This is a current bug and should be resolved on a pull request separate from this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, although presumably that shouldn't block this PR. In some regards though, as these master-only requests should only be coming from the master node, it's really not necessary to run with local=false. There's no need to have all nodes get data from all other nodes - there's no control over who's master, and why get all the redundant information. Or maybe I just don't understand the use case for local=false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't reviewed the entire PR. I have some concerns around maintaining backwards compatibility, we should only break it if we have a very good reason. Can you update the README with details about the new measurements and how they will look?
Also, #2650 looks like an important issue that we should fix before merging this change.
|
||
## Set shards_stats to true when you want to obtain shards stats from the Master node. | ||
## If set, then indices_stats is considered true as they are also provided with shard stats. | ||
shards_stats = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is named shards_stats
here and in telegraf.conf but in elasticsearch.go it is indices_shards_stats
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I changed the elasticsearch.go to correspond with the others.
@@ -156,7 +156,6 @@ func (f *JSONFlattener) FullFlattenJSON( | |||
} | |||
case nil: | |||
// ignored types | |||
fmt.Println("json parser ignoring " + fieldname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This debug statement needs removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
@@ -17,71 +16,15 @@ import ( | |||
"strings" | |||
) | |||
|
|||
// mask for masking username/password from error messages | |||
var mask = regexp.MustCompile(`https?:\/\/\S+:\S+@`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Searching under Telegraf, I did not see anything using it, nor did I see something similar in other .go files. Did I miss something? I can certainly restore it if you'd like - let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah - Cameron put in a change the same day which added this and its functionality. Presumably this would be fixed with a merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be best to rebase, we have also removed the errchan bits in favor of AddError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I saw that when I merged. There were only a few lines that were different so its not big deal. I can certainly submit these changes with my next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed those changes when I did the merge. It was only a couple of lines, so I'll just include them in my next commit.
if err != nil { | ||
return err | ||
} | ||
acc.AddFields("elasticsearch_clusterstats_"+p, f.Fields, tags, now) | ||
acc.AddFields("elasticsearch_clusterstats_"+name, f.Fields, map[string]string{"name": ""}, now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we adding a tag with no value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. This took me a while to figure out, and I'm not sure that my answer is correct. The old code had a structure to pull out the tags (node_name, cluster_name and status). However, I ran into a problem trying to represent these as Tags on a Grafana dashboard, but when I stopped making them tags, then they represented just fine (as Strings using single-stats). Its been too long now and I don't recall the specifics and the timing of this work. If you think it should be possible to represent them, as tags, using SingleStat - then I can certainly put them back as Tags and give it a try that way if you'd like. Also note that node_name went away between 2.3.3 and 2.4.2, so that part of the tags/structure would no longer be possible to define. Please let me know your thoughts.
NodeIP string `json:"ip"` | ||
NodeName string `json:"node"` | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason these structs need to move? If not, please move them back so it is easier to see the differences and track changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The catMaster struct was not used in the previous version, so it could be removed. I moved the other structs down into the functions which referenced them to make their structure more visible in the context in which they're being used (why jump up and down in the text to see what you're working on). I can put them back if you'd prefer and its the normal coding standard, but in my mind this is cleaner and provides better structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets leave them up top if only to reduce the cognitive load reviewing the diffs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly.
@@ -309,46 +316,197 @@ func (e *Elasticsearch) gatherClusterHealth(url string, acc telegraf.Accumulator | |||
"unassigned_shards": health.UnassignedShards, | |||
} | |||
acc.AddFields( | |||
"elasticsearch_indices", | |||
"elasticsearch_cluster_health_indices", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks compatibility, is it a required change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider this. However, this is the set of metrics for'clusterhealth' - calling it 'indices' was misleading and interfered with the naming for the new indices related information (elasticsearch_indicesstats_shards & elasticsearch_indicesstats_, elasticsearch_indicesstats_shards_primary, elasticsearch_indicesstats_shards_replica). It seemed better to rename it to indicate its true meaning more accurately and avoid the confusion of all the other indices types of metrics.
e.isMaster = (id == e.catMasterResponseTokens[0]) | ||
} | ||
// check for master | ||
e.localNodeIsMaster = (id == e.masterNodeId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we need to fix this issue before we can merge this PR.
@danielnelson Let me know what you think of my responses. I'm starting to work now on updating my grafana dashboard (and possibly this Telegraf plugin) to work with ES5.1.1. I may have more changes to submit when I'm done, or possibly questions on how to best address versioning changes in this environment. |
Could you update the README with the new measurements? The details of the code can be hammered out later but lets first try to come to agreement on the measurements, tags, fields. |
} | ||
|
||
if e.IndicesStats && !e.ShardsStats && e.localNodeIsMaster { | ||
e.gatherIndicesStats(s+"_all/_stats", acc, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does /_all/_stats
compare to /_stats/
and can you also link to some docs?
Here are the docs I'm looking at for /_stats/
:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems reasonable to think that /_stats is the same as /_all/_stats. The _all part just specifies all indices for the index-name portion so it returns all index level stats (which is what I wanted). However, /_stats indicates that it returns high level aggregation as well, so I'm not sure whether that might be different (and hence not part of the metrics that the plugin is now gathering).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is safe to assume that both endpoints return the same data, according to the description here: https://github.com/elastic/elasticsearch/blob/master/rest-api-spec/src/main/resources/rest-api-spec/api/indices.stats.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lpic10 Thanks for the reference. It certainly clearly shows the '_all' does give all indices. But whether the underlying code does something different between _all/_stats and _stats isn't really clear in this. Like I said though, it seems likely. And the worst that could happen would be that extra metrics might be received - which wouldn't be a bad thing. Other than that, is there a reason to prefer just _stats over _all/_stats? I can certainly make the change if desired.
@danielnelson I forgot to mention, the indices-stats was already linked in the Readme.
Two issues with updating the README with all of the metrics. One - there are a ton of metrics already listed and a ton more with the new additions, making for a very lengthy read at best. Two - the metric specifics will vary based on the ES version. The current README originated 1.5 years ago before the ES2.2 release. It likely doesn't truly reflect whats in ES 2.3 exactly, much less whats in 2.4 or 5.1. Pleasantly, Telegraf doesn't rely on the specific details, just some high level structure formats. So when ES changes things under the covers, whatever metrics are provided will simply and easily come out of the telegraf parser. Keeping up with the changes with different versions metrics is a bit tougher in the Grafana dashboard seems a bit tougher though, since it relies on the specific names. To me, these issues would make the README too long, highly maintenance prone and inherently inaccurate as it would essentially be (re)documenting ES metrics here in Telegraf. Unfortunately, ES itself doesn't seem to document their metrics in this detail either on the pages I've seen (those linked in the README), other than the Cluster Heath response. I hesitate to put some sort of detailed metric summary in this document for these reasons. In some regards, it seems better to remove the low level metric names, perhaps keeping the somewhat higher-level details of the data coming out for each option selected? I'd be happy to write something up along this line and send it to you if that seems feasible to you. |
In general I don't like passthrough collection, the maintenance advantage is also a problem because without a human to ensure they are correct the output quality suffers. Once a field has been written the name or type cannot be changed without causing breaking changes. That said, sometimes it is appropriate and we should stick to the plugins existing structure. For documentation at a bare minimum we need to document the measurements and tags. Don't remove the existing fields, I feel like they are still useful even if not exactly correct. Each measurement should have a description of what it represents, perhaps a link to it's upstream docs for fields. We could also add a note that indicates that the available fields depends on the ES version. |
@danielnelson. Yes, agreed - its a tough balance between documentation and maintenance. Is this README heading in the right direction for you (other than being renamed to .txt to allow being sent this way)? |
Add in the exact measurement names and tags. It might help to explicitly name the source url for each measurement. |
@danielnelson The exact endpoints are mentioned near the beginning of the README. As far as the exact measurement names, please see the attached files containing information about shard stats request, and the node stats. Some of it is repetitive, but there are tons of metrics here. Are you sure you want all this information in there? |
The measurement just refers to the name of the |
Sorry for the delay. Here's the next attempt at the README. |
@danielnelson Apparently I just blew my branch up. Tons of git merge issues with files unrelated to my changes, I'm unsure what triggered all of these. I don't have alot of experience with github - I'm wondering whether it might not be better to just start start over with a new fork/branch and pull the 4-6 files that I've changed into it & try another PR? |
You can start over pretty easily with a force push and you won't have to reopen a PR. Goes something like this:
|
@danielnelson This is what I tried - its much like what you had. I still seem to have a problem with the checkin though. rename this branch for backup:git branch -m add_indices_and_shardstats_to_elasticsearch_metrics add_indices_and_shardstats_to_elasticsearch_metrics_backup start over:git checkout –B add_indices_and_shardstats_to_elasticsearch_metrics master make changesDo the commit:git commit –a –m “add Indices and Shard Stats to ES input plugin” force push to replace branch on remote:git push myFork add_indices_and_shardstats_to_elasticsearch_metrics -f |
@danielnelson This fork seems pretty hosed. I've tried a few more things - but it appears I have unresolved merge issues from 27 days ago that I can't untangle. Unless you have other suggestions/shortcuts, I don't see any way around deleting this fork and making a fresh new one - and then copying my changes into it, redoing the PR. |
Okay |
Continued on #2872 |
Required for all PRs:
@sparrc
Here's the (hopefully) rebased version of my ES plugin changes