[Monitoring] Make unassigned replica shard documents unique #91153
Conversation
@jbaiera I found this bug in the External collection module but noticed that the same issue exists in the Internal collection code, so this is a (potentially naive) attempt to fix it. Can you give me some feedback on whether this is a reasonable direction, whether there are any steps I've missed, and how to tag this PR properly before opening it for review? Thanks!

Pinging @elastic/es-data-management (Team:Data Management)
## Summary

This PR depends on the Metricbeat changes introduced in elastic/beats#33457 or the Internal collection changes introduced in elastic/elasticsearch#91153.

That PR fixes a bug so that Metricbeat properly reports on all the shards in an index, but our shard legend had the same kind of bug: it only displayed the first replica, because it didn't account for the replica "id" in the function that tries to deduplicate the results.

Since the above PR ensures we report each shard with a unique document ID, we change the Kibana code to use the document ID instead of trying to regenerate a unique ID.

### How to test

Get the changes:

```
git fetch [email protected]:miltonhultgren/kibana.git sm-shard-allocations:sm-shard-allocations && git switch sm-shard-allocations
```

Start Elasticsearch and create an index like this:

```
PUT /some-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 3
    }
  }
}
```

Run Metricbeat with `xpack.enabled: true`.

Visit the Index details page for `some-index` in the Stack Monitoring app and verify that on a single-node cluster you have 12 shards, 9 of which are unassigned.

<img width="753" alt="Screenshot 2022-10-25 at 17 19 25" src="https://user-images.githubusercontent.com/2564140/197819368-8ea45e1c-7472-4e15-9267-3b5d73378f2a.png">

Co-authored-by: Kibana Machine <[email protected]>
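The Kibana-side change can be modeled in a few lines. This is a hedged sketch with hypothetical field names, not the actual Kibana code: deduplicating by a regenerated key that omits the replica counter collapses all replicas of a shard into one entry, while keying on the document `_id` keeps every allocation.

```python
# Three allocations of shard 0: one primary and two replicas.
docs = [
    {"_id": "uuid:idx:s0:r0", "index": "some-index", "shard": 0, "primary": True},
    {"_id": "uuid:idx:s0:r1", "index": "some-index", "shard": 0, "primary": False},
    {"_id": "uuid:idx:s0:r2", "index": "some-index", "shard": 0, "primary": False},
]

# Old approach: the regenerated key ignores the replica counter,
# so the two replicas collide and only one survives.
old = {(d["index"], d["shard"], d["primary"]): d for d in docs}
assert len(old) == 2  # primary + only one of the two replicas

# New approach: the document _id is already unique per allocation.
new = {d["_id"]: d for d in docs}
assert len(new) == 3  # every allocation kept
```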
…143963) (#145520)

# Backport

This will backport the following commits from `main` to `8.6`:
- [[Stack Monitoring] Use doc ID to deduplicate shard allocations (#143963)](#143963)

### Questions ?
Please refer to the [Backport tool documentation](https://github.com/sqren/backport)

Co-authored-by: Milton Hultgren <[email protected]>
Thanks for opening this and for your patience waiting for the review.
I left some comments on the looping logic. Interacting with the routing table code is like aiming a crossbow at your feet, so I suggested some safer logic to use when collecting the stats that will hopefully spare all of our toes undue harm.
...ng/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardMonitoringDoc.java
```java
results.add(new ShardMonitoringDoc(clusterUuid, timestamp, interval, shardNode, shard, stateUUID));
for (String index : indices) {
    int shardCount = 0;
    for (ShardRouting shard : routingTable.allShards(index)) {
```
It looks like we removed the null check above because we are getting the indices from the routing table by name now.
From my understanding of the code, the indices we are monitoring can be specified via a cluster setting. This means that they are under no obligation to actually exist in the cluster. If an index name that doesn't exist is present in that setting then this call to allShards will throw an IndexNotFoundException.
I think this new code is generally better since we don't need to iterate over every shard in the cluster, but we should avoid the exception if possible.
I think that is doable by calling `routingTable.index(index)` and checking if the returned routing table is null. If it is null, we should skip that index since it does not exist. If the routing table exists, then we can call the `allShards` method and collect the stats.
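A minimal sketch of the suggested guard, modeled in Python with a plain dict standing in for the routing table (the names here are illustrative, not the Elasticsearch API): monitored index names come from a cluster setting and may not exist, so look each index up first and skip on a miss instead of letting `allShards` throw `IndexNotFoundException`.

```python
# Hypothetical model: name -> shard routings for indices that exist.
routing_table = {"some-index": ["s0-primary", "s0-replica1"]}

def collect(monitored_indices, routing_table):
    docs = []
    for name in monitored_indices:
        index_table = routing_table.get(name)  # analogous to routingTable.index(name)
        if index_table is None:
            # Index named in the setting but absent from the cluster: skip, don't throw.
            continue
        docs.extend(index_table)
    return docs

# "missing-index" is configured but doesn't exist; it is skipped silently.
result = collect(["some-index", "missing-index"], routing_table)
assert result == ["s0-primary", "s0-replica1"]
```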
One issue I faced while working on the feedback is that the list of indices to monitor can include glob patterns as well.
Is there an idiomatic way in the ES code base to expand an index pattern (like `metricbeat-*`) to a list of indices that match that pattern based on the current routing table? I wrote some hand-rolled code for doing this but I suspect this is a more common operation.
Were these patterns supported in the logic before? If they were not then I think we should just spin that out into a separate bug fix after this change goes in. If they were, then can we just use the existing logic?
They were supported, since we looped over every index and used `Regex.simpleMatch` to see if a pattern matched the index. In the code I wrote to expand patterns I use the same method, but I'm still able to avoid checking every index.
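For illustration, the matching step can be modeled in Python with `fnmatch` standing in for `Regex.simpleMatch` (a rough stand-in only; the two glob dialects are not identical): expand the configured patterns against the index names actually present in the routing table.

```python
from fnmatch import fnmatch

# Index names present in the cluster's routing table (hypothetical).
routing_table_indices = ["metricbeat-8.5.0", "metricbeat-8.6.0", "some-index"]
# Monitored indices from the cluster setting; may contain globs and
# names that don't exist.
monitored = ["metricbeat-*", "missing-index"]

# Keep only routing-table indices that match at least one pattern.
matched = [name for name in routing_table_indices
           if any(fnmatch(name, pattern) for pattern in monitored)]
assert matched == ["metricbeat-8.5.0", "metricbeat-8.6.0"]
```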
.../src/test/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollectorTests.java
```java
int shardCount = 0;
for (ShardRouting shard : routingTable.allShards(index)) {
    if (shard.primary()) {
        shardCount = 0;
```
This looping logic worries me.
Based on the routing table code, there's nothing to suggest that shards returned from it will always be ordered such that the primary is returned before the replicas. Instead, it looks like the order of the shards is based entirely on how the table is created, and that is dictated by code that lives far away from here.
If the ordering isn't correct, then this logic could lead to ID collisions for shards. If the returned shards are ordered replica1, replica1, primary1, replica1, then you'll only get three distinct IDs for those four shard copies: the shard routings in the second and fourth slots will have the same counter value once the primary in slot three resets it. Furthermore, the next shard in the list will have its counters offset unless its primary is sorted first.
Since we'll need to get the index routing table above to make sure it's not null before using it, let's build further on that. Let's get the number of shards from the index routing table with `routingTable.size()`. Then let's do a for loop bound by the shard count, asking for the shard routing table for each shard number with `routingTable.shard(idx)`. After that, we can ask each shard routing table for its primary shard and iterate over any replicas it has with safer counting logic. What do you think?
This makes a lot of sense to me.
I just wonder about one thing: the method `IndexShardRoutingTable.replicaShards` returns a list of all replica shards, but isn't the order of those replica shards also up to whatever code instantiates the `IndexShardRoutingTable`, so they might also have a random ordering?
That's likely a much smaller problem than the skipping index code I had before with the primaries marking the reset point.
I'm not sure if it would be a big problem if the order of the replica shards changed in the monitoring docs?
@jbaiera Thanks for the review, I'm going to address your comments ASAP!
```java
protected Collection<MonitoringDoc> doCollect(final MonitoringDoc.Node node, final long interval, ...
    if (clusterState != null) {
```
This file has a lot of these nested levels which we could get rid of by flipping the guard statements, but I'm not sure what the idiomatic style is here.
It is pretty deeply nested, but I don't think it's so bad that it needs a refactoring right now. A lot of this code is deprecated and on a course to being a distant memory when the internal monitoring feature is removed.
```java
if (document.getShardRouting().primary()) {
    assertThat(document.getId(), equalTo(ShardMonitoringDoc.id(stateUUID, document.getShardRouting(), 0)));
} else {
    assertThat(document.getId(), matchesPattern("^[\\w-]+:(_current|_na)+:_index:s\\d:r[1-9]+[0-9]*"));
```
At this point I don't know which index is being used for which monitoring document, so I'm falling back to just checking the format and that there isn't any replica `r0` (since the 0 would be the primary).
```java
private static RoutingTable mockRoutingTable() {
```
The test setup here is starting to feel a bit messy, not sure what can be done about it.
@jbaiera I think I've addressed the comments you left. I'm afraid I won't have much more time to invest in this issue, so I hope this is good enough or close to good enough!
...oring/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollector.java
...oring/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollector.java
@miltonhultgren Sorry for the review delay. This is very close indeed! Just a couple of small fixes and it should be good to go. If you find you aren't able to continue carrying it forward, you can set the PR to allow maintainers to make edits and I can finish the cleanup! Thanks again for iterating on it!

@jbaiera Thanks for taking another look, I pushed a commit based on the last two comments!
LGTM! Thanks so much for iterating on this, especially over the holidays where feedback is easily delayed.
…91153)

* [Monitoring] Make unassigned replica shard documents unique
* Fix failing tests
* Use IndexRoutingTable/IndexShardRoutingTable to resolve shard order
* Use Random in test when shuffling
* Use Set to avoid duplicates and inline variable
## Summary

In the `shard` data set, we want to report one document per shard allocation status. If you have an index with 3 shards (primaries) and 3 replicas (per primary, 9 total), you should have 12 documents that describe whether the shards are assigned or not.

In `main` we grab all shard routings from the routing table of the cluster state and generate a document ID based on the state UUID, assigned node, index name, shard number and type (primary, replica), but this doesn't capture that there might be multiple replicas for a given primary shard.
In the 3 replica example above, we end up with only 2 documents, one for the primary and one for the first replica found in the routing table.
This PR adds a counter to the ID if it's a replica, making sure that we generate a document for each replica of the shard.
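As a sketch, the fixed ID scheme can be modeled like this (hypothetical helper names; the ID format mirrors the `uuid:node:index:sN:rM` pattern asserted in the tests, with `r0` for the primary): every replica gets its own counter value, so a 3-shard, 3-replica index yields 12 distinct documents.

```python
def shard_doc_id(state_uuid, node, index, shard_num, replica_num):
    # replica_num is 0 for the primary, 1..N for replicas.
    return f"{state_uuid}:{node}:{index}:s{shard_num}:r{replica_num}"

def ids_for_index(state_uuid, index, num_shards, num_replicas):
    ids = []
    for shard_num in range(num_shards):
        ids.append(shard_doc_id(state_uuid, "_na", index, shard_num, 0))  # primary
        for replica_num in range(1, num_replicas + 1):  # one doc per replica copy
            ids.append(shard_doc_id(state_uuid, "_na", index, shard_num, replica_num))
    return ids

ids = ids_for_index("state-uuid", "some-index", num_shards=3, num_replicas=3)
assert len(ids) == 12 and len(set(ids)) == 12  # 3 primaries + 9 replicas, all unique
```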
### How to test

Get the changes:

```
git fetch [email protected]:miltonhultgren/elasticsearch.git sm-shard-allocations:sm-shard-allocations && git switch sm-shard-allocations
```

Start Elasticsearch with Internal collection enabled:

```
./gradlew run -Drun.license_type=trial -Dtests.es.xpack.monitoring.collection.enabled=true -Dtest.es.xpack.monitoring.exporters.id0.type=local
```
Create an index like this:
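The index settings are the same as in the related Kibana PR description above:

```
PUT /some-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 3
    }
  }
}
```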
Run the following query to verify that you get 12 documents back, 3 for the assigned primaries and 9 for the unassigned replicas:
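The original query wasn't captured on this page; one possible check, assuming internal collection writes shard documents with a `type` of `shards` into the `.monitoring-es-*` indices:

```
GET .monitoring-es-*/_search
{
  "size": 20,
  "query": {
    "term": { "type": "shards" }
  }
}
```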
Related External collection (Metricbeat/Agent) PR