[Monitoring] Make unassigned replica shard documents unique #91153

miltonhultgren · 2022-10-27T11:42:10Z

Summary

In the shard data set, we want to report one document per shard allocation status. If you have an index with 3 shards (primaries) and 3 replicas (per primary, 9 total) you should have 12 documents that describe if the shards are assigned or not.

In main we grab all shard routings from the routing table of the cluster state, and generate a document ID based on the
state UUID, assigned node, index name, shard number and type (primary, replica) but this doesn't capture that there might be multiple replicas for a given primary shard.

In the 3 replica example above, we end up with only 2 documents, one for the primary and one for the first replica found in the routing table.

This PR adds a counter to the ID if it's a replica, making sure that we generate a document for each replica of the shard.

How to test

Get the changes:
git fetch [email protected]:miltonhultgren/elasticsearch.git sm-shard-allocations:sm-shard-allocations && git switch sm-shard-allocations

Start Elasticsearch with Internal collection enabled:
./gradlew run -Drun.license_type=trial -Dtests.es.xpack.monitoring.collection.enabled=true -Dtest.es.xpack.monitoring.exporters.id0.type=local

Create an index like this:

PUT /some-index
{
  "settings": {
    "index": {
      "number_of_shards": 3,  
      "number_of_replicas": 3
    }
  }
}

Run the following query to verify that you get 12 documents back, 3 for the assigned primaries and 9 for the unassigned replicas:

GET .monitoring-*/_search
{
  "size": 20,
  "_source": [
    "_id"
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "type": "shards"
                }
              }
            ]
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "shard.index": "some-index"
                }
              }
            ]
          }
        }
      ]
    }
  }
}

Such as:

    [
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:YehnOz5cS2StYQ5C8ssG1Q:some-index:s0:p",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s0:r1",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s0:r2",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s0:r3",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:YehnOz5cS2StYQ5C8ssG1Q:some-index:s1:p",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s1:r1",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s1:r2",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s1:r3",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:YehnOz5cS2StYQ5C8ssG1Q:some-index:s2:p",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s2:r1",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s2:r2",
        "_score": 0,
        "_source": {}
      },
      {
        "_index": ".monitoring-es-7-2022.10.27",
        "_id": "TyApCSpKTZWwFWCqjmTxRg:_na:some-index:s2:r3",
        "_score": 0,
        "_source": {}
      }
    ]

Related External collection (Metricbeat/Agent) PR

miltonhultgren · 2022-10-27T11:43:55Z

@jbaiera I found this bug in the External collection module but noticed that the same issue exists in the Internal collection code so this is a (potentially naive) attempt to fix it.

Can you give me some feedback on if this is a reasonable direction, if there are any steps I've missed and how to tag this PR properly before opening it for review? Thanks!

…-allocations

elasticsearchmachine · 2022-11-10T16:58:50Z

Pinging @elastic/es-data-management (Team:Data Management)

## Summary This PR depends on Metricbeat changes introduced in this PR elastic/beats#33457 or the Internal collection changes introduced in elastic/elasticsearch#91153. The above PR fixes a bug that makes Metricbeat properly report on all the shards in an index, but our shard legend had the same kind of bug, it only displayed the first replica because it didn't account for the replica "id" in the function that tries to deduplicate the results. Since the above PR ensures we report each shard with a unique document ID, we change the Kibana code to use the document ID instead of trying to regenerate a unique ID. ### How to test Get the changes: `git fetch [email protected]:miltonhultgren/kibana.git sm-shard-allocations:sm-shard-allocations && git switch sm-shard-allocations` Start Elasticsearch, create an index like this: ``` PUT /some-index { "settings": { "index": { "number_of_shards": 3, "number_of_replicas": 3 } } } ``` Run Metricbeat with `xpack.enabled: true`. Visit the Index details page for `some-index` in the Stack Monitoring app and verify that on a single node cluster you have 12 shards, 9 of which are unassigned. <img width="753" alt="Screenshot 2022-10-25 at 17 19 25" src="https://user-images.githubusercontent.com/2564140/197819368-8ea45e1c-7472-4e15-9267-3b5d73378f2a.png"> Co-authored-by: Kibana Machine <[email protected]>

…ic#143963) ## Summary This PR depends on Metricbeat changes introduced in this PR elastic/beats#33457 or the Internal collection changes introduced in elastic/elasticsearch#91153. The above PR fixes a bug that makes Metricbeat properly report on all the shards in an index, but our shard legend had the same kind of bug, it only displayed the first replica because it didn't account for the replica "id" in the function that tries to deduplicate the results. Since the above PR ensures we report each shard with a unique document ID, we change the Kibana code to use the document ID instead of trying to regenerate a unique ID. ### How to test Get the changes: `git fetch [email protected]:miltonhultgren/kibana.git sm-shard-allocations:sm-shard-allocations && git switch sm-shard-allocations` Start Elasticsearch, create an index like this: ``` PUT /some-index { "settings": { "index": { "number_of_shards": 3, "number_of_replicas": 3 } } } ``` Run Metricbeat with `xpack.enabled: true`. Visit the Index details page for `some-index` in the Stack Monitoring app and verify that on a single node cluster you have 12 shards, 9 of which are unassigned. <img width="753" alt="Screenshot 2022-10-25 at 17 19 25" src="https://user-images.githubusercontent.com/2564140/197819368-8ea45e1c-7472-4e15-9267-3b5d73378f2a.png"> Co-authored-by: Kibana Machine <[email protected]> (cherry picked from commit b7debc8)

…143963) (#145520) # Backport This will backport the following commits from `main` to `8.6`: - [[Stack Monitoring] Use doc ID to deduplicate shard allocations (#143963)](#143963)  ### Questions ? Please refer to the [Backport tool documentation](https://github.com/sqren/backport)  Co-authored-by: Milton Hultgren <[email protected]>

jbaiera

Thanks for opening this and for you patience on waiting for the review.

I left some comments on the looping logic. Interacting with the routing table code is like aiming a crossbow at your feet, so I suggested some safer logic to use when collecting the stats that will hopefully spare all of our toes undo harm.

...ng/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardMonitoringDoc.java

jbaiera · 2022-11-17T19:38:31Z

...oring/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollector.java

-                            results.add(new ShardMonitoringDoc(clusterUuid, timestamp, interval, shardNode, shard, stateUUID));
+                for (String index : indices) {
+                    int shardCount = 0;
+                    for (ShardRouting shard : routingTable.allShards(index)) {


It looks like we removed the null check above because we are getting the indices from the routing table by name now.

From my understanding of the code, the indices we are monitoring can be specified via a cluster setting. This means that they are under no obligation to actually exist in the cluster. If an index name that doesn't exist is present in that setting then this call to allShards will throw an IndexNotFoundException.

I think this new code is generally better since we don't need to iterate over every shard in the cluster, but we should avoid the exception if possible.

I think that is doable by calling routingTable.index(index) and checking if the returned routing table is null. If it is null we should skip that index since it does not exist. If the routing table exists, then we can call the allShards method and collect the stats.

One issue I faced while working on the feedback, is that the list of indices to monitor can include pattern blobs as well.

Is there an idiomatic way in the ES code base to expand an index pattern (like metricbeat-*) to a list of indices that match that pattern based on the current routing table? I wrote some hand rolled code for doing this but I suspect this is a more common operation.

Were these patterns supported in the logic before? If they were not then I think we should just spin that out into a separate bug fix after this change goes in. If they were, then can we just use the existing logic?

They were supported since we looped over every index and used the Regex.simpleMatch to see if a pattern matched the index. In the code I wrote to expand patterns I use the same method, but I'm still able to not check every index.

.../src/test/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollectorTests.java

jbaiera · 2022-11-17T20:15:47Z

...oring/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollector.java

+                    int shardCount = 0;
+                    for (ShardRouting shard : routingTable.allShards(index)) {
+                        if (shard.primary()) {
+                            shardCount = 0;


This looping logic worries me.

Based on the routing table code, there's nothing to suggest that shards returned from it will always be ordered such that the primary is returned before the replicas. Instead, it looks like the order of the shards is based entirely on how the table is created, and that is dictated by code that lives far away from here.

If the ordering isn't correct, then this logic could lead to id collisions for shards. If the returned shards are ordered replica1, replica1, primary1, replica1 then you'll only get three distinct id's for those shard copies. The shard routings in the second and fourth slots will have the same counter value when the primary in slot 3 resets it. Furthermore, the next shard in the list will have its counters offset unless it has the primary sorted first.

Since we'll need to get the index routing table above to make sure it's not null before using it, let's build further on that. Let's get the number of shards from the index routing table with routingTable.size(). Then let's do a for loop bound by the shard count, asking for the shard routing table for each shard number with routingTable.shard(idx). After that, we can ask each shard routing table for its primary shard and iterate over any replicas it has with safer counting logic. What do you think?

This makes a lot of sense to me.

I just wonder about one thing, the method IndexShardRoutingTable.replicaShards returns a list of all replica shards but isn't the order of those replica shards also up to whatever code instantiates the IndexShardRoutingTable, so they might also have a random ordering?

That's likely a much smaller problem than the skipping index code I had before with the primaries marking the reset point.
I'm not sure if it would be a big problem if the order of the replica shards changed in the monitoring docs?

miltonhultgren · 2022-11-22T08:29:00Z

@jbaiera Thanks for the review, I'm going to address your comments ASAP!

…-allocations

miltonhultgren · 2022-11-22T20:06:02Z

...oring/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollector.java

@@ -46,28 +48,68 @@ protected Collection<MonitoringDoc> doCollect(final MonitoringDoc.Node node, fin
        if (clusterState != null) {


This file has a lot of these nested levels which we could get rid of by flipping the guard statements, but I'm not sure what the idiomatic style is here.

It is pretty deeply nested, but I don't think it's so bad that it needs a refactoring right now. A lot of this code is deprecated and on a course to being a distant memory when the internal monitoring feature is removed.

miltonhultgren · 2022-11-22T20:07:04Z

.../src/test/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollectorTests.java

+            if (document.getShardRouting().primary()) {
+                assertThat(document.getId(), equalTo(ShardMonitoringDoc.id(stateUUID, document.getShardRouting(), 0)));
+            } else {
+                assertThat(document.getId(), matchesPattern("^[\\w-]+:(_current|_na)+:_index:s\\d:r[1-9]+[0-9]*"));


At this point, I don't know which index is being used for which monitoring document, so I'm falling back to just checking the format and that at least there isn't any replica r0 (since the 0 would be the primary).

miltonhultgren · 2022-11-22T20:07:37Z

.../src/test/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollectorTests.java

@@ -130,15 +144,60 @@ public void testDoCollect() throws Exception {
    private static RoutingTable mockRoutingTable() {


The test setup here is starting to feel a bit messy, not sure what can be done about it.

…-allocations

miltonhultgren · 2022-12-15T13:42:32Z

@jbaiera I think I've addressed the comments you left, I'm afraid I won't have much more time to invest into this issue so I hope this is good enough or close to good enough!

...oring/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollector.java

jbaiera · 2022-12-15T21:04:14Z

@miltonhultgren Sorry for the review delay. This is very close indeed! Just a couple of small fixes and it should be good to go. If you find you aren't able to continue carrying it forward, you can set the PR to allow maintainers to make edits and I can finish the cleanup! Thanks again for iterating on it!

…-allocations

miltonhultgren · 2022-12-16T08:25:42Z

@jbaiera Thanks for taking another look, I pushed a commit based on the last two comments!

jbaiera

LGTM! Thanks so much for iterating on this, especially over the holidays where feedback is easily delayed.

…91153) * [Monitoring] Make unassigned replica shard documents unique * Fix failing tests * Use IndexRoutingTable/IndexShardRoutingTable to resolve shard order * Use Random in test when shuffling * Use Set to avoid duplicates and inline variable

[Monitoring] Make unassigned replica shard documents unique

af5b954

elasticsearchmachine added v8.6.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Oct 27, 2022

miltonhultgren mentioned this pull request Oct 27, 2022

[Stack Monitoring] Use doc ID to deduplicate shard allocations elastic/kibana#143963

Merged

miltonhultgren self-assigned this Oct 28, 2022

miltonhultgren added 2 commits November 10, 2022 17:53

Fix failing tests

1254963

Merge branch 'main' of github.com:elastic/elasticsearch into sm-shard…

bbf99a8

…-allocations

miltonhultgren marked this pull request as ready for review November 10, 2022 16:57

elasticsearchmachine added the needs:triage Requires assignment of a team area label label Nov 10, 2022

miltonhultgren added >bug :Data Management/Monitoring Team:Data Management Meta label for data/management team auto-backport Automatically create backport pull requests when merged Team:Infra Monitoring UI and removed needs:triage Requires assignment of a team area label labels Nov 10, 2022

miltonhultgren requested a review from a team November 16, 2022 14:31

kingherc added v8.7.0 and removed v8.6.0 labels Nov 16, 2022

jbaiera requested review from jbaiera and removed request for a team November 17, 2022 19:20

jbaiera requested changes Nov 17, 2022

View reviewed changes

miltonhultgren added 2 commits November 22, 2022 10:57

Merge branch 'main' of github.com:elastic/elasticsearch into sm-shard…

325dd67

…-allocations

Merge branch 'main' of github.com:elastic/elasticsearch into sm-shard…

bae4a08

…-allocations

Use IndexRoutingTable/IndexShardRoutingTable to resolve shard order

10064b0

miltonhultgren requested a review from jbaiera November 22, 2022 20:04

miltonhultgren commented Nov 22, 2022

View reviewed changes

miltonhultgren added 3 commits November 23, 2022 16:24

Merge branch 'main' of github.com:elastic/elasticsearch into sm-shard…

3fc42fb

…-allocations

Use Random in test when shuffling

29ec5dd

Merge branch 'main' of github.com:elastic/elasticsearch into sm-shard…

59da890

…-allocations

jbaiera requested changes Dec 15, 2022

View reviewed changes

...oring/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollector.java Show resolved Hide resolved

...oring/src/main/java/org/elasticsearch/xpack/monitoring/collector/shards/ShardsCollector.java Outdated Show resolved Hide resolved

miltonhultgren added 2 commits December 16, 2022 09:22

Use Set to avoid duplicates and inline variable

933a9ed

Merge branch 'main' of github.com:elastic/elasticsearch into sm-shard…

3574b9b

…-allocations

miltonhultgren requested a review from jbaiera December 16, 2022 08:24

jbaiera approved these changes Jan 10, 2023

View reviewed changes

jbaiera added v8.6.1 auto-backport-and-merge and removed auto-backport Automatically create backport pull requests when merged labels Jan 10, 2023

jbaiera merged commit 232a64e into elastic:main Jan 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Monitoring] Make unassigned replica shard documents unique #91153

[Monitoring] Make unassigned replica shard documents unique #91153

miltonhultgren commented Oct 27, 2022

miltonhultgren commented Oct 27, 2022

elasticsearchmachine commented Nov 10, 2022

jbaiera left a comment

jbaiera Nov 17, 2022

miltonhultgren Nov 22, 2022 •

edited

Loading

jbaiera Dec 7, 2022

miltonhultgren Dec 15, 2022 •

edited

Loading

jbaiera Nov 17, 2022

miltonhultgren Nov 22, 2022 •

edited

Loading

miltonhultgren commented Nov 22, 2022

miltonhultgren Nov 22, 2022

jbaiera Dec 7, 2022

miltonhultgren Nov 22, 2022

miltonhultgren Nov 22, 2022

miltonhultgren commented Dec 15, 2022

jbaiera commented Dec 15, 2022

miltonhultgren commented Dec 16, 2022

jbaiera left a comment

		@@ -46,28 +48,68 @@ protected Collection<MonitoringDoc> doCollect(final MonitoringDoc.Node node, fin
		if (clusterState != null) {

		@@ -130,15 +144,60 @@ public void testDoCollect() throws Exception {
		private static RoutingTable mockRoutingTable() {

[Monitoring] Make unassigned replica shard documents unique #91153

[Monitoring] Make unassigned replica shard documents unique #91153

Conversation

miltonhultgren commented Oct 27, 2022

Summary

How to test

miltonhultgren commented Oct 27, 2022

elasticsearchmachine commented Nov 10, 2022

jbaiera left a comment

Choose a reason for hiding this comment

jbaiera Nov 17, 2022

Choose a reason for hiding this comment

miltonhultgren Nov 22, 2022 • edited Loading

Choose a reason for hiding this comment

jbaiera Dec 7, 2022

Choose a reason for hiding this comment

miltonhultgren Dec 15, 2022 • edited Loading

Choose a reason for hiding this comment

jbaiera Nov 17, 2022

Choose a reason for hiding this comment

miltonhultgren Nov 22, 2022 • edited Loading

Choose a reason for hiding this comment

miltonhultgren commented Nov 22, 2022

miltonhultgren Nov 22, 2022

Choose a reason for hiding this comment

jbaiera Dec 7, 2022

Choose a reason for hiding this comment

miltonhultgren Nov 22, 2022

Choose a reason for hiding this comment

miltonhultgren Nov 22, 2022

Choose a reason for hiding this comment

miltonhultgren commented Dec 15, 2022

jbaiera commented Dec 15, 2022

miltonhultgren commented Dec 16, 2022

jbaiera left a comment

Choose a reason for hiding this comment

miltonhultgren Nov 22, 2022 •

edited

Loading

miltonhultgren Dec 15, 2022 •

edited

Loading

miltonhultgren Nov 22, 2022 •

edited

Loading