
Polling cluster formation state for master-is-stable health indicator #88397

Merged

Conversation


@masseyke masseyke (Member) commented Jul 8, 2022

As shown in step 1.2.2.4 in the diagram at #85624 (comment), if no node has been elected master for more than 30 seconds, then when a user queries a master-eligible node, that node is supposed to reach out to all of the other master-eligible nodes and summarize the results of their cluster formation knowledge. Since we don't want to block a user request while making those remote calls, the work is done speculatively so that the information is already available when it is needed.

If a master-eligible node gets a cluster-state-changed event in which the master is null, it waits 10 seconds (in case another cluster-state-changed event quickly flips the master back to non-null, which is the most likely scenario) and then reaches out to all of the other master-eligible nodes using a transport action (#87306) to get their cluster formation data. This data is stashed away in case a user request comes in. Since the user won't hit this path until 30 seconds after the master became null (because we tell the user they have a stable master unless it has been missing for more than 30 seconds), we have plenty of time: 10 seconds of delay in case we're notified that there is a new master, plus 10 seconds to get the result back from each master-eligible node before timing out, plus 10 seconds to spare.
Since it includes nested callbacks, this code is a little hard to follow. Here's the outline (a simplified sketch in plain Java follows it):
For each master-eligible node in the cluster, we:

  1. Schedule a task to be performed after 10 seconds that connects to the master-eligible node and returns a handle to a Cancellable. This returns immediately because all of the work is done asynchronously.
    1. On an asynchronous successful response from that call to connect, we send the transport request. This also returns immediately because its work is done asynchronously.
      1. On an asynchronous successful response from the transport request, we close the connection and update the shared map with the result. Then we start back at step 1 again.
      2. On an asynchronous failure from the transport request, we close the connection and update the shared map with an exception. Then we start back at step 1 again.
    2. On an asynchronous failure from connect, we update the shared map with an exception and then start back at step 1 again.
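
To make the nested-callback structure concrete, here is a minimal, self-contained sketch of that loop in plain Java. It is not the PR's actual code: it uses a ScheduledExecutorService and CompletableFuture in place of Elasticsearch's Scheduler and TransportService, the names connectAsync, requestFormationInfoAsync, and FormationState are hypothetical stand-ins, and connection cleanup is omitted for brevity.

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class ClusterFormationPollerSketch {
        record FormationState(String summary) {}

        // Holds either a result or the exception that prevented getting one.
        record StateOrException(FormationState state, Throwable exception) {}

        private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        private final Map<String, StateOrException> results = new ConcurrentHashMap<>();
        private volatile boolean cancelled = false;

        // Step 1: after a 10-second delay, connect to the node and send the request.
        void pollUntilCancelled(String node) {
            if (cancelled) {
                return;
            }
            scheduler.schedule(() -> {
                connectAsync(node)
                    // Step 1.1: on a successful connection, send the transport request.
                    .thenCompose(connection -> requestFormationInfoAsync(connection, node))
                    // Steps 1.1.1, 1.1.2, and 1.2: record the result (or the exception,
                    // if either stage failed) in the shared map, then start back at step 1.
                    .whenComplete((state, exception) -> {
                        results.put(node, new StateOrException(state, exception));
                        pollUntilCancelled(node);
                    });
            }, 10, TimeUnit.SECONDS);
        }

        void cancelPolling() {
            cancelled = true;
        }

        // Hypothetical async stand-ins for connectToNode and the transport action.
        CompletableFuture<Object> connectAsync(String node) {
            return CompletableFuture.completedFuture(new Object());
        }

        CompletableFuture<FormationState> requestFormationInfoAsync(Object connection, String node) {
            return CompletableFuture.completedFuture(new FormationState("example"));
        }
    }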

This PR was originally a part of #88020, but has been broken off to simplify the discussion. This PR only populates the CoordinationDiagnosticsService.clusterFormationResponses field, which will be used by an updated #88020 to drive the master stability logic.

@masseyke masseyke changed the title from "Polling cluster formation state" to "Polling cluster formation state for master-is-stable health indicator" Jul 11, 2022
@masseyke masseyke marked this pull request as ready for review July 11, 2022 21:18
@elasticmachine elasticmachine added the Team:Data Management label Jul 11, 2022
@elasticmachine (Collaborator):

Pinging @elastic/es-data-management (Team:Data Management)

@masseyke masseyke requested review from andreidan and DaveCTurner and removed request for andreidan July 11, 2022 21:18
@elasticsearchmachine (Collaborator):

Hi @masseyke, I've created a changelog YAML for you.

@masseyke masseyke requested a review from andreidan July 11, 2022 21:19
@andreidan (Contributor) left a comment:

Thanks for working on this, Keith.

I've left a few suggestions to (hopefully) simplify it a bit.

Comment on lines 546 to 552
new ActionListenerResponseHandler<>(
ActionListener.runAfter(
ActionListener.runBefore(clusterFormationInfoResponseListener, () -> Releasables.close(releasable)),
() -> new PollClusterFormationStateTask(node, nodeToClusterFormationStateMap, multipleCancellablesWrapper)
.pollUntilCancelled()
),
ClusterFormationInfoAction.Response::new
@andreidan (Contributor):

Could this be simplified?

Even if just documenting the order of execution?

i.e.

close(releasable) -> clusterFormationInfoResponseListener ->  () -> new PollClusterFormationStateTask(node, nodeToClusterFormationStateMap, multipleCancellablesWrapper).pollUntilCancelled()

Hmm, actually, is the runAfter still needed?

Could the last step (() -> new PollClusterFormationStateTask(node, nodeToClusterFormationStateMap, multipleCancellablesWrapper).pollUntilCancelled() ) be executed in the clusterFormationInfoResponseListener.whenComplete callback?

@masseyke (Member, Author):

Could the last step (...) be executed in the clusterFormationInfoResponseListener.whenComplete callback?

It could, but it would have to be done twice (once in the success path and once in the fail path). The nice thing about runAfter and runBefore is that they happen either way (success or failure). I can add in a comment documenting how that works though.
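
For context, here is a simplified sketch of what those two combinators do. The Listener interface below is a stand-in for org.elasticsearch.action.ActionListener, and the real runBefore additionally routes an exception thrown by its runnable into onFailure.

    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);

        // Like ActionListener.runBefore: run the action first, then delegate --
        // on the success path and on the failure path alike.
        static <T> Listener<T> runBefore(Listener<T> delegate, Runnable action) {
            return new Listener<T>() {
                @Override
                public void onResponse(T response) {
                    action.run();
                    delegate.onResponse(response);
                }

                @Override
                public void onFailure(Exception e) {
                    action.run();
                    delegate.onFailure(e);
                }
            };
        }

        // Like ActionListener.runAfter: delegate first, then run the action --
        // again on both paths.
        static <T> Listener<T> runAfter(Listener<T> delegate, Runnable action) {
            return new Listener<T>() {
                @Override
                public void onResponse(T response) {
                    try {
                        delegate.onResponse(response);
                    } finally {
                        action.run();
                    }
                }

                @Override
                public void onFailure(Exception e) {
                    try {
                        delegate.onFailure(e);
                    } finally {
                        action.run();
                    }
                }
            };
        }
    }

So runAfter(runBefore(listener, closeConnection), reschedule) closes the connection, delivers the result, and then reschedules the poll, regardless of whether the request succeeded or failed.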

*/
public Scheduler.Cancellable pollUntilCancelled() {
StepListener<Releasable> connectionListener = new StepListener<>();
StepListener<ClusterFormationInfoAction.Response> clusterFormationInfoResponseListener = new StepListener<>();
@andreidan (Contributor):

Would this be more readable if named fetchClusterInfoListener? The Response bit is a bit redundant, maybe?

*
* @return
*/
public Scheduler.Cancellable pollUntilCancelled() {
@andreidan (Contributor):

It seems to me that untilCancelled in the method name here is not adding much information? Shall we just call it fetchClusterFormationInfo? It could also receive the node as a parameter?

Whilst here - is there a need for PollClusterFormationStateTask and MultipleCancellablesWrapper to exist?
I think their wrapping of global state makes things a bit more difficult to understand.

Maybe the fetchClusterFormationInfo could also receive a Consumer<ClusterFormationInfoAction.Response> and call that whenever it wants to pass the response to the outside world?
Given it returns a Cancellable the caller could aggregate them all in a list and cancel them without needing a MultipleCancellablesWrapper to aggregate them internally?

the signature I'm proposing is:

Scheduler.Cancellable fetchClusterFormationInfo(DiscoveryNode node, Consumer<ClusterFormationInfoAction.Response> responseConsumer)

@masseyke (Member, Author):

Oh good point about PollClusterFormationStateTask's existence. In an earlier incarnation this was an actual Runnable, and pollUntilCancelled() was just run(). I improved that, but forgot to move the state out of PollClusterFormationStateTask and then get rid of it altogether. I think the signature will have to be a little different though. More on that in an upcoming comment.

@masseyke (Member, Author):

Here's the signature I have now:
void fetchClusterFormationInfo(DiscoveryNode node, ConcurrentMap<DiscoveryNode, ClusterFormationStateOrException> nodeToClusterFormationStateMap, List<Scheduler.Cancellable> cancellables)
I can't just return a single Cancellable because one call to the method can generate lots of Cancellables (since the method schedules calls of itself). So I'm passing in the list of Cancellables that cancelPollingClusterFormationInfo acts on. This avoids the need for the MultipleCancellablesWrapper, but it effectively does the same thing. And I passed in a nodeToClusterFormationStateMap because I can't have just a ClusterFormationInfoAction.Response Consumer, since one possible outcome is an Exception rather than a Response. I'll see if making that a Consumer makes it more or less readable.

@masseyke (Member, Author):

OK, the method now accepts a Consumer<ClusterFormationStateOrException> instead of a ConcurrentMap<DiscoveryNode, ClusterFormationStateOrException> nodeToClusterFormationStateMap. I think it helps out a little bit. The functionality is unchanged.
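
Such a holder might look roughly like this (a sketch only -- the real ClusterFormationStateOrException is defined in the PR, and ClusterFormationState is a placeholder here):

    // Placeholder for the real coordination-layer type.
    record ClusterFormationState() {}

    // Either-style holder: exactly one of the two fields is non-null.
    record ClusterFormationStateOrException(ClusterFormationState state, Exception exception) {
        ClusterFormationStateOrException {
            if ((state == null) == (exception == null)) {
                throw new IllegalArgumentException("exactly one of state and exception must be non-null");
            }
        }
    }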

@andreidan (Contributor) commented Jul 13, 2022:

I think this is a step in the right direction, thanks for implementing it!

I can't just return a single Cancellable because one call to the method can generate lots of Cancellables (since the method schedules calls of itself)

Would this be an avenue for simplification?

Say the method signature is as follows

Scheduler.Cancellable fetchClusterFormationInfo(DiscoveryNode node, Consumer<ClusterFormationInfoAction.Response> responseConsumer)

This means the caller is responsible for re-calling the fetchClusterFormationInfo method when the responseConsumer is called.

I think this would remove the need for the runAfter call in that runAfter(runBefore(...)) sequence, because it seems we'll want to reschedule irrespective of being able to get the response or not - so when responseConsumer is called we should be rescheduling.
i.e. we always call responseConsumer irrespective of how fetchClusterInfoListener completes:

                new ActionListenerResponseHandler<>(
                    ActionListener.runBefore(fetchClusterInfoListener, () -> Releasables.close(releasable)),
                    ClusterFormationInfoAction.Response::new
                )

...

fetchClusterInfoListener.whenComplete(response -> {
            long endTime = System.nanoTime();
            logger.trace("Received cluster coordination info from {} in {}", node, TimeValue.timeValueNanos(endTime - startTime));
            responseConsumer.accept(new ClusterFormationStateOrException(response.getClusterFormationState()));
        }, e -> {
            logger.warn("Exception in cluster coordination info request to master node", e);
            responseConsumer.accept(new ClusterFormationStateOrException(e));
        });

Consuming fetchClusterFormationInfo would look something along the lines of:

        getMasterEligibleNodes().forEach(masterNode -> {
            Consumer<ClusterFormationStateOrException> responseConsumer = response -> {
                if (clusterFormationResponses != null) {
                    clusterFormationResponses.put(masterNode, response);
                }
            };
            
            Scheduler.ScheduledCancellable scheduleFetch = fetchClusterFormationInfo(masterNode, responseConsumer.andThen(response -> {
                if (clusterFormationInfoTasks != null) {
                    // reschedule the fetch if it wasn't cancelled already
                    clusterFormationInfoTasks.add(fetchClusterFormationInfo(masterNode, responseConsumer));
                }
            }));
            
            clusterFormationInfoTasks.add(scheduleFetch);
        });

This would reduce the scope of fetchClusterFormationInfo and the need for a Consumer<Cancellable>.

What do you think?

@masseyke (Member, Author):

Unless I'm misunderstanding something, we can't use a Consumer<ClusterFormationInfoAction.Response> here because then we'd have no way of handling exceptions, right? And we need to keep track of exceptions we encounter.
And even using a Consumer<ClusterFormationStateOrException>, I don't think the approach you're suggesting will work as-is, because it misses retries on a connectToNode exception (https://github.com/elastic/elasticsearch/pull/88397/files#diff-6898ba5666bbb351e267e9295e8feb7fbabf0d95d77aa4375c06cb2c16ad0a7dR529). I could probably add in a separate consumer for connectToNode, but at that point it's starting to get even more complex. Is there any use case that is missed by the current code?

@masseyke (Member, Author):

Also, I think your proposal exposes us to the race condition here, right (although I'm not 100% sure that it causes big problems)? #88397 (comment)
But that is fixable.

@andreidan (Contributor) commented Jul 13, 2022:

And even using a Consumer I don't think the approach you're suggesting will work as-is, because it misses retries on a connectToNode exception (https://github.com/elastic/elasticsearch/pull/88397/files#diff-6898ba5666bbb351e267e9295e8feb7fbabf0d95d77aa4375c06cb2c16ad0a7dR529)

Ah, apologies, I did mean Consumer<ClusterFormationStateOrException>. But if connectToNode raises an exception, the response consumer will still be called https://github.com/elastic/elasticsearch/pull/88397/files#diff-6898ba5666bbb351e267e9295e8feb7fbabf0d95d77aa4375c06cb2c16ad0a7dR524 (and the responseConsumer will track the exception and reschedule the fetch). Am I missing something?

Is there any use case that is missed by the current code?

I think it's a bit difficult to follow; that's why I suggested reducing the scope of fetch... and having a clear view over which callback reschedules the polling and when that happens. I'd argue that if the suggestion I proposed above works (I might be missing something), it'll be clear from the method signature that fetch... schedules one Cancellable task and has one consumer that gets a success or an exception (which could be chained to reschedule the fetch via andThen...).

@masseyke (Member, Author):

OK, I was mistaken about where it was failing. As we discussed offline, the actual reason it was failing was because the code at #88397 (comment) would stop polling after 2 attempts. We're now doing something very similar to that, but recursive so that it continues polling until cancelled.
We also discussed offline that I think the code above could throw NullPointerExceptions since clusterFormationInfoTasks and clusterFormationResponses could become null in between the null check and using them. Rather than synchronizing that code, I've put both of those references on the stack so that they're never null.
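
The idiom being described is roughly the following (a sketch; the field and value types are simplified stand-ins):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class NullSafetySketch {
        // Reassigned to a fresh map (or to null) as polling starts and stops.
        private volatile ConcurrentMap<String, Object> clusterFormationResponses = new ConcurrentHashMap<>();

        void recordResponse(String node, Object response) {
            // Racy: the field could be set to null between the check and the put.
            // if (clusterFormationResponses != null) clusterFormationResponses.put(node, response);

            // Safe: read the field once into a local; the local can never become null.
            ConcurrentMap<String, Object> responses = clusterFormationResponses;
            if (responses != null) {
                responses.put(node, response);
            }
        }
    }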

Comment on lines 491 to 493
* This is a reference to the global nodeToClusterFormationStateMap that was current at the time this object was constructed. The
* global map is recreated whenever the task is cancelled. Having this reference prevents accidental writes to that map after
* cancellation.
@andreidan (Contributor):

This is a bit confusing, as it implies we're copying it (i.e. "that was current at the time this object was constructed") but also "references" it... which means it's going to change in line with the global map.

I'm also not sure about the accidental writes to "that" map (it's "this" map, isn't it? :) ), nor what's accidental about the said writes.

Could we maybe talk about why we have this map here, as opposed to the implementation details of how it arrived here? (Apologies if I'm misunderstanding its purpose.)

Update: maybe we should drop PollClusterFormationStateTask altogether, in which case ignore the above :)

@masseyke (Member, Author):

Below is a scenario showing why I'm copying the clusterFormationResponses reference and passing it in rather than using the global clusterFormationResponses. Maybe I'm being unnecessarily cautious.

1. We begin polling master-eligible nodes A, B, C.
2. We create a new global clusterFormationResponses map.
3. We update the global clusterFormationResponses with results from A, B.
4. We cancel polling.
5. We begin polling master-eligible nodes A, B, D (note that it's a different set).
6. We create a new global clusterFormationResponses map.
7. We update the global clusterFormationResponses with a result from C (the write was happening before the cancel took effect, and I don't think cancellation will interrupt a map insert, but maybe I'm wrong).
8. We update the global clusterFormationResponses with results from A, B, D.

So now we have results for A, B, C, and D, even though C is no longer part of the cluster. So we get weird results when we look to see if we can form a quorum, or if C knows about the existence of D. Or even if it's not as extreme as a node being replaced, we have results that might be out of date, causing confusion. My simple way to avoid worrying about all of this was to just pass the method a reference to the clusterFormationResponses map that it is supposed to be using.
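
In code, that pattern amounts to handing each polling round the map it should write to (a sketch with simplified names and types):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class PerRoundMapSketch {
        private volatile ConcurrentMap<String, Object> clusterFormationResponses;

        void beginPollingClusterFormationInfo(Set<String> masterEligibleNodes) {
            // Each polling round gets its own map; the field points at the current one.
            ConcurrentMap<String, Object> thisRoundsResponses = new ConcurrentHashMap<>();
            clusterFormationResponses = thisRoundsResponses;
            for (String node : masterEligibleNodes) {
                // Each task writes only to the map it was handed, so a late write from
                // a cancelled round lands in that round's old map, which nothing reads.
                pollNode(node, thisRoundsResponses);
            }
        }

        void pollNode(String node, ConcurrentMap<String, Object> responses) {
            // Asynchronous polling elided; results would be put into `responses`.
        }
    }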

*/
beginPollingClusterFormationInfo();
} else {
cancelPollingClusterFormationInfo();
@andreidan (Contributor):

IMO we shouldn't be constantly calling this cancel... when the cluster is healthy, but only when re-gaining a master.
@DaveCTurner what do you think?

Contributor:

It's not a huge deal. Maybe we could use null instead of new ConcurrentHashMap<>() to distinguish "we're not polling" from "we're polling but have no entries"?

@masseyke (Member, Author):

My thinking was that it's basically a no-op (there won't be any scheduled tasks most of the time), and the minuscule performance hit was worth avoiding the risk of complicating the code and/or accidentally not calling it when it needed to be called.
I can have cancelPollingClusterFormationInfo set clusterFormationResponses to null and have beginPollingClusterFormationInfo create a new ConcurrentHashMap<>() (I think that's what you're suggesting?) -- that would save a little garbage collection.
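
That change would look something like the following sketch (simplified field and value types):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class PollingLifecycleSketch {
        // null means "we're not polling"; an empty map means "we're polling but have no entries yet".
        private volatile ConcurrentMap<String, Object> clusterFormationResponses = null;

        void beginPollingClusterFormationInfo() {
            clusterFormationResponses = new ConcurrentHashMap<>();
            // ...schedule the per-node polling tasks...
        }

        void cancelPollingClusterFormationInfo() {
            // ...cancel the per-node polling tasks...
            clusterFormationResponses = null;
        }
    }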

@masseyke masseyke requested a review from andreidan July 12, 2022 22:31
@masseyke (Member, Author):

@elasticmachine update branch

@masseyke (Member, Author):

@elasticmachine update branch

@elasticmachine (Collaborator):

expected head sha didn’t match current head ref.

@andreidan (Contributor) left a comment:

Thanks for iterating on this, Keith.

LGTM

* completed, adding the resulting Cancellable to cancellableConsumer.
* @param masterEligibleNode The node being polled
* @param responseConsumer The response consumer to be wrapped
* @param cancellableConsumer The list of Cancellables
@andreidan (Contributor):

This is not accurate anymore.

}

/**
* This wraps the responseConsumer in a Consumer that will run beginPollingClusterFormationInfo() after responseConsumer has
@andreidan (Contributor):

It will run fetchClusterFormationInfo, I believe?

/*
* This test sets up a 4-node cluster (3 master eligible). We call beginPollingClusterFormationInfo() on each node. We then
* cancel all tasks. This simulates what will happen most often in practice -- polling is triggered when the master node goes
* null, and then polling is cancelled when a new master node is elected within 10 seconds. We then simulate the cluster running
@andreidan (Contributor):

"polling is cancelled when a new master node is elected within 10 seconds"

I think polling is cancelled immediately after a master node is detected. Maybe I'm misunderstanding what you meant here by the 10 seconds delay?

@masseyke (Member, Author):

I'll update the wording -- it was meant to read as "polling is cancelled immediately when a new master node is elected, which is usually well within 10 seconds of when we were alerted that the master node went null", or something like that.

@masseyke (Member, Author):

@elasticmachine update branch

@masseyke masseyke merged commit 3c5bf7e into elastic:master Jul 14, 2022
@masseyke masseyke deleted the feature/polling-cluster-formation-state branch July 14, 2022 15:58
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 15, 2022
* upstream/master: (2974 commits)
  Reserved cluster state service (elastic#88527)
  Add transport action immutable state checks (elastic#88491)
  Remove suggest flag from index stats docs (elastic#85479)
  Polling cluster formation state for master-is-stable health indicator (elastic#88397)
  Add test execution guide in yamlRestTest asciidoc (elastic#88490)
  Add troubleshooting guide for corrupt repository (elastic#88391)
  [Transform] Finetune Schedule to be less noisy on retry and retry slower (elastic#88531)
  Updatable API keys - auto-update legacy RDs (elastic#88514)
  Fix typo in TransportForceMergeAction and TransportClearIndicesCacheA… (elastic#88064)
  Fixed NullPointerException on bulk request (elastic#88358)
  Avoid needless index metadata builders during reroute (elastic#88506)
  Set metadata on request in API key noop test (elastic#88507)
  Fix passing positional args to ES in Docker (elastic#88502)
  Improve description for task api detailed param (elastic#88493)
  Support cartesian shape with doc values (elastic#88487)
  Promote usage of Subjects in Authentication class (elastic#88494)
  Add CCx 2.0 feature flag (elastic#88451)
  Reword the watcher 'always' and 'never' condition docs (elastic#86105)
  Simplify azure discovery installation docs (elastic#88404)
  Breakup FIPS CI testing jobs
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
#	x-pack/plugin/mapper-aggregate-metric/src/main/java/org/elasticsearch/xpack/aggregatemetric/mapper/AggregateDoubleMetricFieldMapper.java