Polling for cluster diagnostics information #88562
Conversation
Hi @masseyke, I've created a changelog YAML for you.
Pinging @elastic/es-data-management (Team:Data Management)
Consumer<Scheduler.Cancellable> cancellableConsumer
) {
    masterEligibleNodes.stream().findAny().ifPresentOrElse(masterEligibleNode -> {
        cancellableConsumer.accept(
It looks like this list will grow indefinitely while the master is missing. Should we simplify here? Is there ever a time when the list will contain more than one validly cancellable object?
That's a really good point, and the same thing impacts #88397 (it was an unintended side effect of a late change on that one). I'm pretty sure I could just keep one, and whenever I'm about to set the value I cancel the existing one. I think that would be better than potentially leaving a cancellable task running, right?
It only reschedules the execution of the remote diagnostic after it receives a response from the remote server, right? At that point the item isn't really cancellable any more; it should be done. I think you're safe to just replace it in the regular rescheduling case, and leave any cancellation logic to code outside the regular reschedule loop?
Also it looks like the only thing that is actually cancellable is the thread pool task that is waiting 10 seconds. If a master node comes back and it attempts to cancel the task but it has already started the remote diagnostic, does the cancellation get ignored and the rescheduling continue?
Oh you're right -- we don't schedule this every 10 seconds. We schedule it every 10 seconds after the previous one completes (or errs out). So we just need to keep track of a single cancellable, and we only need to cancel it in the `cancelPollingRemoteStableMasterHealthIndicatorService()` method. I'll update that now. Thanks.
> If a master node comes back and it attempts to cancel the task but it has already started the remote diagnostic, does the cancellation get ignored
I believe I looked into this a few weeks ago, and I think I convinced myself that calling cancel issues an interrupt that interrupts network requests, but I could be wrong. Either way, I don't think there's much else we can do about it.
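For illustration, here's a minimal sketch of the reschedule-after-completion pattern being described, using a plain JDK `ScheduledExecutorService` and hypothetical names rather than the actual Elasticsearch `Scheduler` and transport code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class RemoteDiagnosticsPoller {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // At most one pending task exists at a time, because the next run is
    // scheduled only after the previous one completes (or errs out).
    private final AtomicReference<ScheduledFuture<?>> pendingTask = new AtomicReference<>();

    void scheduleNextPoll() {
        pendingTask.set(scheduler.schedule(this::pollOnce, 10, TimeUnit.SECONDS));
    }

    private void pollOnce() {
        try {
            // ... stand-in for the remote diagnostics transport call ...
        } finally {
            scheduleNextPoll(); // reschedule only after this run finishes
        }
    }

    void cancelPolling() {
        ScheduledFuture<?> task = pendingTask.get();
        if (task != null) {
            task.cancel(true); // only the 10-second wait is really cancellable
        }
    }
}
```

Note that this naive version still has the cancel/reschedule race discussed below: `cancelPolling()` can grab the old future just before `pollOnce()` replaces it.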
> I believe I looked into this a few weeks ago, and I think I convinced myself that calling cancel issues an interrupt that interrupts network requests, but I could be wrong. Either way, I don't think there's much else we can do about it.
We might want to move the rescheduling logic into the runnable that is cancellable then - if we wait until after the results are in to reschedule the remote diagnostic, then we might miss the cancellation when a new master shows up that says to stop collecting remote diagnostics.
Even if we use an atomic reference to store the active cancellable, I think it's still possible to end up in a runaway execution. Example:
- Thread A is running the remote diagnostic logic; it is almost finished and about to reschedule the next execution
- Thread B is trying to cancel the remote diagnostic
- Thread B gets the cancellable task out of the atomic reference
- Thread A schedules the next execution and replaces the old cancellable in the atomic reference
- Thread B calls cancel() on the old cancellable and moves on
- The remote diagnostic check will continue its execution 10 seconds later because the new cancellable has not been cancelled.
A couple of ways we might be able to guard against this: we could mutex the cancellation and scheduling logic so we can't schedule anything new while something is being cancelled. Once cancelled, the scheduling logic will need to check something to see that it is indeed cancelled, to avoid scheduling anyway once it obtains the lock.
It might be even safer to modify the rescheduling logic to check if it should not reschedule any more. Either the current node has a master or the diagnostic response indicates there is a master, or something else?
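As a sketch of the first suggestion (hypothetical names and a plain JDK scheduler; not the PR's actual code), a single mutex covering both paths plus a cancelled check before scheduling:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class GuardedPoller {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Object mutex = new Object();
    private ScheduledFuture<?> pendingTask; // guarded by mutex
    private boolean cancelled;              // guarded by mutex

    void scheduleNextPoll() {
        synchronized (mutex) {
            if (cancelled) {
                return; // a concurrent cancel won; don't schedule anyway
            }
            pendingTask = scheduler.schedule(this::pollOnce, 10, TimeUnit.SECONDS);
        }
    }

    void cancelPolling() {
        synchronized (mutex) {
            cancelled = true; // any in-flight reschedule attempt now bails out
            if (pendingTask != null) {
                pendingTask.cancel(true);
            }
        }
    }

    private void pollOnce() {
        try {
            // ... fetch remote diagnostics ...
        } finally {
            scheduleNextPoll(); // safe: re-checks 'cancelled' under the mutex
        }
    }
}
```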
OK, I think I've taken care of this risk now. I'm putting an `AtomicBoolean` `isCancelled` variable on the stack each time we begin polling, and I'm synchronizing access in a few places. I've commented in the code where I'm not synchronizing access and why I think it's OK.
I'm pretty sure it's still a problem (and actually a bigger one since there we're polling all nodes instead of one). But I'll put that into a separate PR.
Thanks for working on this, Keith. I left a few comments and questions.
* This wraps the responseConsumer in a Consumer that will run rescheduleFetchConsumer() after responseConsumer has
* This wraps the responseConsumer in a Consumer that will run fetchClusterFormationInfo() after responseConsumer has
I don't think this comment is accurate:

`responseConsumer.andThen(rescheduleClusterFormationFetchConsumer)`

I think it will run the `rescheduleClusterFormationFetchConsumer`.
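For reference, `Consumer.andThen` returns a composed consumer that runs the original consumer first and the chained one second, so the rescheduling step runs as part of the composed consumer rather than being invoked by the response handler itself. A tiny standalone demo:

```java
import java.util.function.Consumer;

class AndThenDemo {
    public static void main(String[] args) {
        Consumer<String> responseConsumer = r -> System.out.println("handle " + r);
        Consumer<String> reschedule = r -> System.out.println("reschedule after " + r);
        // The composed consumer runs responseConsumer, then reschedule.
        responseConsumer.andThen(reschedule).accept("response");
        // prints:
        // handle response
        // reschedule after response
    }
}
```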
synchronized (remoteDiagnosticsMutex) {
if (remoteCoordinationDiagnosticsCancelled.get()) { // Don't start a 2nd one if the last hasn't been cancelled
The synchronized block can be avoided here by doing a `compareAndSet` on the `AtomicBoolean` - I renamed `remoteCoordinationDiagnosticsCancelled` to `isRemoteFetchRunning`, but some better naming is needed here :)
synchronized (remoteDiagnosticsMutex) {
if (remoteCoordinationDiagnosticsCancelled.get()) { // Don't start a 2nd one if the last hasn't been cancelled
if (isRemoteFetchRunning.compareAndSet(false, true)) { // Don't start a 2nd one if the last is still running
Is there a need for both `remoteCoordinationDiagnosticsCancelled` and `isCancelled`? They seem to be the same thing - unless I'm missing something. Otherwise could we document a bit better what each does?
I'll add more comments. `remoteCoordinationDiagnosticsCancelled` is a reference to the one for the current/most recent poll. It is kept as an object-level variable so that `cancelPollingRemoteStableMasterHealthIndicatorService()` can cancel it. `isCancelled` is a reference to the one for a specific poll, which is why it's kept on the stack. So if we get to a point where there are two polls running at once, the older one will check its copy of `isCancelled` and see that `cancelPollingRemoteStableMasterHealthIndicatorService()` has cancelled it (even if the object-level `remoteCoordinationDiagnosticsCancelled` has since been replaced and its current value is `false`).
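A minimal sketch of that two-flag arrangement (hypothetical shape, not the PR's exact code): each poll allocates its own flag on the stack, and the object-level field always points at the most recent poll's flag:

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TwoFlagPoller {
    // Always points at the flag of the current/most recent poll, so
    // cancelPolling() can reach it.
    private volatile AtomicBoolean remoteCoordinationDiagnosticsCancelled;

    void beginPolling() {
        // Stack-local flag for this specific poll; an older poll keeps
        // checking its own copy even after the field is replaced.
        final AtomicBoolean isCancelled = new AtomicBoolean(false);
        remoteCoordinationDiagnosticsCancelled = isCancelled;
        pollLoop(isCancelled);
    }

    void cancelPolling() {
        AtomicBoolean current = remoteCoordinationDiagnosticsCancelled;
        if (current != null) {
            current.set(true);
        }
    }

    private void pollLoop(AtomicBoolean isCancelled) {
        if (isCancelled.get()) {
            return; // this particular poll was cancelled; stop rescheduling
        }
        // ... fetch diagnostics, then reschedule pollLoop(isCancelled) ...
    }
}
```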
> The synchronized block can be avoided here by doing a `compareAndSet` on the `AtomicBoolean` - I renamed `remoteCoordinationDiagnosticsCancelled` to `isRemoteFetchRunning`, but some better naming is needed here :)
The reason for the mutex is to protect the assignment of 3 variables atomically -- `remoteStableMasterHealthIndicatorTask`, `remoteCoordinationDiagnosisResult`, and `remoteCoordinationDiagnosticsCancelled` -- all while knowing that `remoteCoordinationDiagnosticsCancelled` is `true`. I don't think your suggestion covers that, does it?
These assignments happen inside the proposed CAS operation, i.e.:

`if (isRemoteFetchRunning.compareAndSet(false, true)) {`

The one thread that gets to flip this boolean will be granted access to the contents of this if-block. I believe it would be covered.

EDIT: I think I see what you mean - an interleaving `cancelPollingRemoteStableMasterHealthIndicatorService` call could override some of the assignments.
synchronized (remoteDiagnosticsMutex) {
    if (isCancelled.get()) {
Similar to above - since we have the `Atomic*` objects, CAS operations can provide the "execute this bit one time if the value is what I want it to be" semantics. I don't think we need a mutex (unless I'm missing something?)
I think what you're missing is that it's not just protecting the read of one variable -- it's protecting the read and assignment of 3 variables: `remoteStableMasterHealthIndicatorTask`, `remoteCoordinationDiagnosisResult`, and `remoteCoordinationDiagnosticsCancelled`.
long startTime = System.nanoTime();
connectionListener.whenComplete(releasable -> {
    if (isCancelled.get()) {
        IOUtils.close(releasable);
Would this throw `IOException`?
It could. But previously we were using `Releasables.close()`, which also can throw an exception (it catches the `IOException` and rethrows it as an `UncheckedIOException` so that it doesn't cause compilation errors with the streams API), but the result is the same. In this block I think it won't matter because this poll was cancelled anyway, so we don't need to make sure to reschedule anything. In the block below where `Releasables.close()` is used, I believe the exception will be caught by `fetchCoordinationDiagnosticsListener`'s exception handler, and the task will be rescheduled.
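For context, a generic sketch of that rethrow pattern (plain Java; not the actual `Releasables.close()` implementation):

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

final class CloseUtil {
    // Close a resource, converting the checked IOException into an
    // unchecked one so callers (e.g. inside stream pipelines) need no
    // throws clause; the failure still propagates to the caller.
    static void closeUnchecked(Closeable resource) {
        try {
            resource.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```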
private void cancelPollingRemoteStableMasterHealthIndicatorService() {
    synchronized (remoteDiagnosticsMutex) {
        if (remoteStableMasterHealthIndicatorTask != null && remoteCoordinationDiagnosticsCancelled.getAndSet(true) == false) {
> `remoteCoordinationDiagnosticsCancelled.getAndSet(true) == false`

Is `compareAndSet` a better option here?
I don't think so, for the same reasons given above.
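(For what it's worth, the two forms are interchangeable as a one-shot guard on an `AtomicBoolean` -- both admit exactly the one caller that observes `false` -- so the mutex question is the separate issue here:)

```java
import java.util.concurrent.atomic.AtomicBoolean;

class GuardEquivalence {
    public static void main(String[] args) {
        AtomicBoolean a = new AtomicBoolean(false);
        AtomicBoolean b = new AtomicBoolean(false);
        System.out.println(a.getAndSet(true) == false);   // true: first caller
        System.out.println(b.compareAndSet(false, true)); // true: first caller
        System.out.println(a.getAndSet(true) == false);   // false thereafter
        System.out.println(b.compareAndSet(false, true)); // false thereafter
    }
}
```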
List<DiscoveryNode> masterEligibleNodes = new ArrayList<>(getMasterEligibleNodes());
if (masterEligibleNodes.isEmpty()) {
    cancelPollingRemoteStableMasterHealthIndicatorService();
} else {
    beginPollingRemoteStableMasterHealthIndicatorService(masterEligibleNodes);
}
}
I'm wondering: are these the correct start/stop triggers for this polling? Even further, for the cancel... trigger - is the last call we receive for `onFoundPeersUpdated` when discovery ends always coinciding with an empty list of `masterEligibleNodes`, so as to satisfy the cancel condition `if (masterEligibleNodes.isEmpty())`?
Would it be simpler to start the polling of one random master-eligible node (as indicated here) from the cluster state change event notification that indicates we lost the master node? And stop the polling when we've got a master?
(similar to how we do it for the cluster formation fetch/cancel, but in the case when the local node is not master eligible)
The `onFoundPeersUpdated` could be used to keep track of the discovered peers in a `CopyOnWriteArrayList`, and the polling track could pick a random master-eligible node from this list to poll. (Just one would be enough I think - based on the reference diagram - as long as one is reachable; when it is not, we can pick another one.)
Would this simplify the design?
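A rough sketch of that alternative (hypothetical names apart from `DiscoveryNode`; the listener signature is assumed here, not the actual `PeerFinder` API):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadLocalRandom;

import org.elasticsearch.cluster.node.DiscoveryNode;

class PeerTrackingPoller {
    private final CopyOnWriteArrayList<DiscoveryNode> masterEligiblePeers = new CopyOnWriteArrayList<>();

    // Hypothetical hook: called whenever discovery updates the peer set.
    // clear-then-add is not atomic, but the polling side only needs a
    // roughly current snapshot.
    void onFoundPeersUpdated(List<DiscoveryNode> peers) {
        masterEligiblePeers.clear();
        peers.stream().filter(DiscoveryNode::isMasterNode).forEach(masterEligiblePeers::add);
    }

    // Called by the polling track every 10 seconds.
    void pollTick() {
        List<DiscoveryNode> snapshot = List.copyOf(masterEligiblePeers);
        if (snapshot.isEmpty()) {
            return; // nothing discovered yet; just try again on the next tick
        }
        DiscoveryNode target = snapshot.get(ThreadLocalRandom.current().nextInt(snapshot.size()));
        // ... send the diagnostics request to 'target' ...
    }
}
```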
The problem is that non-master-eligible nodes do not have any peers in PeerFinder at the time that `clusterChanged` is called with a notification that the master is null. PeerFinder's peers list is populated at some point after that in another thread. I had originally done what you say here, but realized in an integration test (in an upcoming PR) that I would never get any remote diagnostics back because the list of master-eligible nodes was always empty, so the polling would stop immediately.
Wouldn't the polling retry 10 seconds later? I.e. if there's nothing in the list of discovered peers, we should just retry 10 seconds later and pick a random node from the list of peers at that point... if any are available - otherwise, maybe on the next execution 10 seconds later...

> PeerFinder's peers list is populated at some point after that in another thread.

The polling will eventually read a populated list. We could leave the polling track to attempt to read the master-eligible nodes using `getMasterEligibleNodes` (I guess preferable), or continue to use `onFoundPeersUpdated` to record locally a list of peers and make this local list available to the polling track.
Closing this in favor of #89014
As shown in step 1.2.2.3 in the diagram at #85624 (comment), if there has not been a node elected master for more than 30 seconds, then when a user queries a non-master-eligible node it is supposed to reach out to a master-eligible node to gather information about what that node thinks of master stability. Since we don't want to block a user request while making those remote calls, it should be done speculatively so that we have the information when it is needed.

If a non-master-eligible node gets a cluster state changed event that the master is null, then it will wait 10 seconds (in case it gets another cluster state changed event that the master has flipped to non-null quickly, which is the most likely scenario), then reach out to a random master-eligible node using a transport action (#87984) to get its cluster diagnostics data. This data is stashed away in case a user request comes in. Since the user won't hit this path for 30 seconds after a master has become null (because we tell the user they have a stable master unless it's been missing for more than 30 seconds), we have plenty of time for this: 10 seconds of delay in case we're notified that there is a new master, plus 10 seconds to get the result back from the master-eligible node before timing out, plus 10 seconds to spare.
Note that the pattern used in this PR for doing asynchronous polling is very similar to what was done in #88397. The main differences are that we're polling a single master node here instead of all of them, and we're using a different transport action.
Also note that there will be a follow-up PR that actually uses this information to make decisions about master stability. This PR only keeps it for later use.
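As a sketch of the stash-for-later-use shape described above (all names hypothetical; the follow-up PR is what actually consumes the value), the poller writes its latest response into an `AtomicReference` that the health-indicator path can read without blocking:

```java
import java.util.concurrent.atomic.AtomicReference;

class RemoteDiagnosticsCache {
    // Hypothetical result type standing in for the transport action's response.
    record RemoteDiagnosticsResult(long timestampNanos, String summary) {}

    private final AtomicReference<RemoteDiagnosticsResult> latest = new AtomicReference<>();

    // Called by the polling track each time a master-eligible node responds.
    void onResponse(RemoteDiagnosticsResult result) {
        latest.set(result);
    }

    // Called on the user-request path; never blocks on a remote call.
    RemoteDiagnosticsResult mostRecent() {
        return latest.get();
    }
}
```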