Polling for cluster diagnostics information #88562
Conversation
Hi @masseyke, I've created a changelog YAML for you.
Pinging @elastic/es-data-management (Team:Data Management)
Consumer<Scheduler.Cancellable> cancellableConsumer
) {
    masterEligibleNodes.stream().findAny().ifPresentOrElse(masterEligibleNode -> {
        cancellableConsumer.accept(
It looks like this list will grow indefinitely while the master is missing. Should we simplify here? Is there ever a time when the list will contain more than one validly cancellable object?
That's a really good point, and the same thing impacts #88397 (it was an unintended side effect of a late change on that one). I'm pretty sure I could just keep one, and whenever I'm about to set the value I cancel the existing one. I think that would be better than potentially leaving a cancellable task running, right?
It only reschedules the execution of the remote diagnostic after it receives a response from the remote server, right? At that point the item isn't really cancellable any more; it should be done. I think you're safe to just replace it in the regular rescheduling case, and leave any cancellation logic to code outside the regular reschedule loop?
Also it looks like the only thing that is actually cancellable is the thread pool task that is waiting 10 seconds. If a master node comes back and it attempts to cancel the task but it has already started the remote diagnostic, does the cancellation get ignored and the rescheduling continue?
Oh you're right -- we don't schedule this every 10 seconds. We schedule it every 10 seconds after the previous one completes (or errs out). So we just need to keep track of a single cancellable, and we only need to cancel it in the `cancelPollingRemoteStableMasterHealthIndicatorService()` method. I'll update that now. Thanks.
> If a master node comes back and it attempts to cancel the task but it has already started the remote diagnostic, does the cancellation get ignored
I believe I looked into this a few weeks ago, and I think I convinced myself that calling cancel issues an interrupt that interrupts network requests, but I could be wrong. Either way, I don't think there's much else we can do about it.
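For illustration, here's a minimal sketch of the reschedule-after-completion pattern being described, using a plain JDK `ScheduledExecutorService` and hypothetical names rather than the actual Elasticsearch `Scheduler` and transport code:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class RemoteDiagnosticsPoller {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // At most one pending task exists at a time, because the next run is
    // scheduled only after the previous one completes (or errs out).
    private final AtomicReference<ScheduledFuture<?>> pendingTask = new AtomicReference<>();

    void scheduleNextPoll() {
        pendingTask.set(scheduler.schedule(this::pollOnce, 10, TimeUnit.SECONDS));
    }

    private void pollOnce() {
        try {
            // ... stand-in for the remote diagnostics transport call ...
        } finally {
            scheduleNextPoll(); // reschedule only after this run finishes
        }
    }

    void cancelPolling() {
        ScheduledFuture<?> task = pendingTask.get();
        if (task != null) {
            task.cancel(true); // only the 10-second wait is really cancellable
        }
    }
}
```

Note that this naive version still has the cancel/reschedule race discussed below: `cancelPolling()` can grab the old future just before `pollOnce()` replaces it.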
> I believe I looked into this a few weeks ago, and I think I convinced myself that calling cancel issues an interrupt that interrupts network requests, but I could be wrong. Either way, I don't think there's much else we can do about it.
We might want to move the rescheduling logic into the runnable that is cancellable then - if we wait until after the results are in to reschedule the remote diagnostic, then we might miss the cancellation when a new master shows up that says to stop collecting remote diagnostics.
Even if we use an atomic reference to store the active cancellable, I think it's still possible to end up in a runaway execution. Example:
- Thread A is running the remote diagnostic logic; it is almost finished and about to reschedule the next execution
- Thread B is trying to cancel the remote diagnostic
- Thread B gets the cancellable task out of the atomic reference
- Thread A schedules the next execution and replaces the old cancellable in the atomic reference
- Thread B calls cancel() on the old cancellable and moves on
- The remote diagnostic check will continue its execution 10 seconds later because the new cancellable has not been cancelled.
A couple of ways we might be able to guard against this: we could mutex the cancellation and scheduling logic so we can't schedule anything new while something is being cancelled. Once cancelled, the scheduling logic will need to check something to see that it is indeed cancelled, to avoid scheduling anyway once it obtains the lock.
It might be even safer to modify the rescheduling logic to check if it should not reschedule any more. Either the current node has a master or the diagnostic response indicates there is a master, or something else?
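As a sketch of the first suggestion (hypothetical names and a plain JDK scheduler; not the PR's actual code), a single mutex covering both paths plus a cancelled check before scheduling:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

class GuardedPoller {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final Object mutex = new Object();
    private ScheduledFuture<?> pendingTask; // guarded by mutex
    private boolean cancelled;              // guarded by mutex

    void scheduleNextPoll() {
        synchronized (mutex) {
            if (cancelled) {
                return; // a concurrent cancel won; don't schedule anyway
            }
            pendingTask = scheduler.schedule(this::pollOnce, 10, TimeUnit.SECONDS);
        }
    }

    void cancelPolling() {
        synchronized (mutex) {
            cancelled = true; // any in-flight reschedule attempt now bails out
            if (pendingTask != null) {
                pendingTask.cancel(true);
            }
        }
    }

    private void pollOnce() {
        try {
            // ... fetch remote diagnostics ...
        } finally {
            scheduleNextPoll(); // safe: re-checks 'cancelled' under the mutex
        }
    }
}
```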
OK, I think I've taken care of this risk now. I'm putting an `AtomicBoolean` `isCancelled` variable on the stack each time we begin polling, and I'm synchronizing access in a few places. I've commented in the code where I'm not synchronizing access and why I think it's OK.
I'm pretty sure it's still a problem (and actually a bigger one since there we're polling all nodes instead of one). But I'll put that into a separate PR.
Thanks for working on this, Keith. I left a few comments and questions.
* This wraps the responseConsumer in a Consumer that will run rescheduleFetchConsumer() after responseConsumer has
* This wraps the responseConsumer in a Consumer that will run fetchClusterFormationInfo() after responseConsumer has
I don't think this comment is accurate:

`responseConsumer.andThen(rescheduleClusterFormationFetchConsumer)`

I think it will run the `rescheduleClusterFormationFetchConsumer`.
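For reference, `Consumer.andThen` returns a composed consumer that runs the original consumer first and the chained one second, so the rescheduling step runs as part of the composed consumer rather than being invoked by the response handler itself. A tiny standalone demo:

```java
import java.util.function.Consumer;

class AndThenDemo {
    public static void main(String[] args) {
        Consumer<String> responseConsumer = r -> System.out.println("handle " + r);
        Consumer<String> reschedule = r -> System.out.println("reschedule after " + r);
        // The composed consumer runs responseConsumer, then reschedule.
        responseConsumer.andThen(reschedule).accept("response");
        // prints:
        // handle response
        // reschedule after response
    }
}
```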
synchronized (remoteDiagnosticsMutex) {
if (remoteCoordinationDiagnosticsCancelled.get()) { // Don't start a 2nd one if the last hasn't been cancelled
The synchronized block can be avoided here by doing a `compareAndSet` on the `AtomicBoolean` - I renamed `remoteCoordinationDiagnosticsCancelled` to `isRemoteFetchRunning`, but some better naming is needed here :)
synchronized (remoteDiagnosticsMutex) {
if (remoteCoordinationDiagnosticsCancelled.get()) { // Don't start a 2nd one if the last hasn't been cancelled
if (isRemoteFetchRunning.compareAndSet(false, true)) { // Don't start a 2nd one if the last is still running
Is there a need for both `remoteCoordinationDiagnosticsCancelled` and `isCancelled`? They seem to be the same thing - unless I'm missing something. Otherwise could we document a bit better what each does?
I'll add more comments. `remoteCoordinationDiagnosticsCancelled` is a reference to the one for the current/most recent poll. It is kept as an object-level variable so that `cancelPollingRemoteStableMasterHealthIndicatorService()` can cancel it. `isCancelled` is a reference to the one for a specific poll, which is why it's kept on the stack. So if we get to a point where there are two polls running at once, the older one will check its copy of `isCancelled` and see that `cancelPollingRemoteStableMasterHealthIndicatorService()` has cancelled it (even if the object-level `remoteCoordinationDiagnosticsCancelled` has since been replaced and its current value is `false`).
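A minimal sketch of that two-flag arrangement (hypothetical shape, not the PR's exact code): each poll allocates its own flag on the stack, and the object-level field always points at the most recent poll's flag:

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TwoFlagPoller {
    // Always points at the flag of the current/most recent poll, so
    // cancelPolling() can reach it.
    private volatile AtomicBoolean remoteCoordinationDiagnosticsCancelled;

    void beginPolling() {
        // Stack-local flag for this specific poll; an older poll keeps
        // checking its own copy even after the field is replaced.
        final AtomicBoolean isCancelled = new AtomicBoolean(false);
        remoteCoordinationDiagnosticsCancelled = isCancelled;
        pollLoop(isCancelled);
    }

    void cancelPolling() {
        AtomicBoolean current = remoteCoordinationDiagnosticsCancelled;
        if (current != null) {
            current.set(true);
        }
    }

    private void pollLoop(AtomicBoolean isCancelled) {
        if (isCancelled.get()) {
            return; // this particular poll was cancelled; stop rescheduling
        }
        // ... fetch diagnostics, then reschedule pollLoop(isCancelled) ...
    }
}
```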
> The synchronized block can be avoided here by doing a `compareAndSet` on the `AtomicBoolean` - I renamed `remoteCoordinationDiagnosticsCancelled` to `isRemoteFetchRunning`, but some better naming is needed here :)
The reason for the mutex is to protect the assignment of 3 variables atomically -- `remoteStableMasterHealthIndicatorTask`, `remoteCoordinationDiagnosisResult`, and `remoteCoordinationDiagnosticsCancelled` -- all while knowing that `remoteCoordinationDiagnosticsCancelled` is `true`. I don't think your suggestion covers that, does it?
These assignments happen inside the proposed CAS operation, i.e.:

`if (isRemoteFetchRunning.compareAndSet(false, true)) {`

The one thread that gets to flip this boolean will be granted access to the contents of this if-block. I believe it would be covered.

EDIT: I think I see what you mean - an interleaving `cancelPollingRemoteStableMasterHealthIndicatorService` call could override some of the assignments.
synchronized (remoteDiagnosticsMutex) {
    if (isCancelled.get()) {
Similar to above - since we have the `Atomic*` objects, CAS operations can provide the "execute this bit one time if the value is what I want it to be" semantics. I don't think we need a mutex (unless I'm missing something?)
I think what you're missing is that it's not just protecting the read of one variable -- it's protecting the read and assignment of 3 variables: `remoteStableMasterHealthIndicatorTask`, `remoteCoordinationDiagnosisResult`, and `remoteCoordinationDiagnosticsCancelled`.
long startTime = System.nanoTime();
connectionListener.whenComplete(releasable -> {
    if (isCancelled.get()) {
        IOUtils.close(releasable);
Would this throw `IOException`?
It could. But previously we were using `Releasables.close()`, which also can throw an exception (it catches the `IOException` and rethrows it as an `UncheckedIOException` so that it doesn't cause compilation errors with the streams API), but the result is the same. In this block I think it won't matter because this poll was cancelled anyway, so we don't need to make sure to reschedule anything. In the block below where `Releasables.close()` is used, I believe the exception will be caught by `fetchCoordinationDiagnosticsListener`'s exception handler, and the task will be rescheduled.
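For context, a generic sketch of that rethrow pattern (plain Java; not the actual `Releasables.close()` implementation):

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

final class CloseUtil {
    // Close a resource, converting the checked IOException into an
    // unchecked one so callers (e.g. inside stream pipelines) need no
    // throws clause; the failure still propagates to the caller.
    static void closeUnchecked(Closeable resource) {
        try {
            resource.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```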
private void cancelPollingRemoteStableMasterHealthIndicatorService() {
    synchronized (remoteDiagnosticsMutex) {
        if (remoteStableMasterHealthIndicatorTask != null && remoteCoordinationDiagnosticsCancelled.getAndSet(true) == false) {
> `remoteCoordinationDiagnosticsCancelled.getAndSet(true) == false`

Is `compareAndSet` a better option here?
I don't think so, for the same reasons given above.
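(For what it's worth, the two forms are interchangeable as a one-shot guard on an `AtomicBoolean` -- both admit exactly the one caller that observes `false` -- so the mutex question is the separate issue here:)

```java
import java.util.concurrent.atomic.AtomicBoolean;

class GuardEquivalence {
    public static void main(String[] args) {
        AtomicBoolean a = new AtomicBoolean(false);
        AtomicBoolean b = new AtomicBoolean(false);
        System.out.println(a.getAndSet(true) == false);   // true: first caller
        System.out.println(b.compareAndSet(false, true)); // true: first caller
        System.out.println(a.getAndSet(true) == false);   // false thereafter
        System.out.println(b.compareAndSet(false, true)); // false thereafter
    }
}
```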
List<DiscoveryNode> masterEligibleNodes = new ArrayList<>(getMasterEligibleNodes());
if (masterEligibleNodes.isEmpty()) {
    cancelPollingRemoteStableMasterHealthIndicatorService();
} else {
    beginPollingRemoteStableMasterHealthIndicatorService(masterEligibleNodes);
}
}
I'm wondering: are these the correct start/stop triggers for this polling? Even further, for the cancel... trigger - is the last call we receive for `onFoundPeersUpdated` when discovery ends always coinciding with an empty list of `masterEligibleNodes`, so as to satisfy the cancel condition `if (masterEligibleNodes.isEmpty())`?
Would it be simpler to start the polling of one random master-eligible node (as indicated here) from the cluster state change event notification that indicates we lost the master node? And stop the polling when we've got a master?
(similar to how we do it for the cluster formation fetch/cancel, but in the case when the local node is not master eligible)
The `onFoundPeersUpdated` could be used to keep track of the discovered peers in a `CopyOnWriteArrayList`, and the polling track could pick a random master-eligible node from this list to poll. (Just one would be enough I think - based on the reference diagram - as long as one is reachable; when it is not, we can pick another one.)
Would this simplify the design?
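A rough sketch of that alternative (hypothetical names apart from `DiscoveryNode`; the listener signature is assumed here, not the actual `PeerFinder` API):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ThreadLocalRandom;

import org.elasticsearch.cluster.node.DiscoveryNode;

class PeerTrackingPoller {
    private final CopyOnWriteArrayList<DiscoveryNode> masterEligiblePeers = new CopyOnWriteArrayList<>();

    // Hypothetical hook: called whenever discovery updates the peer set.
    // clear-then-add is not atomic, but the polling side only needs a
    // roughly current snapshot.
    void onFoundPeersUpdated(List<DiscoveryNode> peers) {
        masterEligiblePeers.clear();
        peers.stream().filter(DiscoveryNode::isMasterNode).forEach(masterEligiblePeers::add);
    }

    // Called by the polling track every 10 seconds.
    void pollTick() {
        List<DiscoveryNode> snapshot = List.copyOf(masterEligiblePeers);
        if (snapshot.isEmpty()) {
            return; // nothing discovered yet; just try again on the next tick
        }
        DiscoveryNode target = snapshot.get(ThreadLocalRandom.current().nextInt(snapshot.size()));
        // ... send the diagnostics request to 'target' ...
    }
}
```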
The problem is that non-master-eligible nodes do not have any peers in PeerFinder at the time that `clusterChanged` is called with a notification that the master is null. PeerFinder's peers list is populated at some point after that in another thread. I had originally done what you say here, but realized in an integration test (in an upcoming PR) that I would never get any remote diagnostics back because the list of master-eligible nodes was always empty, so the polling would stop immediately.
Wouldn't the polling retry 10 seconds later? I.e. if there's nothing in the list of discovered peers, we should just retry 10 seconds later and pick a random node from the list of peers at that point... if any are available - otherwise, maybe on the next execution 10 seconds later...

> PeerFinder's peers list is populated at some point after that in another thread.

The polling will eventually read a populated list. We could leave the polling track to attempt to read the master-eligible nodes using `getMasterEligibleNodes` (I guess preferable), or continue to use `onFoundPeersUpdated` to record locally a list of peers and make this local list available to the polling track.
Closing this in favor of #89014
As shown in step 1.2.2.3 in the diagram at #85624 (comment), if there has not been a node elected master for more than 30 seconds, then when a user queries a non-master-eligible node it is supposed to reach out to a master-eligible node to gather information about what that node thinks of master stability. Since we don't want to block a user request while making those remote calls, it should be done speculatively so that we have the information when it is needed.

If a non-master-eligible node gets a cluster state changed event that the master is null, then it will wait 10 seconds (in case it gets another cluster state changed event that the master has flipped to non-null quickly, which is the most likely scenario), then reach out to a random master-eligible node using a transport action (#87984) to get its cluster diagnostics data. This data is stashed away in case a user request comes in. Since the user won't hit this path for 30 seconds after a master has become null (because we tell the user they have a stable master unless it's been missing for more than 30 seconds), we have plenty of time for this: 10 seconds of delay in case we're notified that there is a new master, plus 10 seconds to get the result back from the master-eligible node before timing out, plus 10 seconds to spare.
Note that the pattern used in this PR for doing asynchronous polling is very similar to what was done in #88397. The main differences are that we're polling a single master node here instead of all of them, and we're using a different transport action.
Also note that there will be a follow-up PR that actually uses this information to make decisions about master stability. This PR only keeps it for later use.
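As a sketch of the stash-for-later-use shape described above (all names hypothetical; the follow-up PR is what actually consumes the value), the poller writes its latest response into an `AtomicReference` that the health-indicator path can read without blocking:

```java
import java.util.concurrent.atomic.AtomicReference;

class RemoteDiagnosticsCache {
    // Hypothetical result type standing in for the transport action's response.
    record RemoteDiagnosticsResult(long timestampNanos, String summary) {}

    private final AtomicReference<RemoteDiagnosticsResult> latest = new AtomicReference<>();

    // Called by the polling track each time a master-eligible node responds.
    void onResponse(RemoteDiagnosticsResult result) {
        latest.set(result);
    }

    // Called on the user-request path; never blocks on a remote call.
    RemoteDiagnosticsResult mostRecent() {
        return latest.get();
    }
}
```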