Simplify and Fix Synchronization in InternalTestCluster #39168
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
Pinging @elastic/es-core-infra
@DaveCTurner found a bunch more possible races here than the one that hit #39118, not sure if you want to review this or I should find someone else? :)
Thanks for looking into this @original-brownbear.
I am inclined to think that making `nodes` immutable and creating a new map when adding/removing from the map is a better approach. Especially `getClients()` is problematic, but also the way we currently rely on the `InternalTestCluster` monitor ensuring `nodes` is not changed when reading from `nodes`. Having it as an immutable map makes it easy and cheap to ensure consistent snapshots for the map. Also, it is then obvious that anything accessing `nodes` will never block.
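A minimal sketch of the copy-on-write idea described in this comment, not the actual `InternalTestCluster` code. The class name `CopyOnWriteNodesSketch`, its methods, and the `String` values standing in for node objects are all hypothetical. Writers build a new map under the monitor and publish it through a `volatile` field; readers take the current reference without locking.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the cluster's node bookkeeping (assumption, not the real class).
final class CopyOnWriteNodesSketch {

    // Readers only ever observe a fully built, unmodifiable map.
    private volatile Map<String, String> nodes = Collections.emptyMap();

    // Writers are serialized on the monitor; each one publishes a fresh map.
    synchronized void addNode(String name, String value) {
        Map<String, String> copy = new TreeMap<>(nodes);
        copy.put(name, value);
        nodes = Collections.unmodifiableMap(copy);
    }

    synchronized void removeNode(String name) {
        Map<String, String> copy = new TreeMap<>(nodes);
        copy.remove(name);
        nodes = Collections.unmodifiableMap(copy);
    }

    // No lock needed: a published map is never modified afterwards.
    Map<String, String> snapshot() {
        return nodes;
    }

    public static void main(String[] args) {
        CopyOnWriteNodesSketch cluster = new CopyOnWriteNodesSketch();
        cluster.addNode("node_t0", "data");
        Map<String, String> before = cluster.snapshot();
        cluster.addNode("node_t1", "master");
        // The earlier snapshot is unaffected by the later mutation.
        System.out.println(before.keySet());
        System.out.println(cluster.snapshot().keySet());
    }
}
```

The trade-off is an O(n) copy per mutation, which seems acceptable for a test cluster with a handful of nodes that changes rarely and is read often.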
@@ -2245,7 +2240,7 @@ public void clearDisruptionScheme() {
         clearDisruptionScheme(true);
     }
 
-    public void clearDisruptionScheme(boolean ensureHealthyCluster) {
+    public synchronized void clearDisruptionScheme(boolean ensureHealthyCluster) {
This method can wait for a healthy cluster. I think having the entire method synchronized while waiting for the cluster to become healthy could potentially lead to deadlocks?
Good question. I couldn't find a possible deadlock in the implementations of our disruptions from a quick look over them.
My thinking would maybe be this:
If we don't synchronize here, we allow manipulating the cluster while we "wait for healthy", which could lead to some pretty hard-to-debug issues. Also, we really don't want to manipulate anything about the cluster while this method is in progress.
=> If we create some unforeseen deadlock here, I'd probably rather try to fix the implementation of the disruption to prevent the deadlock than allow concurrent modification of the cluster while we clear the disruption?
If there is any chance of manipulations during this phase, I would rather guard against manipulating the cluster explicitly by adding an intermediate closing (or stopped) state.
Do we not risk something like what you described in #39118? If the disruption prevented the result from returning, the callback could be called at this time. If that in turn calls any of the `synchronized` methods, it could potentially deadlock if we have to create a new connection while becoming healthy?
At a minimum I think we should add a comment explaining why the `synchronized` is there.
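One possible reading of the "intermediate state" suggestion, as a hedged sketch rather than anything from the PR: flip a state flag briefly under the monitor, do the slow wait without holding the monitor, and have mutating methods reject calls while the flag is set. The class `GuardedStateSketch`, the `State` enum, and the `waitForHealthy` parameter are hypothetical names.

```java
// Hypothetical sketch: an explicit state guards mutations instead of a long-held monitor.
final class GuardedStateSketch {

    private enum State { OPEN, BUSY }

    private State state = State.OPEN; // guarded by this
    private int nodeCount;            // guarded by this

    synchronized void startNode() {
        if (state != State.OPEN) {
            throw new IllegalStateException("cluster is " + state);
        }
        nodeCount++;
    }

    void clearDisruption(Runnable waitForHealthy) {
        synchronized (this) {
            if (state != State.OPEN) {
                throw new IllegalStateException("cluster is " + state);
            }
            state = State.BUSY;
        }
        try {
            // The slow part runs without the monitor, so callbacks on other
            // threads fail fast in startNode() instead of blocking on the lock.
            waitForHealthy.run();
        } finally {
            synchronized (this) {
                state = State.OPEN;
            }
        }
    }

    synchronized int nodeCount() {
        return nodeCount;
    }
}
```

Compared with making the whole method `synchronized`, a concurrent caller gets an immediate exception rather than waiting for the monitor, which is easier to diagnose than a hang.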
I added a comment for now. But I'm starting to think we're attacking this from the wrong angle to some degree. It seems like methods like this one (and a few others we have now discussed) are currently only called from the main JUnit thread. Instead of worrying endlessly about how we synchronize things like clearing the disruption while closing, why not just assert that we're on the main JUnit thread and simply not allow manipulating the cluster from elsewhere? We currently don't seem to be manipulating the cluster from other threads, and I don't see a good reason to start doing that kind of thing either (and if someone needs it down the line, they're free to add it as needed).
IMO, that would make calls to e.g. `InternalTestCluster#restartRandomDataNode(org.elasticsearch.test.InternalTestCluster.RestartCallback)` a lot easier to follow/debug.
WDYT?
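A minimal sketch of the "assert we're on the owning thread" alternative floated here, assuming the cluster object is constructed on the main test thread. `OwnerThreadSketch` and its methods are hypothetical and only illustrate the check, not the real restart logic.

```java
// Hypothetical sketch: only the thread that created the object may mutate it.
final class OwnerThreadSketch {

    // Assumption: the constructor runs on the main test thread.
    private final Thread owner = Thread.currentThread();
    private int restarts;

    void restartRandomDataNode() {
        ensureOwnerThread();
        restarts++; // no lock needed, single-threaded by construction
    }

    int restarts() {
        return restarts;
    }

    private void ensureOwnerThread() {
        Thread current = Thread.currentThread();
        if (current != owner) {
            throw new IllegalStateException(
                "cluster manipulated from [" + current.getName() + "], expected [" + owner.getName() + "]");
        }
    }
}
```

This trades flexibility for simplicity: no mutating method needs synchronization, but a test that wants to restart nodes from a background thread would have to relax the check first.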
I also did not find any specific places where we deliberately manipulate the cluster in other threads (though I have not done an exhaustive search). However, it is not obvious that calling, for instance, `client()` could be invalid on a thread (if the client or even the node is lazily created, implicitly manipulating the cluster)? Also, I wonder if disruptive restart tests could be good to add, and whether that would then be harder since all changes have to be done on the main thread. I think the code is now much clearer with this PR and would prefer to leave it with `synchronized` in place.
@henningandersen thanks for the thorough review! Moved to an immutable
`getClients` still has an issue (also prior to this PR) in that when you call `next()` it also calls `NodeAndClient.client()`, which lazily builds the client. I think this mandates solving that part of the problem using an explicit monitor (does it have to be the `InternalTestCluster` monitor to be safe wrt closing?) in `NodeAndClient`?
In turn I think this would remove the need for several of the client accessor methods to be synchronized (including `smartClient`).
Also (nit), `getClients()` does not need to be synchronized.
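A rough sketch of lazy client creation guarded by the holder's own monitor, as this comment proposes. `LazyClientSketch` is a hypothetical generic stand-in for `NodeAndClient`; whether a per-holder monitor is sufficient with respect to closing the whole cluster is exactly the open question raised above, so the `closed` flag here is only an assumption about how that could be handled.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a node holder that builds its client on first use.
final class LazyClientSketch<C> {

    private final Supplier<C> factory;
    private C client;       // guarded by this
    private boolean closed; // guarded by this

    LazyClientSketch(Supplier<C> factory) {
        this.factory = factory;
    }

    // Safe to call from any thread, e.g. while iterating over clients.
    synchronized C client() {
        if (closed) {
            throw new IllegalStateException("already closed");
        }
        if (client == null) {
            client = factory.get(); // built at most once
        }
        return client;
    }

    synchronized void close() {
        closed = true;
        client = null;
    }
}
```

With the lazy initialization made safe inside the holder, callers that only iterate and fetch clients would not need the outer cluster monitor for that purpose.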
I think for now let's sync on the
🎉 true :)
Not really I think because we're still on the same thread. The
@henningandersen Thanks for all the finds! All points addressed again I think :)
Thanks @original-brownbear, I left just a few comments; otherwise looking good.
@henningandersen thanks! all addressed I think :)
LGTM.
Thanks @original-brownbear
@henningandersen thanks for the great review :)
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Cleanup some stream usage
* Make unsafe public methods `synchronized`
* Closes elastic#37965
* Closes elastic#37275
* Closes elastic#37345
* elastic/master: Ensure index commit released when testing timeouts (elastic#39273) Avoid using TimeWarp in TransformIntegrationTests. (elastic#39277) Fixed missed stopping of SchedulerEngine (elastic#39193) [CI] Mute CcrRetentionLeaseIT.testRetentionLeaseIsRenewedDuringRecovery (elastic#39269) Muting AutoFollowIT.testAutoFollowManyIndices (elastic#39264) Clarify the use of sleep in CCR test Fix testCannotShrinkLeaderIndex (elastic#38529) Fix CCR tests that manipulate transport requests Align generated release notes with doc standards (elastic#39234) Mute test (elastic#39248) ReadOnlyEngine should update translog recovery state information (elastic#39238) Wrap accounting breaker check in assertBusy (elastic#39211) Simplify and Fix Synchronization in InternalTestCluster (elastic#39168) [Tests] Make testEngineGCDeletesSetting deterministic (elastic#38942) Extend nextDoc to delegate to the wrapped doc-value iterator for date_nanos (elastic#39176) Change ShardFollowTask to reuse common serialization logic (elastic#39094) Replace superfluous usage of Counter with Supplier (elastic#39048) Disable bwc tests for elastic#39094
Should this be backported to 7.0 (and possibly earlier) to avoid test failures?
@ywelsch yea definitely! (sorry for forgetting that), I'll back port to 7.0 and will look into how tricky it is to get this into
back ported to 7.0 in #40013
@ywelsch back porting this to
* Remove unnecessary `synchronized` statements
* Make `Predicate`s constants where possible
* Make unsafe public methods `synchronized`
* […] `final` where possible
* […] `nodes` (we were using it in the unicast hosts file builder without any sync!)
* […] `nodes` synchronized as the docs claim already
* […] `this` or make a copy where possible to avoid the lock on `this`