DiskThresholdDeciderIT.testHighWatermarkNotExceeded failure #62326
Comments
Pinging @elastic/es-distributed (:Distributed/Allocation)
This reproduces pretty easily for me locally over a few tens of runs. I'll look into a fix tomorrow.
The test failed in the same way in 7.x in https://gradle-enterprise.elastic.co/s/x4m3oyeudq7ck, in a commit that is more recent than the fix.
Muted via
The first refreshDiskUsage() triggers a ClusterInfo update, which in turn notifies listeners such as DiskThresholdMonitor. That listener triggers a reroute as expected and sets an internal checkInProgress flag before submitting a cluster state update to relocate shards (the flag is only toggled back once the cluster state update has been processed). In the test I suspect that the second refreshDiskUsage() may complete before DiskThresholdMonitor's internal flag is reset to its initial state, causing the second ClusterInfo update to be ignored and a message like "[node_t0] skipping monitor as a check is already in progress" to be logged. Adding another wait for LANGUID-priority cluster state events to be processed before executing the second refreshDiskUsage() should help here. Closes #62326
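To make the race concrete, here is a minimal, self-contained Java sketch of the check-in-progress guard described above. This is not the actual Elasticsearch source; the class and method names below are illustrative only. The point it shows is that a ClusterInfo notification arriving while a previous check is still running is dropped, which is what the skipped-monitor log line reflects.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Illustrative sketch of a "check in progress" guard; not the real DiskThresholdMonitor code.
public class CheckInProgressGuard {

    private final AtomicBoolean checkInProgress = new AtomicBoolean();

    /**
     * Called for every ClusterInfo refresh. The supplied action is asynchronous and must
     * invoke the completion callback once the cluster state update has been processed.
     */
    public void onNewInfo(Consumer<Runnable> submitClusterStateUpdate) {
        if (checkInProgress.compareAndSet(false, true) == false) {
            // Corresponds to the "skipping monitor as a check is already in progress" log line.
            System.out.println("skipping monitor as a check is already in progress");
            return;
        }
        // The flag is only cleared once the asynchronous update completes, which is the
        // window the second refreshDiskUsage() in the test can race against.
        submitClusterStateUpdate.accept(this::checkFinished);
    }

    private void checkFinished() {
        checkInProgress.set(false);
    }

    public static void main(String[] args) {
        CheckInProgressGuard guard = new CheckInProgressGuard();
        // First ClusterInfo update: hold on to the completion callback instead of running it,
        // simulating a cluster state update that has not completed yet.
        Runnable[] pendingCompletion = new Runnable[1];
        guard.onNewInfo(done -> pendingCompletion[0] = done);
        // Second ClusterInfo update arrives before the first check finished: it is skipped.
        guard.onNewInfo(done -> done.run());
        // Once the first cluster state update completes, the flag is cleared again.
        pendingCompletion[0].run();
    }
}
```

Under this reading, the fix described above amounts to draining pending cluster state tasks (the LANGUID-priority wait) between the two refreshDiskUsage() calls, so that the flag has already been reset by the time the second ClusterInfo update arrives.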
Reopening since we face similar issues again on master and 7.x. Will mute on master, 7.x and 7.10.
This is another attempt to fix elastic#62326 as my previous attempts failed (elastic#63112, elastic#63385).
Build scan:
https://gradle-enterprise.elastic.co/s/okz2lziucuxkq/tests/:server:internalClusterTest/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT/testHighWatermarkNotExceeded
Repro line:
Reproduces locally?: no
Applicable branches: master
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?search.buildToolTypes=gradle&search.buildToolTypes=maven&search.relativeStartTime=P7D&search.timeZoneId=Europe/London&tests.container=org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT&tests.sortField=FAILED&tests.test=testHighWatermarkNotExceeded&tests.unstableOnly=true
Failure excerpt: