
DiskThresholdDeciderIT.testHighWatermarkNotExceeded failure #62326

Closed
romseygeek opened this issue Sep 14, 2020 · 6 comments · Fixed by #62358, #63112 or #63614
Assignees
Labels
:Distributed Coordination/Allocation - All issues relating to the decision making around placing a shard (both master logic & on the nodes)
Team:Distributed (Obsolete) - Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
>test-failure - Triaged test failures from CI

Comments

@romseygeek
Contributor

Build scan:
https://gradle-enterprise.elastic.co/s/okz2lziucuxkq/tests/:server:internalClusterTest/org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT/testHighWatermarkNotExceeded

Repro line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testHighWatermarkNotExceeded" -Dtests.seed=60A6AF6A936EF834 -Dtests.security.manager=true -Dtests.locale=ro-RO -Dtests.timezone=America/Pangnirtung -Druntime.java=11

Reproduces locally?: no

Applicable branches: master

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?search.buildToolTypes=gradle&search.buildToolTypes=maven&search.relativeStartTime=P7D&search.timeZoneId=Europe/London&tests.container=org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT&tests.sortField=FAILED&tests.test=testHighWatermarkNotExceeded&tests.unstableOnly=true

Failure excerpt:

java.lang.AssertionError:
Expected: a collection with size <1>
     but: collection size was <0>

at __randomizedtesting.SeedInfo.seed([60A6AF6A936EF834:89874ED813A831DA]:0)
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
at org.junit.Assert.assertThat(Assert.java:956)
at org.junit.Assert.assertThat(Assert.java:923)
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testHighWatermarkNotExceeded(DiskThresholdDeciderIT.java:160)
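For context, the failure message comes from a Hamcrest collection-size assertion. Below is a minimal, self-contained sketch of an assertion with that shape; the class and variable names are hypothetical and are not taken from DiskThresholdDeciderIT, they only illustrate the kind of check at line 160 (the test expects exactly one shard on the target node but finds none):

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.hasSize;

import java.util.Collections;
import java.util.List;

// Hypothetical sketch: asserting hasSize(1) on an empty collection produces
// "Expected: a collection with size <1> ... but: collection size was <0>".
public class HighWatermarkAssertionSketch {
    public static void main(String[] args) {
        List<String> shardsOnTargetNode = Collections.emptyList(); // no shard was relocated
        assertThat(shardsOnTargetNode, hasSize(1)); // throws the AssertionError seen above
    }
}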

@romseygeek added the >test-failure and :Distributed Coordination/Allocation labels on Sep 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Allocation)

@elasticmachine added the Team:Distributed (Obsolete) label on Sep 14, 2020
@original-brownbear self-assigned this on Sep 14, 2020
@original-brownbear
Member

This reproduces pretty easily for me locally over a few tens of runs. I'll look into a fix tomorrow.

@droberts195
Contributor

The test failed in the same way on 7.x (https://gradle-enterprise.elastic.co/s/x4m3oyeudq7ck), on a commit that is more recent than the fix.

   java.lang.AssertionError:
   Expected: a collection with size <1>
   but: collection size was <0>

@danielmitterdorfer
Member

tlrx added a commit that referenced this issue Oct 7, 2020
The first refreshDiskUsage() triggers a ClusterInfo update which in turn
notifies listeners such as DiskThresholdMonitor. That monitor triggers a
reroute as expected and turns on an internal checkInProgress flag before
submitting a cluster state update to relocate shards (the internal flag is
toggled back once the cluster state update has been processed).

In the test I suspect that the second refreshDiskUsage() may complete
before DiskThresholdMonitor's internal flag is set back to its initial state,
resulting in the second ClusterInfo update being ignored and a message
like "[node_t0] skipping monitor as a check is already in progress" being
logged. Adding another wait for languid events to be processed
before executing the second refreshDiskUsage() should help here.

Closes #62326
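To make the suspected race easier to follow, here is a minimal sketch of the guard described above. It illustrates the checkInProgress pattern only and is not the actual DiskThresholdMonitor source:

import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the guard described in the commit message: a new
// ClusterInfo is ignored while a previous check is still in flight, and the
// flag is only cleared once the submitted cluster state update has been
// processed. If the second refreshDiskUsage() delivers its ClusterInfo before
// checkFinished() runs, that update is skipped and the test never observes
// the expected relocation.
class DiskCheckGuardSketch {
    private final AtomicBoolean checkInProgress = new AtomicBoolean();

    void onNewInfo(Object clusterInfo) {
        if (checkInProgress.compareAndSet(false, true) == false) {
            return; // corresponds to "skipping monitor as a check is already in progress"
        }
        submitRerouteClusterStateUpdate(clusterInfo, this::checkFinished);
    }

    private void checkFinished() {
        checkInProgress.set(false); // toggled back once the cluster state update is processed
    }

    private void submitRerouteClusterStateUpdate(Object clusterInfo, Runnable onProcessed) {
        onProcessed.run(); // placeholder: submit the reroute and call back once it has been applied
    }
}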
tlrx added a commit to tlrx/elasticsearch that referenced this issue Oct 7, 2020
The first refreshDiskUsage() triggers a ClusterInfo update which in turn
notifies listeners such as DiskThresholdMonitor. That monitor triggers a
reroute as expected and turns on an internal checkInProgress flag before
submitting a cluster state update to relocate shards (the internal flag is
toggled back once the cluster state update has been processed).

In the test I suspect that the second refreshDiskUsage() may complete
before DiskThresholdMonitor's internal flag is set back to its initial state,
resulting in the second ClusterInfo update being ignored and a message
like "[node_t0] skipping monitor as a check is already in progress" being
logged. Adding another wait for languid events to be processed
before executing the second refreshDiskUsage() should help here.

Closes elastic#62326
tlrx added a commit that referenced this issue Oct 7, 2020

The first refreshDiskUsage() triggers a ClusterInfo update which in turn
notifies listeners such as DiskThresholdMonitor. That monitor triggers a
reroute as expected and turns on an internal checkInProgress flag before
submitting a cluster state update to relocate shards (the internal flag is
toggled back once the cluster state update has been processed).

In the test I suspect that the second refreshDiskUsage() may complete
before DiskThresholdMonitor's internal flag is set back to its initial state,
resulting in the second ClusterInfo update being ignored and a message
like "[node_t0] skipping monitor as a check is already in progress" being
logged. Adding another wait for languid events to be processed
before executing the second refreshDiskUsage() should help here.

Closes #62326
@cbuescher
Member

Reopening since we face similar issues again on master and 7.x:
https://gradle-enterprise.elastic.co/s/dvlwxfzwu4yzc
https://gradle-enterprise.elastic.co/s/hcrfpa4y2h2io

Will mute on master, 7.x and 7.10

@cbuescher reopened this on Oct 8, 2020
@cbuescher
Member

Muted with a615845, 0db9dd1 and 517d3e4
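For reference, muting a test in the Elasticsearch codebase is usually done by annotating it with @AwaitsFix pointing back at the issue. A sketch of what such mute commits typically add (the exact commits above may differ in detail):

import org.apache.lucene.util.LuceneTestCase.AwaitsFix;

// Typical shape of a test mute: the annotation makes the test runner skip the
// test until the linked issue is resolved and the annotation is removed.
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/62326")
public void testHighWatermarkNotExceeded() throws Exception {
    // unchanged test body
}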

tlrx added a commit that referenced this issue Oct 14, 2020
This is another attempt to fix #62326 as my previous 
attempts failed (#63112, #63385).
tlrx added a commit to tlrx/elasticsearch that referenced this issue Oct 14, 2020
This is another attempt to fix elastic#62326 as my previous 
attempts failed (elastic#63112, elastic#63385).
tlrx added a commit to tlrx/elasticsearch that referenced this issue Oct 14, 2020
This is another attempt to fix elastic#62326 as my previous 
attempts failed (elastic#63112, elastic#63385).
tlrx added a commit that referenced this issue Oct 16, 2020
This is another attempt to fix #62326 as my previous 
attempts failed (#63112, #63385).
tlrx added a commit that referenced this issue Oct 16, 2020
This is another attempt to fix #62326 as my previous 
attempts failed (#63112, #63385).