[CI] DesiredBalanceComputerTests testDesiredBalanceShouldConvergeInABigCluster failing #104343

Closed
mark-vieira opened this issue Jan 12, 2024 · 8 comments · Fixed by #104442 or #108611
Labels
- :Distributed Coordination/Allocation: All issues relating to the decision making around placing a shard (both master logic & on the nodes)
- low-risk: An open issue or test failure that is a low risk to future releases
- Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
- >test-failure: Triaged test failures from CI

Comments

@mark-vieira
Contributor

Reproduced locally for me.

Build scan:
https://gradle-enterprise.elastic.co/s/vfria3oy5zgew/tests/:server:test/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests/testDesiredBalanceShouldConvergeInABigCluster

Reproduction line:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests.testDesiredBalanceShouldConvergeInABigCluster" -Dtests.seed=4C0C5812E11E17C5 -Dtests.locale=it -Dtests.timezone=Atlantic/Reykjavik -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests#testDesiredBalanceShouldConvergeInABigCluster

Failure excerpt:

java.lang.AssertionError: All desired disk usages {node-3=33945040162, node-0=31673524702, node-1=42826606596, node-2=33757630192} should be smaller then actual disk sizes: 42660840495
Expected: every item is a value less than or equal to <42660840495L>
     but: an item <42826606596L> was greater than <42660840495L>

  at __randomizedtesting.SeedInfo.seed([4C0C5812E11E17C5:98A11F7C3AAEDA41]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:964)
  at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests.testDesiredBalanceShouldConvergeInABigCluster(DesiredBalanceComputerTests.java:725)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)
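
The numbers in the failure excerpt above can be checked directly. The following is a minimal sketch, not code from the Elasticsearch test: the class and method names are hypothetical, and only the byte values are copied from the assertion message. It lists every node whose desired disk usage exceeds the disk size and by how many bytes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of Elasticsearch.
final class DesiredDiskUsageCheck {

    // For each node whose desired usage exceeds the disk size,
    // return the overshoot in bytes. Nodes that fit are omitted.
    static Map<String, Long> overshoots(Map<String, Long> desiredUsageBytes, long diskSizeBytes) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : desiredUsageBytes.entrySet()) {
            long excess = entry.getValue() - diskSizeBytes;
            if (excess > 0) {
                result.put(entry.getKey(), excess);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Values copied from the assertion message in the failure excerpt.
        Map<String, Long> desired = new LinkedHashMap<>();
        desired.put("node-3", 33945040162L);
        desired.put("node-0", 31673524702L);
        desired.put("node-1", 42826606596L);
        desired.put("node-2", 33757630192L);

        // prints {node-1=165766101}
        System.out.println(overshoots(desired, 42660840495L));
    }
}
```

Only node-1 violates the limit, by 165766101 bytes (roughly 158 MiB, or about 0.4% of the disk size).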

mark-vieira added the :Distributed Coordination/Allocation and >test-failure labels Jan 12, 2024
elasticsearchmachine added the blocker and Team:Distributed (Obsolete) labels Jan 12, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

mark-vieira added a commit that referenced this issue Jan 12, 2024
@mark-vieira
Contributor Author

Muted given that it reproduces.

mark-vieira added a commit that referenced this issue Jan 12, 2024
mark-vieira added a commit that referenced this issue Jan 12, 2024
ywangd self-assigned this Jan 15, 2024
@ywangd
Member

ywangd commented Jan 15, 2024

@idegtiarenko I spent some time studying this test failure as an opportunity to learn a bit more about desired balance. IIUC, the test failed because the desired balance assigned shards to a node whose total size exceeds that node's disk size. I tried to trace why this was the case. Since the failure is reproducible, I was able to determine that node-1's disk space is exceeded when [index-1][7] is assigned to it.

It seems to me that the ModelNode instances of the Balancer object track the expected disk usage (diskUsageInBytes) as shards are allocated to them. However, DiskThresholdDecider does not consult this information and instead always uses allocation.clusterInfo().getNodeMostAvailableDiskUsages(), which shows abundant disk space. Is this intended or a bug? That said, I may have looked in entirely the wrong places or misinterpreted the code, and it might just be a test setup issue. The test failure does go away if I bump cluster.routing.allocation.balance.disk_usage from 2e-11 to 2e-10.

Could you please give me a few hints on how to proceed further? Thanks!
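
The mismatch described in the comment above can be sketched with a toy model. This is an illustration of the hypothesis only, not Elasticsearch code: `ToyNode`, `fitsAccordingToSnapshot`, and `fitsAccordingToTrackedUsage` are hypothetical names, and the sizes are made up. It assumes one check reads a free-space snapshot that is never updated, while the other reads usage that is tracked as shards are assigned.

```java
// Toy model, for illustration only; not part of Elasticsearch.
final class DiskTrackingSketch {

    // Hypothetical stand-in for a node model that tracks expected disk usage.
    static final class ToyNode {
        final long diskSizeBytes;
        final long snapshotFreeBytes; // free space captured once, never updated
        long trackedUsageBytes;       // updated on every simulated assignment

        ToyNode(long diskSizeBytes, long initialUsageBytes) {
            this.diskSizeBytes = diskSizeBytes;
            this.snapshotFreeBytes = diskSizeBytes - initialUsageBytes;
            this.trackedUsageBytes = initialUsageBytes;
        }

        void assign(long shardSizeBytes) {
            trackedUsageBytes += shardSizeBytes;
        }
    }

    // Decision that looks only at the stale snapshot.
    static boolean fitsAccordingToSnapshot(ToyNode node, long shardSizeBytes) {
        return shardSizeBytes <= node.snapshotFreeBytes;
    }

    // Decision that accounts for shards already assigned in this computation.
    static boolean fitsAccordingToTrackedUsage(ToyNode node, long shardSizeBytes) {
        return node.trackedUsageBytes + shardSizeBytes <= node.diskSizeBytes;
    }

    public static void main(String[] args) {
        long shard = 40L;
        ToyNode bySnapshot = new ToyNode(100L, 0L);
        ToyNode byTracked = new ToyNode(100L, 0L);
        for (int i = 0; i < 3; i++) {
            if (fitsAccordingToSnapshot(bySnapshot, shard)) {
                bySnapshot.assign(shard);
            }
            if (fitsAccordingToTrackedUsage(byTracked, shard)) {
                byTracked.assign(shard);
            }
        }
        // Snapshot-based check accepts all three shards: 120 of 100 bytes used.
        System.out.println("snapshot check: " + bySnapshot.trackedUsageBytes + " of " + bySnapshot.diskSizeBytes);
        // Tracked-usage check rejects the third shard: 80 of 100 bytes used.
        System.out.println("tracked check:  " + byTracked.trackedUsageBytes + " of " + byTracked.diskSizeBytes);
    }
}
```

In this toy setup the snapshot-based check overcommits the node, which is the shape of the failure in the assertion message. Whether the real deciders behave this way is exactly the open question in the comment.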

@idegtiarenko
Contributor

Why was it muted in 8.11 and 8.12?

@idegtiarenko
Contributor

The test failure is caused by changes in ClusterInfoSimulator (#102207). This affects 8.13 only; other branches are not affected, and I am going to un-mute them shortly.

The failure itself is a test bug: the test does not set up the initial disk space realistically. I am going to re-label this accordingly.

idegtiarenko added the low-risk label and removed the blocker label Jan 15, 2024
@mark-vieira
Contributor Author

> Why was it muted in 8.11 and 8.12?

It reproduced on those branches for me locally.

@idegtiarenko
Contributor

I was not able to reproduce it locally on those branches.
Do you mind sharing the build scans of the failures for 8.11 and 8.12?
If it does fail there, then there might be other changes causing this failure.
