[CI] DesiredBalanceComputerTests testDesiredBalanceShouldConvergeInABigCluster failing #104343

mark-vieira · 2024-01-12T21:45:18Z

Reproduced locally for me.

Build scan:
https://gradle-enterprise.elastic.co/s/vfria3oy5zgew/tests/:server:test/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests/testDesiredBalanceShouldConvergeInABigCluster

Reproduction line:

./gradlew ':server:test' --tests "org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests.testDesiredBalanceShouldConvergeInABigCluster" -Dtests.seed=4C0C5812E11E17C5 -Dtests.locale=it -Dtests.timezone=Atlantic/Reykjavik -Druntime.java=21

Applicable branches:
main

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests#testDesiredBalanceShouldConvergeInABigCluster

Failure excerpt:

java.lang.AssertionError: All desired disk usages {node-3=33945040162, node-0=31673524702, node-1=42826606596, node-2=33757630192} should be smaller then actual disk sizes: 42660840495
Expected: every item is a value less than or equal to <42660840495L>
     but: an item <42826606596L> was greater than <42660840495L>

  at __randomizedtesting.SeedInfo.seed([4C0C5812E11E17C5:98A11F7C3AAEDA41]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:964)
  at org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests.testDesiredBalanceShouldConvergeInABigCluster(DesiredBalanceComputerTests.java:725)
  at jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
  at java.lang.reflect.Method.invoke(Method.java:580)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)
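
For reference, the failing assertion boils down to comparing each node's desired disk usage against a single disk size. Below is a minimal sketch that reruns that comparison on the values from the excerpt. `DiskUsageCheck` and `overCapacity` are hypothetical names used only for illustration; this is not the code in DesiredBalanceComputerTests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DiskUsageCheck {
    /** Returns, for each node whose desired usage exceeds diskSize, the excess in bytes. */
    static Map<String, Long> overCapacity(Map<String, Long> desiredUsage, long diskSize) {
        Map<String, Long> excess = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : desiredUsage.entrySet()) {
            if (entry.getValue() > diskSize) {
                excess.put(entry.getKey(), entry.getValue() - diskSize);
            }
        }
        return excess;
    }

    public static void main(String[] args) {
        // Values copied from the assertion message in the failure excerpt.
        Map<String, Long> desired = new LinkedHashMap<>();
        desired.put("node-3", 33945040162L);
        desired.put("node-0", 31673524702L);
        desired.put("node-1", 42826606596L);
        desired.put("node-2", 33757630192L);
        long diskSize = 42660840495L;

        // prints {node-1=165766101}
        System.out.println(overCapacity(desired, diskSize));
    }
}
```

Only node-1 is over the limit, by 165766101 bytes (about 166 MB).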


elasticsearchmachine · 2024-01-12T21:45:42Z

Pinging @elastic/es-distributed (Team:Distributed)

mark-vieira · 2024-01-12T21:46:20Z

Muted given that it reproduces.

ywangd · 2024-01-15T07:26:04Z

@idegtiarenko I spent some time studying this test failure as an opportunity to learn a bit more about desired balance. IIUC, the test failed because the desired balance assigned shards to a node whose total size exceeds that node's disk size. I tried to trace why. Since the failure is reproducible, I was able to determine that node-1's disk space is exceeded when [index-1][7] is assigned to it.

It seems to me that the ModelNode instances of the Balancer object keep track of the expected disk usage (diskUsageInBytes) as shards are allocated to them. However, DiskThresholdDecider does not consult this information and instead always uses allocation.clusterInfo().getNodeMostAvailableDiskUsages(), which shows abundant disk space. Is this intended, or is it a bug? That said, I may have looked in entirely the wrong places or misinterpreted the code, and it might just be a test setup issue. The test failure does go away if I bump cluster.routing.allocation.balance.disk_usage from 2e-11 to 2e-10.

Could you please give me a few hints on how to proceed further? Thanks!
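
To make the question concrete, here is a toy model of the behavior described above: a decision that consults a point-in-time free-space snapshot versus one that consults the usage tracked as shards are assigned. This is a sketch under stated assumptions, not Elasticsearch code. `StaleSnapshotSketch` and `assignAll` are hypothetical names, and whether the real allocator actually behaves like the first case is exactly what is being asked.

```java
class StaleSnapshotSketch {
    /**
     * Assigns shards to a single node one at a time and returns the total bytes assigned.
     * If consultTracked is false, each decision only looks at a free-space snapshot
     * taken before any assignment (hypothetical stand-in for a stale cluster info view).
     * If consultTracked is true, each decision accounts for the bytes already assigned.
     */
    static long assignAll(long diskSize, long[] shardSizes, boolean consultTracked) {
        long trackedUsage = 0;
        final long snapshotFree = diskSize; // never updated during the loop
        for (long shardSize : shardSizes) {
            boolean fits = consultTracked
                ? trackedUsage + shardSize <= diskSize
                : shardSize <= snapshotFree;
            if (fits) {
                trackedUsage += shardSize;
            }
        }
        return trackedUsage;
    }

    public static void main(String[] args) {
        long diskSize = 100;
        long[] shards = {60, 60};
        // Snapshot-only decisions accept both shards: prints 120, which exceeds the disk size.
        System.out.println(assignAll(diskSize, shards, false));
        // Decisions that consult tracked usage reject the second shard: prints 60.
        System.out.println(assignAll(diskSize, shards, true));
    }
}
```

In this toy model the snapshot-only path overshoots the disk, which is the shape of the failure in the excerpt. It says nothing about whether the cause here is the allocator or the test setup.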

idegtiarenko · 2024-01-15T09:16:29Z

Why was it muted in 8.11 and 8.12?

idegtiarenko · 2024-01-15T13:53:14Z

The test failure is caused by changes in ClusterInfoSimulator #102207 (8.13 only; other branches are not affected, and I am going to un-mute them shortly).

The test failure itself is a test bug as it does not set up initial disk space realistically. I am going to re-label this accordingly.

mark-vieira · 2024-01-16T16:00:26Z

> Why was it muted in 8.11 and 8.12?

It reproduced on those branches for me locally.

idegtiarenko · 2024-01-16T16:07:03Z

I was not able to reproduce this locally on those branches.
Do you mind sharing the build scan failures for 8.11 and 8.12?
If it does fail there, then there might be other changes causing this failure.

ldematte · 2024-04-24T08:54:43Z

It happened again today, in main: https://gradle-enterprise.elastic.co/s/3jrclx7mjwnyu/tests/task/:server:test/details/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceComputerTests/testDesiredBalanceShouldConvergeInABigCluster?top-execution=1

mark-vieira added the :Distributed Coordination/Allocation and >test-failure labels Jan 12, 2024

elasticsearchmachine added the blocker and Team:Distributed (Obsolete) labels Jan 12, 2024

mark-vieira added a commit that referenced this issue Jan 12, 2024

AwaitsFix #104343

63b3e66

mark-vieira added a commit that referenced this issue Jan 12, 2024

AwaitsFix #104343

81fed57

mark-vieira added a commit that referenced this issue Jan 12, 2024

AwaitsFix #104343

5d887b4

ywangd self-assigned this Jan 15, 2024

idegtiarenko added the low-risk label and removed the blocker label Jan 15, 2024

This was referenced Jan 17, 2024

Fix testDesiredBalanceShouldConvergeInABigCluster #104442

Merged

Unmute testDesiredBalanceShouldConvergeInABigCluster #104445

Merged

idegtiarenko closed this as completed in #104442 Jan 18, 2024

ldematte reopened this Apr 24, 2024

DaveCTurner added a commit that referenced this issue May 1, 2024

AwaitsFix for #104343

fc71923

idegtiarenko mentioned this issue May 14, 2024

Fix testDesiredBalanceShouldConvergeInABigCluster #108611

Merged

idegtiarenko closed this as completed in #108611 May 16, 2024

ywangd removed their assignment May 17, 2024
