Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky-test: ZKSessionTest.testReacquireLeadershipAfterSessionLost #11032

Closed
lhotari opened this issue Jun 23, 2021 · 9 comments · Fixed by #11062 or #19769
Closed

Flaky-test: ZKSessionTest.testReacquireLeadershipAfterSessionLost #11032

lhotari opened this issue Jun 23, 2021 · 9 comments · Fixed by #11062 or #19769
Assignees

Comments

@lhotari
Copy link
Member

lhotari commented Jun 23, 2021

ZKSessionTest is flaky. The testReacquireLeadershipAfterSessionLost test method fails sporadically.

example failure

Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 4, Time elapsed: 31.46 s <<< FAILURE! - in org.apache.pulsar.metadata.ZKSessionTest
Error:  testReacquireLeadershipAfterSessionLost(org.apache.pulsar.metadata.ZKSessionTest)  Time elapsed: 15.396 s  <<< FAILURE!
org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.pulsar.metadata.ZKSessionTest that uses org.apache.pulsar.metadata.api.coordination.LeaderElection expected [Leading] but found [NoLeader] within 10 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
	at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
	at org.apache.pulsar.metadata.ZKSessionTest.testReacquireLeadershipAfterSessionLost(ZKSessionTest.java:174)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.AssertionError: expected [Leading] but found [NoLeader]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotEquals(Assert.java:1037)
	at org.testng.Assert.assertEqualsImpl(Assert.java:140)
	at org.testng.Assert.assertEquals(Assert.java:122)
	at org.testng.Assert.assertEquals(Assert.java:617)
	at org.apache.pulsar.metadata.ZKSessionTest.lambda$testReacquireLeadershipAfterSessionLost$1(ZKSessionTest.java:175)
	at org.awaitility.core.AssertionCondition.lambda$new$0(AssertionCondition.java:53)
	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:222)
	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:209)
	... 4 more
@lhotari
Copy link
Member Author

lhotari commented Jun 23, 2021

Another exception in the same test

Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 4, Time elapsed: 17.285 s <<< FAILURE! - in org.apache.pulsar.metadata.ZKSessionTest
Error:  testReacquireLeadershipAfterSessionLost(org.apache.pulsar.metadata.ZKSessionTest)  Time elapsed: 3.037 s  <<< FAILURE!
java.lang.AssertionError: expected [null] but found [NoLeader]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotSame(Assert.java:1033)
	at org.testng.Assert.assertNull(Assert.java:965)
	at org.testng.Assert.assertNull(Assert.java:953)
	at org.apache.pulsar.metadata.ZKSessionTest.testReacquireLeadershipAfterSessionLost(ZKSessionTest.java:179)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

@lhotari
Copy link
Member Author

lhotari commented Jun 23, 2021

another failure in testReacquireLocksAfterSessionLost method

Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 75.825 s <<< FAILURE! - in org.apache.pulsar.metadata.ZKSessionTest
Error:  testReacquireLocksAfterSessionLost(org.apache.pulsar.metadata.ZKSessionTest)  Time elapsed: 24.543 s  <<< FAILURE!
java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /key-1455163628288
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
	at java.base/java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:777)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$storeDelete$21(ZKMetadataStore.java:277)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /key-1455163628288
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:314)
	... 5 more
Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /key-1455163628288
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:304)
	... 5 more

@mattisonchao
Copy link
Member

I want to fix it. :)

sijie pushed a commit that referenced this issue Jul 14, 2021
Fixes #11032 #11143

Master Issue: #11032 #11143

Motivation
In some cases, testing errors occurred due to Zookeeper's notification delay.

Modifications
Waiting async notice until timeout.
@lhotari
Copy link
Member Author

lhotari commented Aug 24, 2021

This test remains flaky. Reopening this issue

Here's a recent failure:

Error:  testReacquireLocksAfterSessionLost(org.apache.pulsar.metadata.ZKSessionTest)  Time elapsed: 17.782 s  <<< FAILURE!
org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.pulsar.metadata.ZKSessionTest that uses org.apache.pulsar.metadata.api.coordination.ResourceLock expected [false] but found [true] within 10 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
	at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
	at org.apache.pulsar.metadata.ZKSessionTest.testReacquireLocksAfterSessionLost(ZKSessionTest.java:130)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.AssertionError: expected [false] but found [true]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotEquals(Assert.java:1037)
	at org.testng.Assert.assertFalse(Assert.java:67)
	at org.testng.Assert.assertFalse(Assert.java:77)
	at org.apache.pulsar.metadata.ZKSessionTest.lambda$testReacquireLocksAfterSessionLost$0(ZKSessionTest.java:131)
	at org.awaitility.core.AssertionCondition.lambda$new$0(AssertionCondition.java:53)
	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:222)
	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:209)
	... 4 more

this happened in https://github.com/apache/pulsar/pull/11758/checks?check_run_id=3408040594#step:8:4706

@lhotari lhotari reopened this Aug 24, 2021
@lhotari
Copy link
Member Author

lhotari commented Aug 24, 2021

There's also expected [SessionReestablished] but found [Reconnected] type of assertion failures

Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 43.614 s <<< FAILURE! - in org.apache.pulsar.metadata.ZKSessionTest
Error:  testReacquireLocksAfterSessionLost(org.apache.pulsar.metadata.ZKSessionTest)  Time elapsed: 8.149 s  <<< FAILURE!
java.lang.AssertionError: expected [SessionReestablished] but found [Reconnected]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotEquals(Assert.java:1037)
	at org.testng.Assert.assertEqualsImpl(Assert.java:140)
	at org.testng.Assert.assertEquals(Assert.java:122)
	at org.testng.Assert.assertEquals(Assert.java:617)
	at org.apache.pulsar.metadata.ZKSessionTest.testReacquireLocksAfterSessionLost(ZKSessionTest.java:128)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

in https://github.com/apache/pulsar/pull/11758/checks?check_run_id=3408040594#step:8:3079

@lhotari
Copy link
Member Author

lhotari commented Aug 24, 2021

also

Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 3, Time elapsed: 30.201 s <<< FAILURE! - in org.apache.pulsar.metadata.ZKSessionTest
Error:  testReacquireLocksAfterSessionLost(org.apache.pulsar.metadata.ZKSessionTest)  Time elapsed: 12.986 s  <<< FAILURE!
org.awaitility.core.ConditionTimeoutException: Assertion condition defined as a lambda expression in org.apache.pulsar.metadata.ZKSessionTest that uses org.apache.pulsar.metadata.api.coordination.ResourceLock expected [false] but found [true] within 10 seconds.
	at org.awaitility.core.ConditionAwaiter.await(ConditionAwaiter.java:165)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:119)
	at org.awaitility.core.AssertionCondition.await(AssertionCondition.java:31)
	at org.awaitility.core.ConditionFactory.until(ConditionFactory.java:895)
	at org.awaitility.core.ConditionFactory.untilAsserted(ConditionFactory.java:679)
	at org.apache.pulsar.metadata.ZKSessionTest.testReacquireLocksAfterSessionLost(ZKSessionTest.java:130)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
	at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73)
	at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.AssertionError: expected [false] but found [true]
	at org.testng.Assert.fail(Assert.java:99)
	at org.testng.Assert.failNotEquals(Assert.java:1037)
	at org.testng.Assert.assertFalse(Assert.java:67)
	at org.testng.Assert.assertFalse(Assert.java:77)
	at org.apache.pulsar.metadata.ZKSessionTest.lambda$testReacquireLocksAfterSessionLost$0(ZKSessionTest.java:131)
	at org.awaitility.core.AssertionCondition.lambda$new$0(AssertionCondition.java:53)
	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:222)
	at org.awaitility.core.ConditionAwaiter$ConditionPoller.call(ConditionAwaiter.java:209)
	... 4 more

lhotari added a commit to lhotari/pulsar that referenced this issue Aug 25, 2021
…ate between retries

- mitigates flaky tests in LockManagerTest apache#11690 and ZkSessionTest apache#11032
@lhotari
Copy link
Member Author

lhotari commented Aug 25, 2021

there's a mitigation for this flaky test in #11771 by using a solution which cleans up state between test retries.

eolivelli pushed a commit that referenced this issue Aug 25, 2021
…ate between retries (#11771)

* [Tests] Use TestRetrySupport for BaseMetadataStoreTests to cleanup state between retries

- mitigates flaky tests in LockManagerTest #11690 and ZkSessionTest #11032

* [Tests] Add notes about the correct usage of TestRetrySupport
eolivelli pushed a commit to datastax/pulsar that referenced this issue Aug 25, 2021
…ate between retries (apache#11771)

* [Tests] Use TestRetrySupport for BaseMetadataStoreTests to cleanup state between retries

- mitigates flaky tests in LockManagerTest apache#11690 and ZkSessionTest apache#11032

* [Tests] Add notes about the correct usage of TestRetrySupport

(cherry picked from commit 7de9992)
@lhotari
Copy link
Member Author

lhotari commented Sep 2, 2021

fixed by #11886

@lhotari lhotari closed this as completed Sep 2, 2021
codelipenghui pushed a commit that referenced this issue Sep 9, 2021
…ate between retries (#11771)

* [Tests] Use TestRetrySupport for BaseMetadataStoreTests to cleanup state between retries

- mitigates flaky tests in LockManagerTest #11690 and ZkSessionTest #11032

* [Tests] Add notes about the correct usage of TestRetrySupport

(cherry picked from commit 7de9992)
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this issue Mar 18, 2022
…#11062)

Fixes apache#11032 apache#11143

Master Issue: apache#11032 apache#11143

Motivation
In some cases, testing errors occurred due to Zookeeper's notification delay.

Modifications
Waiting async notice until timeout.
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this issue Mar 18, 2022
…ate between retries (apache#11771)

* [Tests] Use TestRetrySupport for BaseMetadataStoreTests to cleanup state between retries

- mitigates flaky tests in LockManagerTest apache#11690 and ZkSessionTest apache#11032

* [Tests] Add notes about the correct usage of TestRetrySupport
@lhotari lhotari reopened this Mar 9, 2023
@lhotari
Copy link
Member Author

lhotari commented Mar 9, 2023

related to #19754 . @mattisonchao please remove ZKSessionTest from quarantine and fix this. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants