Flaky test `BaseFailureRecoveryTest.testParallel`: "There should be no remaining tmp_trino tables that are queryable" #16277

ebyhr · 2023-02-27T09:53:23Z

https://github.com/trinodb/trino/actions/runs/4279764588/jobs/7451040205

Error:    TestBigQueryQueryFailureRecoveryTest>BaseFailureRecoveryTest.testParallel:218->BaseFailureRecoveryTest.testInsert:235->BaseFailureRecoveryTest.testTableModification:333->BaseFailureRecoveryTest.testTableModification:338->BaseFailureRecoveryTest.testNonSelect:367->BaseFailureRecoveryTest.lambda$testNonSelect$10:367 
Expecting throwable message:
  <"[There should be no remaining tmp_trino_ab7ec455_* tables. They are: [tmp_trino_ab7ec455_517f28a7]] 
Expecting:
 <1>
to be equal to:
 <0>
but was not.">
to contain:
  <"This error is injected by the failure injection service">
but did not.

Throwable that failed the check:

java.lang.AssertionError: [There should be no remaining tmp_trino_ab7ec455_* tables. They are: [tmp_trino_ab7ec455_517f28a7]] 
Expecting:
 <1>
to be equal to:
 <0>
but was not.
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.execute(BaseFailureRecoveryTest.java:531)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.executeActual(BaseFailureRecoveryTest.java:496)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.executeActualNoRetries(BaseFailureRecoveryTest.java:481)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.lambda$failsWithoutRetries$12(BaseFailureRecoveryTest.java:673)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.failsWithoutRetries(BaseFailureRecoveryTest.java:673)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.failsAlways(BaseFailureRecoveryTest.java:664)
	at io.trino.testing.BaseFailureRecoveryTest.testNonSelect(BaseFailureRecoveryTest.java:367)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:338)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:333)
	at io.trino.testing.BaseFailureRecoveryTest.testInsert(BaseFailureRecoveryTest.java:235)
	at io.trino.testing.BaseFailureRecoveryTest$ParallelTestRunnable.run(BaseFailureRecoveryTest.java:850)
	at io.trino.testing.BaseFailureRecoveryTest.testParallel(BaseFailureRecoveryTest.java:218)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
	at org.testng.internal.TestMethodWithDataProviderMethodWorker.call(TestMethodWithDataProviderMethodWorker.java:75)
	at org.testng.internal.TestMethodWithDataProviderMethodWorker.call(TestMethodWithDataProviderMethodWorker.java:14)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

The text was updated successfully, but these errors were encountered:

ebyhr · 2023-02-27T09:54:41Z

cc: @mwd410 @hashhar

ebyhr · 2023-03-02T06:56:53Z

TestMySqlQueryFailureRecoveryTest.testParallel also failed.
https://github.com/trinodb/trino/actions/runs/4301647111/jobs/7502205035

hashhar · 2023-03-03T07:03:19Z

I suspect a query id hash collision, let me see if changing the temp table name generation helps.

hashhar · 2023-03-03T12:45:13Z

testCreateTable, testInsert and testExplainAnalyze seem to the ones which fail more frequently.

At the moment since I can't reproduce this at will I had to look at the code and make guesses. The one thing which caught my attention is that we use a Closer to drop tables and Closer suppresses any exceptions thrown when calling close on the resource being closed so it's possible we're not seeing the actual failures during dropTable.

cc: @mwd410 for any ideas

hashhar · 2023-03-03T14:08:58Z

Found the problem, MySQL doesn't have a strongly consistent information schema so sometimes tables appear to exist even after they've been dropped. I verified this by checking if the table that the test complains was not cleaned up actually exists or not. In none of the cases I was able to find the table which the test claims was not cleaned up.

I'll check if an assertEventually helps.

hashhar · 2023-03-04T19:21:35Z

Well, MySQL's information_schema seems to act horribly under contention.

Here's some logs.

1st Case

Table appears to exist as seen from Trino.
Query directly against MySQL shows table doesn't exist anymore.

Note that it can take multiple retries from Trino for table to be seen as deleted.

2023-03-04T12:58:21.127-0600	ERROR	PoolService-1	FLAKY_FTE_111bc028	Trino: leftover tables: 1; prefix: tmp_trino_111bc028_
2023-03-04T12:58:21.127-0600	ERROR	PoolService-1	FLAKY_FTE_111bc028	MySQL: information_schema for prefix tmp_trino_111bc028_:
...
2023-03-04T12:58:58.140-0600	ERROR	PoolService-3	FLAKY_FTE_058ade3b	Trino: leftover tables: 1; prefix: tmp_trino_058ade3b_
2023-03-04T12:58:58.140-0600	ERROR	PoolService-3	FLAKY_FTE_058ade3b	MySQL: information_schema for prefix tmp_trino_058ade3b_:

Case 2

Table seen as exists on Trino.
Information_schema on MySQL also shows it exists but a query against the table fails.
Now listing from Trino shows the table is gone.

2023-03-04T13:08:48.232-0600	ERROR	PoolService-9	FLAKY_FTE_254e92ae	Trino: leftover tables: 1; prefix: tmp_trino_254e92ae_
2023-03-04T13:08:48.232-0600	ERROR	PoolService-9	FLAKY_FTE_254e92ae	MySQL: information_schema for prefix tmp_trino_254e92ae_: tmp_trino_254e92ae_b1402314
2023-03-04T13:08:48.331-0600	ERROR	PoolService-9	FLAKY_FTE_254e92ae	MySQL: tmp_trino_254e92ae_b1402314 does not exist; error: java.sql.SQLSyntaxErrorException: Table 'tpch.tmp_trino_254e92ae_b1402314' doesn't exist [statement:"SELECT count(1) FROM tmp_trino_254e92ae_b1402314", arguments:{positional:{}, named:{}, finder:[]}]
2023-03-04T13:08:48.521-0600	ERROR	PoolService-9	FLAKY_FTE_254e92ae	Trino: information_schema for prefix tmp_trino_254e92ae_:
...
2023-03-04T13:09:26.991-0600	ERROR	PoolService-0	FLAKY_FTE_ac10452c	Trino: leftover tables: 1; prefix: tmp_trino_ac10452c_
2023-03-04T13:09:26.991-0600	ERROR	PoolService-0	FLAKY_FTE_ac10452c	MySQL: information_schema for prefix tmp_trino_ac10452c_: tmp_trino_ac10452c_972a56ba
2023-03-04T13:09:27.101-0600	ERROR	PoolService-0	FLAKY_FTE_ac10452c	MySQL: tmp_trino_ac10452c_972a56ba does not exist; error: java.sql.SQLSyntaxErrorException: Table 'tpch.tmp_trino_ac10452c_972a56ba' doesn't exist [statement:"SELECT count(1) FROM tmp_trino_ac10452c_972a56ba", arguments:{positional:{}, named:{}, finder:[]}]
2023-03-04T13:09:27.289-0600	ERROR	PoolService-0	FLAKY_FTE_ac10452c	Trino: information_schema for prefix tmp_trino_ac10452c_:

Possible solutions

assertEventually doesn't help always, test is still flaky though lesser chance of occurrence.

Reducing concurrency also doesn't seem to help with the failure.

Alternative tests

Instead of observing the database state we can try to instead check whether we issued DROP TABLE statements for all temporary tables. This needs us to implement a generic mechanism to record queries regardless of connector (BaseJdbc vs others).

@mwd410 @losipiuk Do you have any ideas how to address this? This will be a problem with any system which doesn't have a strongly consistent information_schema like MySQL, SingleStore and maybe others.

hashhar · 2023-03-04T19:31:04Z

Also, maybe we can disable the validation by default but keep the code in there so we can test the cleanup manually when adding FTE to some new connector only.

kokosing · 2023-03-06T11:31:03Z

This will be a problem with any system which doesn't have a strongly consistent information_schema like MySQL, SingleStore and maybe others.

I think that since it is remote database specific, we need to handle this by some abstraction that will be implemented per each remote database (connector). And in some cases where information_schema is very unreliable we need to accept that we could consider skipping that assertions. Hopefully assertEventually in few of them would be good enough solution.

losipiuk · 2023-03-08T14:46:39Z

This will be a problem with any system which doesn't have a strongly consistent information_schema like MySQL, SingleStore and maybe others.

I think that since it is remote database specific, we need to handle this by some abstraction that will be implemented per each remote database (connector). And in some cases where information_schema is very unreliable we need to accept that we could consider skipping that assertions. Hopefully assertEventually in few of them would be good enough solution.

Yeah - I agree. Maybe we can just expose protected assertNoTmpTablesLeftBehind() with default implementation and override where neeeded.

losipiuk · 2023-03-10T10:29:20Z

@mwd410 can you look into this one?

mwd410 · 2023-03-10T15:07:48Z

We could, instead of simply relying on show tables like to return no entries, we could do

assertThatThrownBy(() -> getQueryRunner().execute("SELECT 1 FROM %s WHERE 1 = 0".formatted(tempTableName)))
                        .hasMessageContaining("%s does not exist", tempTableName);

Thoughts?

mwd410 · 2023-03-10T15:39:34Z

I have opened #16499 to address this.

mwd410 · 2023-03-16T16:01:54Z

@ebyhr - i merged my pull request the other day - have se seen any of these errors since then? If not can we resolve this issue?

ebyhr · 2023-03-16T21:34:04Z

@mwd410 Thanks! Let's close and reopen if needed.

anusudarsan · 2023-03-23T15:33:11Z

@mwd410 saw this today here https://github.com/starburstdata/starburst-trino-plugins/pull/156/files#diff-b1ef0b3848be78431e2c376255c12d7b98cde03f985e75f00b0a331b5bd836af. That version should have your fix right?

findepi · 2023-03-24T12:34:47Z

https://github.com/trinodb/trino/actions/runs/4509499706/jobs/7939611131?pr=16707

Error:  io.trino.plugin.postgresql.TestPostgresQueryFailureRecoveryTest.testParallel[testCreateTable](8)  Time elapsed: 57.04 s  <<< FAILURE!
java.lang.AssertionError: expected [1] but found [2]
	at org.testng.Assert.fail(Assert.java:94)
	at org.testng.Assert.failNotEquals(Assert.java:513)
	at org.testng.Assert.assertEqualsImpl(Assert.java:135)
	at org.testng.Assert.assertEquals(Assert.java:116)
	at org.testng.Assert.assertEquals(Assert.java:389)
	at org.testng.Assert.assertEquals(Assert.java:399)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.finishesSuccessfully(BaseFailureRecoveryTest.java:651)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.finishesSuccessfully(BaseFailureRecoveryTest.java:641)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.finishesSuccessfully(BaseFailureRecoveryTest.java:631)
	at io.trino.testing.BaseFailureRecoveryTest.testNonSelect(BaseFailureRecoveryTest.java:389)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:348)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:343)
	at io.trino.testing.BaseFailureRecoveryTest.testCreateTable(BaseFailureRecoveryTest.java:237)
	at io.trino.testing.BaseFailureRecoveryTest$ParallelTestRunnable.run(BaseFailureRecoveryTest.java:905)
	at io.trino.testing.BaseFailureRecoveryTest.testParallel(BaseFailureRecoveryTest.java:228)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
	at org.testng.internal.TestMethodWithDataProviderMethodWorker.call(TestMethodWithDataProviderMethodWorker.java:75)
	at org.testng.internal.TestMethodWithDataProviderMethodWorker.call(TestMethodWithDataProviderMethodWorker.java:14)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

cc @mwd410 @arhimondr @losipiuk @linzebing

arhimondr · 2023-03-24T14:27:07Z

@mwd410 Could you please take a look?

mwd410 · 2023-05-15T14:40:23Z

From a cursory glance, this appears to be due to Oracle's "Recycling Bin" mechanism. I have opened a pr to address this issue, where I also give further explanation.

@findepi @ebyhr

raunaqmorarka · 2023-05-18T02:27:54Z

Still encountering this after latest fix
https://github.com/trinodb/trino/actions/runs/5004142646/jobs/8966655659?pr=17212

io.trino.plugin.oracle.TestOracleQueryFailureRecoveryTest.testParallel[testCreateTable](9)  Time elapsed: 54.428 s  <<< FAILURE!
org.junit.ComparisonFailure: 
[There should be no remaining tmp_trino tables that are queryable. They are:
	For queryId [20230517_144752_00198_biqit] (prefix [tmp_trino_9d29747f_]) remaining tables: [tmp_trino_9d29747f_06f5762c]
		With errors: [
			Expecting code to raise a throwable.]] 
Expecting value to be true but was false expected:<[tru]e> but was:<[fals]e>
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at io.trino.testing.BaseFailureRecoveryTest.checkTemporaryTables(BaseFailureRecoveryTest.java:463)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.cleansUpTemporaryTables(BaseFailureRecoveryTest.java:624)
	at io.trino.testing.BaseFailureRecoveryTest.testNonSelect(BaseFailureRecoveryTest.java:409)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:347)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:342)
	at io.trino.testing.BaseFailureRecoveryTest.testCreateTable(BaseFailureRecoveryTest.java:236)
	at io.trino.testing.BaseFailureRecoveryTest$ParallelTestRunnable.run(BaseFailureRecoveryTest.java:910)
	at io.trino.testing.BaseFailureRecoveryTest.testParallel(BaseFailureRecoveryTest.java:227)

mwd410 · 2023-05-19T20:10:53Z

@raunaqmorarka - OK, I spent all day trying to reproduce this locally, and then looked closer at the output and found this. Documentation suggests it means the table's lock is already held by a different query, but I have no idea what that might be - any ideas? This will require more investigation next week, but at least we have a thread to pull on.

Note: I also see the same error in the previous run that re-opened this issue for the same reason, so I think my previously added PURGE clause in the drop table was unnecessary (though probably wise to keep nonetheless)

2023-05-17T08:47:52.722-0600	ERROR	transaction-finishing-0	io.trino.metadata.CatalogMetadata	Connector threw exception on abort
io.trino.spi.TrinoException: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1201)
	at io.trino.plugin.oracle.OracleClient.dropTable(OracleClient.java:310)
	at io.trino.plugin.jdbc.BaseJdbcClient.rollbackCreateTable(BaseJdbcClient.java:1050)
	at io.trino.plugin.jdbc.ForwardingJdbcClient.rollbackCreateTable(ForwardingJdbcClient.java:249)
	at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.lambda$rollbackCreateTable$36(StatisticsAwareJdbcClient.java:330)
	at io.trino.plugin.jdbc.jmx.JdbcApiStats.wrap(JdbcApiStats.java:47)
	at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.rollbackCreateTable(StatisticsAwareJdbcClient.java:330)
	at io.trino.plugin.jdbc.CachingJdbcClient.rollbackCreateTable(CachingJdbcClient.java:392)
	at io.trino.plugin.jdbc.CachingJdbcClient.rollbackCreateTable(CachingJdbcClient.java:392)
	at io.trino.plugin.jdbc.DefaultJdbcMetadata.lambda$beginCreateTable$16(DefaultJdbcMetadata.java:778)
	at java.base/java.util.Optional.ifPresent(Optional.java:178)
	at io.trino.plugin.jdbc.DefaultJdbcMetadata.rollback(DefaultJdbcMetadata.java:841)
	at io.trino.plugin.jdbc.JdbcTransactionManager.rollback(JdbcTransactionManager.java:64)
	at io.trino.plugin.jdbc.JdbcConnector.rollback(JdbcConnector.java:109)
	at io.trino.metadata.CatalogTransaction.abort(CatalogTransaction.java:93)
	at io.trino.metadata.CatalogMetadata.safeAbort(CatalogMetadata.java:167)
	at io.trino.metadata.CatalogMetadata.abort(CatalogMetadata.java:161)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
	at io.trino.$gen.Trino_testversion____20230517_144649_596.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.sql.SQLException: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:629)
	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:563)
	at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1230)
	at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:771)
	at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:298)
	at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:511)
	at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:122)
	at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1199)
	at oracle.jdbc.driver.OracleStatement.executeSQLStatement(OracleStatement.java:1819)
	at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1471)
	at oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:2504)
	at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:2459)
	at oracle.jdbc.driver.OracleStatementWrapper.execute(OracleStatementWrapper.java:327)
	at oracle.ucp.jdbc.proxy.oracle$1ucp$1jdbc$1proxy$1oracle$1StatementProxy$2oracle$1jdbc$1internal$1OracleStatement$$$Proxy.execute(Unknown Source)
	at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1211)
	at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1198)
	... 25 more
	Suppressed: java.lang.RuntimeException: Query: DROP TABLE "TRINO_TEST"."TMP_TRINO_9D29747F_06F5762C" PURGE
		at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1214)
		... 26 more
Caused by: Error : 54, Position : 24, Sql = DROP TABLE "TRINO_TEST"."TMP_TRINO_9D29747F_06F5762C" PURGE, OriginalSql = DROP TABLE "TRINO_TEST"."TMP_TRINO_9D29747F_06F5762C" PURGE, Error Msg = ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:636)
	... 40 more

mwd410 · 2023-05-19T20:20:36Z

Ah, according to this, I believe the solution is to lock the table ahead of dropping it. Apparently, by default, oracle does not wait for other transactions on a table to complete before acquiring its lock to drop it.

I suspect this is caused by a race condition between injected errors while some other thread is inserting into the table (e.g. a JdbcPageSink), then we try to rollbackCreateTable while that other transaction is still running. Note: the above trace does include rollbackCreateTable.

findepi · 2023-05-26T13:13:44Z

https://github.com/trinodb/trino/actions/runs/5089043624/jobs/9146290943?pr=17650

mwd410 · 2023-05-26T13:47:28Z

sorry for the delay on this - there was some review on my PR which I just got in moments ago.

findepi · 2023-06-02T11:15:51Z

https://github.com/trinodb/trino/actions/runs/5141998837/jobs/9255271329?pr=17722

Error:  Failures: 
Error:    TestOracleQueryFailureRecoveryTest>BaseFailureRecoveryTest.testParallel:227->BaseFailureRecoveryTest.testInsert:244->BaseFailureRecoveryTest.testTableModification:342->BaseFailureRecoveryTest.testTableModification:347->BaseFailureRecoveryTest.testNonSelect:409->BaseFailureRecoveryTest.checkTemporaryTables:463 [There should be no remaining tmp_trino tables that are queryable. They are:
	For queryId [20230601_084133_00223_u7vng] (prefix [tmp_trino_ec95e300_]) remaining tables: [tmp_trino_ec95e300_4b157d1c]
		With errors: [
			Expecting code to raise a throwable.]] 
Expecting value to be true but was false expected:<[tru]e> but was:<[fals]e>

findepi · 2023-06-02T11:16:00Z

@mwd410 what's the status of this?

losipiuk · 2023-06-05T08:59:42Z

There was hope it would get fixed with: #17575. cc: @mwd410

Apparently did not help.
We are still getting

io.trino.spi.TrinoException: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at io.trino.plugin.oracle.OracleClient.dropTable(OracleClient.java:321)
	at io.trino.plugin.jdbc.BaseJdbcClient.rollbackCreateTable(BaseJdbcClient.java:1050)
	at io.trino.plugin.jdbc.ForwardingJdbcClient.rollbackCreateTable(ForwardingJdbcClient.java:249)
	at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.lambda$rollbackCreateTable$36(StatisticsAwareJdbcClient.java:330)
	at io.trino.plugin.jdbc.jmx.JdbcApiStats.wrap(JdbcApiStats.java:47)
	at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.rollbackCreateTable(StatisticsAwareJdbcClient.java:330)
	at io.trino.plugin.jdbc.CachingJdbcClient.rollbackCreateTable(CachingJdbcClient.java:392)
	at io.trino.plugin.jdbc.CachingJdbcClient.rollbackCreateTable(CachingJdbcClient.java:392)
	at io.trino.plugin.jdbc.DefaultJdbcMetadata.lambda$beginInsert$20(DefaultJdbcMetadata.java:853)
	at java.base/java.util.Optional.ifPresent(Optional.java:178)
	at io.trino.plugin.jdbc.DefaultJdbcMetadata.rollback(DefaultJdbcMetadata.java:841)
	at io.trino.plugin.jdbc.JdbcTransactionManager.rollback(JdbcTransactionManager.java:64)
	at io.trino.plugin.jdbc.JdbcConnector.rollback(JdbcConnector.java:109)
	at io.trino.metadata.CatalogTransaction.abort(CatalogTransaction.java:93)
	at io.trino.metadata.CatalogMetadata.safeAbort(CatalogMetadata.java:167)
	at io.trino.metadata.CatalogMetadata.abort(CatalogMetadata.java:161)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
	at io.trino.$gen.Trino_testversion____20230601_084025_3091.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.sql.SQLException: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:629)
	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:563)
	at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1230)
	at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:771)
	at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:298)
	at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:511)
	at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:122)
	at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1199)
	at oracle.jdbc.driver.OracleStatement.executeSQLStatement(OracleStatement.java:1819)
	at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1471)
	at oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:2504)
	at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:2459)
	at oracle.jdbc.driver.OracleStatementWrapper.execute(OracleStatementWrapper.java:327)
	at oracle.ucp.jdbc.proxy.oracle$1ucp$1jdbc$1proxy$1oracle$1StatementProxy$2oracle$1jdbc$1internal$1OracleStatement$$$Proxy.execute(Unknown Source)
	at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1211)
	at io.trino.plugin.oracle.OracleClient.dropTable(OracleClient.java:318)
	... 24 more
	Suppressed: java.lang.RuntimeException: Query: DROP TABLE "TRINO_TEST"."TMP_TRINO_EC95E300_4B157D1C" PURGE
		at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1214)
		... 25 more
Caused by: Error : 54, Position : 24, Sql = DROP TABLE "TRINO_TEST"."TMP_TRINO_EC95E300_4B157D1C" PURGE, OriginalSql = DROP TABLE "TRINO_TEST"."TMP_TRINO_EC95E300_4B157D1C" PURGE, Error Msg = ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:636)
	... 39 more

mwd410 · 2023-06-05T13:44:51Z

hmm. i have one thread to pull on. looking

mwd410 · 2023-06-05T14:11:40Z

Oh the perils of forgetting about auto commit. 🤦

losipiuk · 2023-06-06T21:30:35Z

Merged #17756 (thanks @mwd410 ). Hopefully it will solve the issue.

findepi · 2023-06-07T12:35:58Z

🤞

ebyhr · 2023-06-08T01:05:35Z

Let me reopen because the test is still failing on master.

Error:  Failures: 
Error:    TestOracleQueryFailureRecoveryTest>BaseFailureRecoveryTest.testParallel:227->BaseFailureRecoveryTest.testExplainAnalyze:320->BaseFailureRecoveryTest.testTableModification:342->BaseFailureRecoveryTest.testTableModification:347->BaseFailureRecoveryTest.testNonSelect:409->BaseFailureRecoveryTest.checkTemporaryTables:463 [There should be no remaining tmp_trino tables that are queryable. They are:
	For queryId [20230607_221201_00339_7dv8e] (prefix [tmp_trino_68459c86_]) remaining tables: [tmp_trino_68459c86_f31eaf34]
		With errors: [
			Expecting code to raise a throwable.]] 
Expecting value to be true but was false expected:<[tru]e> but was:<[fals]e>

https://github.com/trinodb/trino/actions/runs/5203676867/jobs/9388677568

mwd410 · 2023-06-09T21:07:00Z

i'm kind of at a loss here. is it possibly that we are

- opening a connection
  - creating first statement, acquiring lock, closing first statement
  - creating second statement, dropping table, closing second statement
- closing connection

whereas perhaps we should be

- opening a connection
  - creating first statement
  - creating second statement
  - acquiring lock
  - dropping table
  - closing first and second statement
- closing connection

I don't have any other ideas. This doesn't seem to be behaving as documented or how i understand it should behave. Am i missing something?

losipiuk · 2023-06-12T08:19:41Z

IDK what is happening here but maybe we can try to add some extra loggin (e.g. dump locks information before executing DROP STATEMENT). Maybe it will shed some light.
It is happening fairly often so you should be able to easily reporduce that on CI with modified ci.yml so you run just impacted job but many times @mwd410

findepi · 2023-06-12T11:34:07Z

in the meantime, would it make sense to mark TestOracleQueryFailureRecoveryTest.testParallel as @Flaky @losipiuk @mwd410 ?

losipiuk · 2023-06-12T11:46:31Z

in the meantime, would it make sense to mark TestOracleQueryFailureRecoveryTest.testParallel as @flaky @losipiuk @mwd410 ?

Good idea. @mwd410 please send the PR

ebyhr added bug Something isn't working test bigquery BigQuery connector labels Feb 27, 2023

ebyhr changed the title ~~Flaky TestBigQueryQueryFailureRecoveryTest.testParallel~~ Flaky BaseFailureRecoveryTest.testParallel Mar 2, 2023

ebyhr mentioned this issue Mar 2, 2023

Add TestAddColumnTask and TestDropColumnTask #16324

Merged

guiyanakuang mentioned this issue Mar 3, 2023

Attempt to evaluate the value of TryFunction in ExpressionInterpreter #16357

Closed

losipiuk assigned mwd410 Mar 10, 2023

mwd410 mentioned this issue Mar 10, 2023

Fix BaseFailureRecoveryTest temp table checking flakiness #16277 #16499

Merged

ebyhr closed this as completed Mar 16, 2023

anusudarsan reopened this Mar 23, 2023

findepi mentioned this issue Mar 24, 2023

Refactor testRenameTableToLongTableName to remove Iceberg override #16707

Merged

losipiuk mentioned this issue May 15, 2023

Prefer killing speculative tasks #17478

Merged

phd3 mentioned this issue May 16, 2023

Isolate TestFileSystemCache to fix its flakiness #17467

Merged

mwd410 mentioned this issue May 19, 2023

Oracle waits for table lock before drop #17575

Merged

findepi mentioned this issue May 26, 2023

Fix incorrect @BeforeTest usages #17650

Merged

findepi mentioned this issue Jun 2, 2023

Remove defunct query.remote-task.min-error-duration configuration #17722

Merged

mwd410 mentioned this issue Jun 5, 2023

Oracle turns off auto commit to drop temporary table #17756

Merged

findepi closed this as completed Jun 7, 2023

ebyhr reopened this Jun 8, 2023

hashhar mentioned this issue Jun 9, 2023

Revert "Remove unnecessary separate CI job for Redshift" #17824

Merged

mwd410 mentioned this issue Jun 12, 2023

Mark Oracle FailureRecovery tests as flaky #17860

Merged

chenjian2664 mentioned this issue Jun 14, 2023

Fix delete target table captured as input in event listener #16887

Closed

losipiuk closed this as completed Jun 14, 2023

Flaky test BaseFailureRecoveryTest.testParallel: "There should be no remaining tmp_trino tables that are queryable" #16277

Flaky test BaseFailureRecoveryTest.testParallel: "There should be no remaining tmp_trino tables that are queryable" #16277

Comments

ebyhr commented Feb 27, 2023

ebyhr commented Feb 27, 2023

ebyhr commented Mar 2, 2023

hashhar commented Mar 3, 2023

hashhar commented Mar 3, 2023

hashhar commented Mar 3, 2023

hashhar commented Mar 4, 2023

1st Case

Case 2

Possible solutions

Alternative tests

hashhar commented Mar 4, 2023

kokosing commented Mar 6, 2023

losipiuk commented Mar 8, 2023

losipiuk commented Mar 10, 2023

mwd410 commented Mar 10, 2023

mwd410 commented Mar 10, 2023

mwd410 commented Mar 16, 2023

ebyhr commented Mar 16, 2023

anusudarsan commented Mar 23, 2023

findepi commented Mar 24, 2023

arhimondr commented Mar 24, 2023

mwd410 commented May 15, 2023

raunaqmorarka commented May 18, 2023

mwd410 commented May 19, 2023

mwd410 commented May 19, 2023

findepi commented May 26, 2023

mwd410 commented May 26, 2023

findepi commented Jun 2, 2023

findepi commented Jun 2, 2023

losipiuk commented Jun 5, 2023 • edited Loading

mwd410 commented Jun 5, 2023

mwd410 commented Jun 5, 2023

losipiuk commented Jun 6, 2023

findepi commented Jun 7, 2023

ebyhr commented Jun 8, 2023

mwd410 commented Jun 9, 2023

losipiuk commented Jun 12, 2023

findepi commented Jun 12, 2023

losipiuk commented Jun 12, 2023

Flaky test `BaseFailureRecoveryTest.testParallel`: "There should be no remaining tmp_trino tables that are queryable" #16277

Flaky test `BaseFailureRecoveryTest.testParallel`: "There should be no remaining tmp_trino tables that are queryable" #16277

losipiuk commented Jun 5, 2023 •

edited

Loading