Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky test BaseFailureRecoveryTest.testParallel: "There should be no remaining tmp_trino tables that are queryable" #16277

Closed
ebyhr opened this issue Feb 27, 2023 · 42 comments
Assignees
Labels
bigquery BigQuery connector bug Something isn't working test

Comments

@ebyhr
Copy link
Member

ebyhr commented Feb 27, 2023

https://github.com/trinodb/trino/actions/runs/4279764588/jobs/7451040205

Error:    TestBigQueryQueryFailureRecoveryTest>BaseFailureRecoveryTest.testParallel:218->BaseFailureRecoveryTest.testInsert:235->BaseFailureRecoveryTest.testTableModification:333->BaseFailureRecoveryTest.testTableModification:338->BaseFailureRecoveryTest.testNonSelect:367->BaseFailureRecoveryTest.lambda$testNonSelect$10:367 
Expecting throwable message:
  <"[There should be no remaining tmp_trino_ab7ec455_* tables. They are: [tmp_trino_ab7ec455_517f28a7]] 
Expecting:
 <1>
to be equal to:
 <0>
but was not.">
to contain:
  <"This error is injected by the failure injection service">
but did not.

Throwable that failed the check:

java.lang.AssertionError: [There should be no remaining tmp_trino_ab7ec455_* tables. They are: [tmp_trino_ab7ec455_517f28a7]] 
Expecting:
 <1>
to be equal to:
 <0>
but was not.
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.execute(BaseFailureRecoveryTest.java:531)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.executeActual(BaseFailureRecoveryTest.java:496)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.executeActualNoRetries(BaseFailureRecoveryTest.java:481)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.lambda$failsWithoutRetries$12(BaseFailureRecoveryTest.java:673)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.failsWithoutRetries(BaseFailureRecoveryTest.java:673)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.failsAlways(BaseFailureRecoveryTest.java:664)
	at io.trino.testing.BaseFailureRecoveryTest.testNonSelect(BaseFailureRecoveryTest.java:367)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:338)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:333)
	at io.trino.testing.BaseFailureRecoveryTest.testInsert(BaseFailureRecoveryTest.java:235)
	at io.trino.testing.BaseFailureRecoveryTest$ParallelTestRunnable.run(BaseFailureRecoveryTest.java:850)
	at io.trino.testing.BaseFailureRecoveryTest.testParallel(BaseFailureRecoveryTest.java:218)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
	at org.testng.internal.TestMethodWithDataProviderMethodWorker.call(TestMethodWithDataProviderMethodWorker.java:75)
	at org.testng.internal.TestMethodWithDataProviderMethodWorker.call(TestMethodWithDataProviderMethodWorker.java:14)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
@ebyhr ebyhr added bug Something isn't working test bigquery BigQuery connector labels Feb 27, 2023
@ebyhr
Copy link
Member Author

ebyhr commented Feb 27, 2023

cc: @mwd410 @hashhar

@ebyhr ebyhr changed the title Flaky TestBigQueryQueryFailureRecoveryTest.testParallel Flaky BaseFailureRecoveryTest.testParallel Mar 2, 2023
@ebyhr
Copy link
Member Author

ebyhr commented Mar 2, 2023

TestMySqlQueryFailureRecoveryTest.testParallel also failed.
https://github.com/trinodb/trino/actions/runs/4301647111/jobs/7502205035

@hashhar
Copy link
Member

hashhar commented Mar 3, 2023

I suspect a query id hash collision, let me see if changing the temp table name generation helps.

@hashhar
Copy link
Member

hashhar commented Mar 3, 2023

testCreateTable, testInsert and testExplainAnalyze seem to the ones which fail more frequently.

At the moment since I can't reproduce this at will I had to look at the code and make guesses. The one thing which caught my attention is that we use a Closer to drop tables and Closer suppresses any exceptions thrown when calling close on the resource being closed so it's possible we're not seeing the actual failures during dropTable.

cc: @mwd410 for any ideas

@hashhar
Copy link
Member

hashhar commented Mar 3, 2023

Found the problem, MySQL doesn't have a strongly consistent information schema so sometimes tables appear to exist even after they've been dropped. I verified this by checking if the table that the test complains was not cleaned up actually exists or not. In none of the cases I was able to find the table which the test claims was not cleaned up.

I'll check if an assertEventually helps.

@hashhar
Copy link
Member

hashhar commented Mar 4, 2023

Well, MySQL's information_schema seems to act horribly under contention.

Here's some logs.

1st Case

Table appears to exist as seen from Trino.
Query directly against MySQL shows table doesn't exist anymore.

Note that it can take multiple retries from Trino for table to be seen as deleted.

2023-03-04T12:58:21.127-0600	ERROR	PoolService-1	FLAKY_FTE_111bc028	Trino: leftover tables: 1; prefix: tmp_trino_111bc028_
2023-03-04T12:58:21.127-0600	ERROR	PoolService-1	FLAKY_FTE_111bc028	MySQL: information_schema for prefix tmp_trino_111bc028_:
...
2023-03-04T12:58:58.140-0600	ERROR	PoolService-3	FLAKY_FTE_058ade3b	Trino: leftover tables: 1; prefix: tmp_trino_058ade3b_
2023-03-04T12:58:58.140-0600	ERROR	PoolService-3	FLAKY_FTE_058ade3b	MySQL: information_schema for prefix tmp_trino_058ade3b_:

Case 2

Table seen as exists on Trino.
Information_schema on MySQL also shows it exists but a query against the table fails.
Now listing from Trino shows the table is gone.

2023-03-04T13:08:48.232-0600	ERROR	PoolService-9	FLAKY_FTE_254e92ae	Trino: leftover tables: 1; prefix: tmp_trino_254e92ae_
2023-03-04T13:08:48.232-0600	ERROR	PoolService-9	FLAKY_FTE_254e92ae	MySQL: information_schema for prefix tmp_trino_254e92ae_: tmp_trino_254e92ae_b1402314
2023-03-04T13:08:48.331-0600	ERROR	PoolService-9	FLAKY_FTE_254e92ae	MySQL: tmp_trino_254e92ae_b1402314 does not exist; error: java.sql.SQLSyntaxErrorException: Table 'tpch.tmp_trino_254e92ae_b1402314' doesn't exist [statement:"SELECT count(1) FROM tmp_trino_254e92ae_b1402314", arguments:{positional:{}, named:{}, finder:[]}]
2023-03-04T13:08:48.521-0600	ERROR	PoolService-9	FLAKY_FTE_254e92ae	Trino: information_schema for prefix tmp_trino_254e92ae_:
...
2023-03-04T13:09:26.991-0600	ERROR	PoolService-0	FLAKY_FTE_ac10452c	Trino: leftover tables: 1; prefix: tmp_trino_ac10452c_
2023-03-04T13:09:26.991-0600	ERROR	PoolService-0	FLAKY_FTE_ac10452c	MySQL: information_schema for prefix tmp_trino_ac10452c_: tmp_trino_ac10452c_972a56ba
2023-03-04T13:09:27.101-0600	ERROR	PoolService-0	FLAKY_FTE_ac10452c	MySQL: tmp_trino_ac10452c_972a56ba does not exist; error: java.sql.SQLSyntaxErrorException: Table 'tpch.tmp_trino_ac10452c_972a56ba' doesn't exist [statement:"SELECT count(1) FROM tmp_trino_ac10452c_972a56ba", arguments:{positional:{}, named:{}, finder:[]}]
2023-03-04T13:09:27.289-0600	ERROR	PoolService-0	FLAKY_FTE_ac10452c	Trino: information_schema for prefix tmp_trino_ac10452c_:

Possible solutions

assertEventually doesn't help always, test is still flaky though lesser chance of occurrence.

Reducing concurrency also doesn't seem to help with the failure.

Alternative tests

Instead of observing the database state we can try to instead check whether we issued DROP TABLE statements for all temporary tables. This needs us to implement a generic mechanism to record queries regardless of connector (BaseJdbc vs others).

@mwd410 @losipiuk Do you have any ideas how to address this? This will be a problem with any system which doesn't have a strongly consistent information_schema like MySQL, SingleStore and maybe others.

@hashhar
Copy link
Member

hashhar commented Mar 4, 2023

Also, maybe we can disable the validation by default but keep the code in there so we can test the cleanup manually when adding FTE to some new connector only.

@kokosing
Copy link
Member

kokosing commented Mar 6, 2023

This will be a problem with any system which doesn't have a strongly consistent information_schema like MySQL, SingleStore and maybe others.

I think that since it is remote database specific, we need to handle this by some abstraction that will be implemented per each remote database (connector). And in some cases where information_schema is very unreliable we need to accept that we could consider skipping that assertions. Hopefully assertEventually in few of them would be good enough solution.

@losipiuk
Copy link
Member

losipiuk commented Mar 8, 2023

This will be a problem with any system which doesn't have a strongly consistent information_schema like MySQL, SingleStore and maybe others.

I think that since it is remote database specific, we need to handle this by some abstraction that will be implemented per each remote database (connector). And in some cases where information_schema is very unreliable we need to accept that we could consider skipping that assertions. Hopefully assertEventually in few of them would be good enough solution.

Yeah - I agree. Maybe we can just expose protected assertNoTmpTablesLeftBehind() with default implementation and override where neeeded.

@losipiuk
Copy link
Member

@mwd410 can you look into this one?

@mwd410
Copy link
Contributor

mwd410 commented Mar 10, 2023

We could, instead of simply relying on show tables like to return no entries, we could do

assertThatThrownBy(() -> getQueryRunner().execute("SELECT 1 FROM %s WHERE 1 = 0".formatted(tempTableName)))
                        .hasMessageContaining("%s does not exist", tempTableName);

Thoughts?

@mwd410
Copy link
Contributor

mwd410 commented Mar 10, 2023

I have opened #16499 to address this.

@mwd410
Copy link
Contributor

mwd410 commented Mar 16, 2023

@ebyhr - i merged my pull request the other day - have se seen any of these errors since then? If not can we resolve this issue?

@ebyhr
Copy link
Member Author

ebyhr commented Mar 16, 2023

@mwd410 Thanks! Let's close and reopen if needed.

@ebyhr ebyhr closed this as completed Mar 16, 2023
@anusudarsan
Copy link
Member

@anusudarsan anusudarsan reopened this Mar 23, 2023
@findepi
Copy link
Member

findepi commented Mar 24, 2023

https://github.com/trinodb/trino/actions/runs/4509499706/jobs/7939611131?pr=16707

Error:  io.trino.plugin.postgresql.TestPostgresQueryFailureRecoveryTest.testParallel[testCreateTable](8)  Time elapsed: 57.04 s  <<< FAILURE!
java.lang.AssertionError: expected [1] but found [2]
	at org.testng.Assert.fail(Assert.java:94)
	at org.testng.Assert.failNotEquals(Assert.java:513)
	at org.testng.Assert.assertEqualsImpl(Assert.java:135)
	at org.testng.Assert.assertEquals(Assert.java:116)
	at org.testng.Assert.assertEquals(Assert.java:389)
	at org.testng.Assert.assertEquals(Assert.java:399)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.finishesSuccessfully(BaseFailureRecoveryTest.java:651)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.finishesSuccessfully(BaseFailureRecoveryTest.java:641)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.finishesSuccessfully(BaseFailureRecoveryTest.java:631)
	at io.trino.testing.BaseFailureRecoveryTest.testNonSelect(BaseFailureRecoveryTest.java:389)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:348)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:343)
	at io.trino.testing.BaseFailureRecoveryTest.testCreateTable(BaseFailureRecoveryTest.java:237)
	at io.trino.testing.BaseFailureRecoveryTest$ParallelTestRunnable.run(BaseFailureRecoveryTest.java:905)
	at io.trino.testing.BaseFailureRecoveryTest.testParallel(BaseFailureRecoveryTest.java:228)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
	at org.testng.internal.TestMethodWithDataProviderMethodWorker.call(TestMethodWithDataProviderMethodWorker.java:75)
	at org.testng.internal.TestMethodWithDataProviderMethodWorker.call(TestMethodWithDataProviderMethodWorker.java:14)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

cc @mwd410 @arhimondr @losipiuk @linzebing

@arhimondr
Copy link
Contributor

@mwd410 Could you please take a look?

@mwd410
Copy link
Contributor

mwd410 commented May 15, 2023

From a cursory glance, this appears to be due to Oracle's "Recycling Bin" mechanism. I have opened a pr to address this issue, where I also give further explanation.

@findepi @ebyhr

@raunaqmorarka
Copy link
Member

Still encountering this after latest fix
https://github.com/trinodb/trino/actions/runs/5004142646/jobs/8966655659?pr=17212

io.trino.plugin.oracle.TestOracleQueryFailureRecoveryTest.testParallel[testCreateTable](9)  Time elapsed: 54.428 s  <<< FAILURE!
org.junit.ComparisonFailure: 
[There should be no remaining tmp_trino tables that are queryable. They are:
	For queryId [20230517_144752_00198_biqit] (prefix [tmp_trino_9d29747f_]) remaining tables: [tmp_trino_9d29747f_06f5762c]
		With errors: [
			Expecting code to raise a throwable.]] 
Expecting value to be true but was false expected:<[tru]e> but was:<[fals]e>
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
	at io.trino.testing.BaseFailureRecoveryTest.checkTemporaryTables(BaseFailureRecoveryTest.java:463)
	at io.trino.testing.BaseFailureRecoveryTest$FailureRecoveryAssert.cleansUpTemporaryTables(BaseFailureRecoveryTest.java:624)
	at io.trino.testing.BaseFailureRecoveryTest.testNonSelect(BaseFailureRecoveryTest.java:409)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:347)
	at io.trino.testing.BaseFailureRecoveryTest.testTableModification(BaseFailureRecoveryTest.java:342)
	at io.trino.testing.BaseFailureRecoveryTest.testCreateTable(BaseFailureRecoveryTest.java:236)
	at io.trino.testing.BaseFailureRecoveryTest$ParallelTestRunnable.run(BaseFailureRecoveryTest.java:910)
	at io.trino.testing.BaseFailureRecoveryTest.testParallel(BaseFailureRecoveryTest.java:227)

@mwd410
Copy link
Contributor

mwd410 commented May 19, 2023

@raunaqmorarka - OK, I spent all day trying to reproduce this locally, and then looked closer at the output and found this. Documentation suggests it means the table's lock is already held by a different query, but I have no idea what that might be - any ideas? This will require more investigation next week, but at least we have a thread to pull on.

Note: I also see the same error in the previous run that re-opened this issue for the same reason, so I think my previously added PURGE clause in the drop table was unnecessary (though probably wise to keep nonetheless)

2023-05-17T08:47:52.722-0600	ERROR	transaction-finishing-0	io.trino.metadata.CatalogMetadata	Connector threw exception on abort
io.trino.spi.TrinoException: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1201)
	at io.trino.plugin.oracle.OracleClient.dropTable(OracleClient.java:310)
	at io.trino.plugin.jdbc.BaseJdbcClient.rollbackCreateTable(BaseJdbcClient.java:1050)
	at io.trino.plugin.jdbc.ForwardingJdbcClient.rollbackCreateTable(ForwardingJdbcClient.java:249)
	at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.lambda$rollbackCreateTable$36(StatisticsAwareJdbcClient.java:330)
	at io.trino.plugin.jdbc.jmx.JdbcApiStats.wrap(JdbcApiStats.java:47)
	at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.rollbackCreateTable(StatisticsAwareJdbcClient.java:330)
	at io.trino.plugin.jdbc.CachingJdbcClient.rollbackCreateTable(CachingJdbcClient.java:392)
	at io.trino.plugin.jdbc.CachingJdbcClient.rollbackCreateTable(CachingJdbcClient.java:392)
	at io.trino.plugin.jdbc.DefaultJdbcMetadata.lambda$beginCreateTable$16(DefaultJdbcMetadata.java:778)
	at java.base/java.util.Optional.ifPresent(Optional.java:178)
	at io.trino.plugin.jdbc.DefaultJdbcMetadata.rollback(DefaultJdbcMetadata.java:841)
	at io.trino.plugin.jdbc.JdbcTransactionManager.rollback(JdbcTransactionManager.java:64)
	at io.trino.plugin.jdbc.JdbcConnector.rollback(JdbcConnector.java:109)
	at io.trino.metadata.CatalogTransaction.abort(CatalogTransaction.java:93)
	at io.trino.metadata.CatalogMetadata.safeAbort(CatalogMetadata.java:167)
	at io.trino.metadata.CatalogMetadata.abort(CatalogMetadata.java:161)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
	at io.trino.$gen.Trino_testversion____20230517_144649_596.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.sql.SQLException: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:629)
	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:563)
	at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1230)
	at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:771)
	at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:298)
	at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:511)
	at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:122)
	at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1199)
	at oracle.jdbc.driver.OracleStatement.executeSQLStatement(OracleStatement.java:1819)
	at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1471)
	at oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:2504)
	at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:2459)
	at oracle.jdbc.driver.OracleStatementWrapper.execute(OracleStatementWrapper.java:327)
	at oracle.ucp.jdbc.proxy.oracle$1ucp$1jdbc$1proxy$1oracle$1StatementProxy$2oracle$1jdbc$1internal$1OracleStatement$$$Proxy.execute(Unknown Source)
	at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1211)
	at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1198)
	... 25 more
	Suppressed: java.lang.RuntimeException: Query: DROP TABLE "TRINO_TEST"."TMP_TRINO_9D29747F_06F5762C" PURGE
		at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1214)
		... 26 more
Caused by: Error : 54, Position : 24, Sql = DROP TABLE "TRINO_TEST"."TMP_TRINO_9D29747F_06F5762C" PURGE, OriginalSql = DROP TABLE "TRINO_TEST"."TMP_TRINO_9D29747F_06F5762C" PURGE, Error Msg = ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:636)
	... 40 more

@mwd410
Copy link
Contributor

mwd410 commented May 19, 2023

Ah, according to this, I believe the solution is to lock the table ahead of dropping it. Apparently, by default, oracle does not wait for other transactions on a table to complete before acquiring its lock to drop it.

I suspect this is caused by a race condition between injected errors while some other thread is inserting into the table (e.g. a JdbcPageSink), then we try to rollbackCreateTable while that other transaction is still running. Note: the above trace does include rollbackCreateTable.

@findepi
Copy link
Member

findepi commented May 26, 2023

@mwd410
Copy link
Contributor

mwd410 commented May 26, 2023

sorry for the delay on this - there was some review on my PR which I just got in moments ago.

@findepi
Copy link
Member

findepi commented Jun 2, 2023

https://github.com/trinodb/trino/actions/runs/5141998837/jobs/9255271329?pr=17722

Error:  Failures: 
Error:    TestOracleQueryFailureRecoveryTest>BaseFailureRecoveryTest.testParallel:227->BaseFailureRecoveryTest.testInsert:244->BaseFailureRecoveryTest.testTableModification:342->BaseFailureRecoveryTest.testTableModification:347->BaseFailureRecoveryTest.testNonSelect:409->BaseFailureRecoveryTest.checkTemporaryTables:463 [There should be no remaining tmp_trino tables that are queryable. They are:
	For queryId [20230601_084133_00223_u7vng] (prefix [tmp_trino_ec95e300_]) remaining tables: [tmp_trino_ec95e300_4b157d1c]
		With errors: [
			Expecting code to raise a throwable.]] 
Expecting value to be true but was false expected:<[tru]e> but was:<[fals]e>

@findepi
Copy link
Member

findepi commented Jun 2, 2023

@mwd410 what's the status of this?

@losipiuk
Copy link
Member

losipiuk commented Jun 5, 2023

There was hope it would get fixed with: #17575. cc: @mwd410

Apparently did not help.
We are still getting

io.trino.spi.TrinoException: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at io.trino.plugin.oracle.OracleClient.dropTable(OracleClient.java:321)
	at io.trino.plugin.jdbc.BaseJdbcClient.rollbackCreateTable(BaseJdbcClient.java:1050)
	at io.trino.plugin.jdbc.ForwardingJdbcClient.rollbackCreateTable(ForwardingJdbcClient.java:249)
	at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.lambda$rollbackCreateTable$36(StatisticsAwareJdbcClient.java:330)
	at io.trino.plugin.jdbc.jmx.JdbcApiStats.wrap(JdbcApiStats.java:47)
	at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.rollbackCreateTable(StatisticsAwareJdbcClient.java:330)
	at io.trino.plugin.jdbc.CachingJdbcClient.rollbackCreateTable(CachingJdbcClient.java:392)
	at io.trino.plugin.jdbc.CachingJdbcClient.rollbackCreateTable(CachingJdbcClient.java:392)
	at io.trino.plugin.jdbc.DefaultJdbcMetadata.lambda$beginInsert$20(DefaultJdbcMetadata.java:853)
	at java.base/java.util.Optional.ifPresent(Optional.java:178)
	at io.trino.plugin.jdbc.DefaultJdbcMetadata.rollback(DefaultJdbcMetadata.java:841)
	at io.trino.plugin.jdbc.JdbcTransactionManager.rollback(JdbcTransactionManager.java:64)
	at io.trino.plugin.jdbc.JdbcConnector.rollback(JdbcConnector.java:109)
	at io.trino.metadata.CatalogTransaction.abort(CatalogTransaction.java:93)
	at io.trino.metadata.CatalogMetadata.safeAbort(CatalogMetadata.java:167)
	at io.trino.metadata.CatalogMetadata.abort(CatalogMetadata.java:161)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:74)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
	at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
	at io.trino.$gen.Trino_testversion____20230601_084025_3091.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.sql.SQLException: ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:629)
	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:563)
	at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1230)
	at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:771)
	at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:298)
	at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:511)
	at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:122)
	at oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1199)
	at oracle.jdbc.driver.OracleStatement.executeSQLStatement(OracleStatement.java:1819)
	at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1471)
	at oracle.jdbc.driver.OracleStatement.executeInternal(OracleStatement.java:2504)
	at oracle.jdbc.driver.OracleStatement.execute(OracleStatement.java:2459)
	at oracle.jdbc.driver.OracleStatementWrapper.execute(OracleStatementWrapper.java:327)
	at oracle.ucp.jdbc.proxy.oracle$1ucp$1jdbc$1proxy$1oracle$1StatementProxy$2oracle$1jdbc$1internal$1OracleStatement$$$Proxy.execute(Unknown Source)
	at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1211)
	at io.trino.plugin.oracle.OracleClient.dropTable(OracleClient.java:318)
	... 24 more
	Suppressed: java.lang.RuntimeException: Query: DROP TABLE "TRINO_TEST"."TMP_TRINO_EC95E300_4B157D1C" PURGE
		at io.trino.plugin.jdbc.BaseJdbcClient.execute(BaseJdbcClient.java:1214)
		... 25 more
Caused by: Error : 54, Position : 24, Sql = DROP TABLE "TRINO_TEST"."TMP_TRINO_EC95E300_4B157D1C" PURGE, OriginalSql = DROP TABLE "TRINO_TEST"."TMP_TRINO_EC95E300_4B157D1C" PURGE, Error Msg = ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired

	at oracle.jdbc.driver.T4CTTIoer11.processError(T4CTTIoer11.java:636)
	... 39 more

@mwd410
Copy link
Contributor

mwd410 commented Jun 5, 2023

hmm. i have one thread to pull on. looking

@mwd410
Copy link
Contributor

mwd410 commented Jun 5, 2023

Oh the perils of forgetting about auto commit. 🤦

@losipiuk
Copy link
Member

losipiuk commented Jun 6, 2023

Merged #17756 (thanks @mwd410 ). Hopefully it will solve the issue.

@findepi
Copy link
Member

findepi commented Jun 7, 2023

🤞

@findepi findepi closed this as completed Jun 7, 2023
@ebyhr
Copy link
Member Author

ebyhr commented Jun 8, 2023

Let me reopen because the test is still failing on master.

Error:  Failures: 
Error:    TestOracleQueryFailureRecoveryTest>BaseFailureRecoveryTest.testParallel:227->BaseFailureRecoveryTest.testExplainAnalyze:320->BaseFailureRecoveryTest.testTableModification:342->BaseFailureRecoveryTest.testTableModification:347->BaseFailureRecoveryTest.testNonSelect:409->BaseFailureRecoveryTest.checkTemporaryTables:463 [There should be no remaining tmp_trino tables that are queryable. They are:
	For queryId [20230607_221201_00339_7dv8e] (prefix [tmp_trino_68459c86_]) remaining tables: [tmp_trino_68459c86_f31eaf34]
		With errors: [
			Expecting code to raise a throwable.]] 
Expecting value to be true but was false expected:<[tru]e> but was:<[fals]e>

https://github.com/trinodb/trino/actions/runs/5203676867/jobs/9388677568

@mwd410
Copy link
Contributor

mwd410 commented Jun 9, 2023

i'm kind of at a loss here. is it possibly that we are

- opening a connection
  - creating first statement, acquiring lock, closing first statement
  - creating second statement, dropping table, closing second statement
- closing connection

whereas perhaps we should be

- opening a connection
  - creating first statement
  - creating second statement
  - acquiring lock
  - dropping table
  - closing first and second statement
- closing connection

I don't have any other ideas. This doesn't seem to be behaving as documented or how i understand it should behave. Am i missing something?

@losipiuk
Copy link
Member

IDK what is happening here but maybe we can try to add some extra loggin (e.g. dump locks information before executing DROP STATEMENT). Maybe it will shed some light.
It is happening fairly often so you should be able to easily reporduce that on CI with modified ci.yml so you run just impacted job but many times @mwd410

@findepi
Copy link
Member

findepi commented Jun 12, 2023

in the meantime, would it make sense to mark TestOracleQueryFailureRecoveryTest.testParallel as @Flaky @losipiuk @mwd410 ?

@losipiuk
Copy link
Member

in the meantime, would it make sense to mark TestOracleQueryFailureRecoveryTest.testParallel as @flaky @losipiuk @mwd410 ?

Good idea. @mwd410 please send the PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery BigQuery connector bug Something isn't working test
Development

No branches or pull requests

9 participants