Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding column in SQL Server connector sometimes fails even though column has been added successfully #12535

Closed
hashhar opened this issue May 25, 2022 · 30 comments · Fixed by #15984
Assignees
Labels
bug Something isn't working correctness

Comments

@hashhar
Copy link
Member

hashhar commented May 25, 2022

On SQLServer:

Error:  io.trino.plugin.sqlserver.TestSqlServerConnectorTest.testAddColumnConcurrently  Time elapsed: 1.245 s  <<< FAILURE!
java.lang.AssertionError: Unexpected concurrent add column failure
	at io.trino.testing.BaseConnectorTest.verifyConcurrentAddColumnFailurePermissible(BaseConnectorTest.java:2912)
	at io.trino.testing.BaseConnectorTest.lambda$testAddColumnConcurrently$25(BaseConnectorTest.java:2877)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
	Suppressed: java.lang.RuntimeException: Transaction (Process ID 53) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
		at io.trino.testing.AbstractTestingTrinoClient.execute(AbstractTestingTrinoClient.java:122)
		at io.trino.testing.AbstractTestingTrinoClient.execute(AbstractTestingTrinoClient.java:86)
		at io.trino.testing.DistributedQueryRunner.execute(DistributedQueryRunner.java:514)
		at io.trino.testing.BaseConnectorTest.lambda$testAddColumnConcurrently$25(BaseConnectorTest.java:2871)
		... 4 more
	Caused by: io.trino.spi.TrinoException: Transaction (Process ID 53) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
		at io.trino.plugin.jdbc.BaseJdbcClient.getColumns(BaseJdbcClient.java:283)
		at io.trino.plugin.jdbc.ForwardingJdbcClient.getColumns(ForwardingJdbcClient.java:90)
		at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.lambda$getColumns$4(StatisticsAwareJdbcClient.java:110)
		at io.trino.plugin.jdbc.jmx.JdbcApiStats.wrap(JdbcApiStats.java:35)
		at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.getColumns(StatisticsAwareJdbcClient.java:110)
		at io.trino.plugin.jdbc.CachingJdbcClient.lambda$getColumns$3(CachingJdbcClient.java:161)
		at io.trino.collect.cache.EmptyCache.get(EmptyCache.java:66)
		at io.trino.plugin.jdbc.CachingJdbcClient.get(CachingJdbcClient.java:700)
		at io.trino.plugin.jdbc.CachingJdbcClient.getColumns(CachingJdbcClient.java:161)
		at io.trino.plugin.jdbc.CachingJdbcClient.lambda$getColumns$3(CachingJdbcClient.java:161)
		at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4868)
		at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3533)
		at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2282)
		at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2159)
		at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049)
		at com.google.common.cache.LocalCache.get(LocalCache.java:3966)
		at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4863)
		at io.trino.collect.cache.EvictableCache.get(EvictableCache.java:108)
		at io.trino.plugin.jdbc.CachingJdbcClient.get(CachingJdbcClient.java:700)
		at io.trino.plugin.jdbc.CachingJdbcClient.getColumns(CachingJdbcClient.java:161)
		at io.trino.plugin.jdbc.DefaultJdbcMetadata.getColumnHandles(DefaultJdbcMetadata.java:625)
		at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getColumnHandles(ClassLoaderSafeConnectorMetadata.java:288)
		at io.trino.metadata.MetadataManager.getColumnHandles(MetadataManager.java:485)
		at io.trino.execution.AddColumnTask.execute(AddColumnTask.java:97)
		at io.trino.execution.AddColumnTask.execute(AddColumnTask.java:55)
		at io.trino.execution.DataDefinitionExecution.start(DataDefinitionExecution.java:145)
		at io.trino.execution.SqlQueryManager.createQuery(SqlQueryManager.java:243)
		at io.trino.dispatcher.LocalDispatchQuery.lambda$startExecution$7(LocalDispatchQuery.java:143)
		at io.trino.$gen.Trino_testversion____20220524_132750_5.run(Unknown Source)
		at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
		at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
		at java.base/java.lang.Thread.run(Thread.java:829)
	Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Transaction (Process ID 53) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
		at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:265)
		at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1676)
		at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:615)
		at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:537)
		at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7785)
		at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3793)
		at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:268)
		at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:242)
		at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:456)
		at com.microsoft.sqlserver.jdbc.SQLServerDatabaseMetaData.getColumns(SQLServerDatabaseMetaData.java:665)
		at io.trino.plugin.jdbc.BaseJdbcClient.getColumns(BaseJdbcClient.java:301)
		at io.trino.plugin.jdbc.BaseJdbcClient.getColumns(BaseJdbcClient.java:235)
		... 31 more
Caused by: [CIRCULAR REFERENCE: io.trino.spi.TrinoException: Transaction (Process ID 53) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.]

https://github.com/trinodb/trino/runs/6572998375?check_suite_focus=true

@hashhar hashhar added the test label May 25, 2022
@ebyhr ebyhr self-assigned this May 25, 2022
@findepi findepi added the bug Something isn't working label May 25, 2022
@findepi findepi changed the title Flaky BaseConnectorTest testAddColumnConcurrently Flaky test TestSqlServerConnectorTest testAddColumnConcurrently Aug 11, 2022
@ebyhr
Copy link
Member

ebyhr commented Jan 25, 2023

@arhimondr
Copy link
Contributor

@ebyhr
Copy link
Member

ebyhr commented Jan 26, 2023

@ebyhr
Copy link
Member

ebyhr commented Jan 26, 2023

The test looks very flaky recently. Also, it's incorrect results now, not deadlock issue as issue description.

Error:  io.trino.plugin.sqlserver.TestSqlServerConnectorTest.testAddColumnConcurrently
[INFO]   Run 1: PASS
[INFO]   Run 2: PASS
[INFO]   Run 3: PASS
Error:    Run 4: TestSqlServerConnectorTest>BaseConnectorTest.testAddColumnConcurrently:3988 [Rows for query [DESCRIBE test_add_column4pv7pdzyfm projected with [0]]] 
Expecting:
  <(col), (col3), (col2), (col0)>
to contain exactly in any order:
  <[(col), (col0), (col2)]>
but the following elements were unexpected:
  <(col3)>

@findepi
Copy link
Member

findepi commented Jan 26, 2023

https://github.com/trinodb/trino/actions/runs/4015532208/jobs/6900588157

Error:  io.trino.plugin.sqlserver.TestSqlServerConnectorTest.testAddColumnConcurrently
Error:    Run 1: TestSqlServerConnectorTest>BaseConnectorTest.testAddColumnConcurrently:3988 [Rows for query [DESCRIBE test_add_column0pae5lc2yl projected with [0]]] 
Expecting:
  <(col), (col0), (col3), (col2)>
to contain exactly in any order:
  <[(col), (col2), (col3)]>
but the following elements were unexpected:
  <(col0)>

@findepi
Copy link
Member

findepi commented Jan 27, 2023

https://github.com/trinodb/trino/actions/runs/4017027158/jobs/6903973714

Error:  io.trino.plugin.sqlserver.TestSqlServerConnectorTest.testAddColumnConcurrently  Time elapsed: 0.104 s  <<< FAILURE!
java.lang.AssertionError: 
[Rows for query [DESCRIBE test_add_columnqjp1ugtnfw projected with [0]]] 
Expecting:
  <(col), (col3), (col2), (col0), (col1)>
to contain exactly in any order:
  <[(col), (col0), (col1), (col3)]>
but the following elements were unexpected:
  <(col2)>

@findepi
Copy link
Member

findepi commented Jan 27, 2023

https://github.com/trinodb/trino/actions/runs/4022968719/jobs/6915702323

Error:    Run 3: TestSqlServerConnectorTest>BaseConnectorTest.testAddColumnConcurrently:3988 [Rows for query [DESCRIBE test_add_column18aad6bdm7 projected with [0]]] 
Expecting:
  <(col), (col0), (col1), (col3), (col2)>
to contain exactly in any order:
  <[(col), (col0), (col2), (col3)]>
but the following elements were unexpected:
  <(col1)>

@findepi
Copy link
Member

findepi commented Jan 30, 2023

https://github.com/trinodb/trino/actions/runs/4025145616/jobs/6920371453

Error:  io.trino.plugin.sqlserver.TestSqlServerConnectorTest.testAddColumnConcurrently
Error:    Run 1: TestSqlServerConnectorTest>BaseConnectorTest.testAddColumnConcurrently:3988 [Rows for query [DESCRIBE test_add_column12o4lm5g7t projected with [0]]] 
Expecting:
  <(col), (col1), (col0), (col2), (col3)>
to contain exactly in any order:
  <[(col), (col1), (col2), (col3)]>
but the following elements were unexpected:
  <(col0)>

@findepi
Copy link
Member

findepi commented Feb 1, 2023

https://github.com/trinodb/trino/actions/runs/4055689915/jobs/6985043630

[INFO]   Run 2: PASS
Error:    Run 3: TestSqlServerConnectorTest>BaseConnectorTest.testAddColumnConcurrently:3988 [Rows for query [DESCRIBE test_add_columntpnsozqctv projected with [0]]] 
Expecting:
  <(col), (col3), (col1), (col0), (col2)>
to contain exactly in any order:
  <[(col), (col1), (col2), (col3)]>
but the following elements were unexpected:
  <(col0)>

@findepi
Copy link
Member

findepi commented Feb 1, 2023

@findepi
Copy link
Member

findepi commented Feb 1, 2023

@vlad-lyutenko is working on this. Thank you Vlad!

@findepi
Copy link
Member

findepi commented Feb 1, 2023

@findinpath
Copy link
Contributor

@vlad-lyutenko
Copy link
Contributor

For me this failures looks pretty strange.
As short description - during execution we are failing to add column, but after this, during DESCRIBE, column is present.
I am not able to reproduce this locally, so I created PR https://github.com/trinodb/trino/pull/15970/files, where I added more logging and increase invocation count, so will try to catch it on CI.

@findepi
Copy link
Member

findepi commented Feb 3, 2023

Can we mark the test @Flaky or disable it while the investigation is ongoing? cc @kokosing

@ebyhr
Copy link
Member

ebyhr commented Feb 3, 2023

Disabling the test looks good to me. I guess @Flaky doesn't work because the test fails almost every time on CI.

@vlad-lyutenko
Copy link
Contributor

I modified a little bit test - in the way, that I try to insert data into columns, which were failed to add with exception:
Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

And I see, that sometimes I can successfully add data to this columns, so they actually were added.
Thats why our test is failing sometime.

So the question is, do we have some retry mechanism in engine ?

@vlad-lyutenko
Copy link
Contributor

Some info:
Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

This exception could occur not only during actual 'ALTER table ' connector execution, but also during getColumns() call of BaseJdbcClient. And in this second case, sometimes column is added despite an exception.

@vlad-lyutenko
Copy link
Contributor

vlad-lyutenko commented Feb 4, 2023

Some flow execution from logs https://github.com/trinodb/trino/actions/runs/4091562653/jobs/7055747852:

  1. Create table and submit tasks:
2023-02-04T06:44:57.610-0600	INFO	TestNGInvoker-testAddColumnConcurrently()	io.trino.testing.BaseConnectorTest	Starting add columns for table test_add_columnks9wqhln9v
2023-02-04T06:44:57.617-0600	INFO	TestNGInvoker-testAddColumnConcurrently()	io.trino.testing.BaseConnectorTest	All alter tasks submitted, futures size: 5 for table test_add_columnks9wqhln9v


2023-02-04T06:44:57.618-0600	INFO	pool-266-thread-1	io.trino.testing.BaseConnectorTest	About to add column: col0 to table test_add_columnks9wqhln9v
2023-02-04T06:44:57.623-0600	INFO	pool-266-thread-2	io.trino.testing.BaseConnectorTest	About to add column: col1 to table test_add_columnks9wqhln9v
2023-02-04T06:44:57.627-0600	INFO	pool-266-thread-3	io.trino.testing.BaseConnectorTest	About to add column: col2 to table test_add_columnks9wqhln9v
2023-02-04T06:44:57.633-0600	INFO	pool-266-thread-5	io.trino.testing.BaseConnectorTest	About to add column: col4 to table test_add_columnks9wqhln9v
2023-02-04T06:44:57.633-0600	INFO	pool-266-thread-4	io.trino.testing.BaseConnectorTest	About to add column: col3 to table test_add_columnks9wqhln9v
  1. Execution for col0 - successfull
2023-02-04T06:44:57.883-0600	INFO	dispatcher-query-10	io.trino.event.QueryMonitor	TIMELINE: Query 20230204_124457_00426_279z4 :: FINISHED :: elapsed 261ms :: planning 0ms :: waiting 0ms :: scheduling 261ms :: running 0ms :: finishing 261ms :: begin 2023-02-04T06:44:57.621-06:00 :: end 2023-02-04T06:44:57.882-06:00
2023-02-04T06:44:57.883-0600	INFO	pool-266-thread-1	io.trino.testing.BaseConnectorTest	Column col0 added for table test_add_columnks9wqhln9v
  1. Execution for col2 and col4 fails inside getColumns:
2023-02-04T06:44:59.239-0600	INFO	dispatcher-query-0	io.trino.event.QueryMonitor	TIMELINE: Query 20230204_124457_00429_279z4 :: FAILED (JDBC_ERROR) :: elapsed 1597ms :: planning 1597ms :: waiting 0ms :: scheduling 0ms :: running 0ms :: finishing 0ms :: begin 2023-02-04T06:44:57.641-06:00 :: end 2023-02-04T06:44:59.238-06:00
2023-02-04T06:44:59.240-0600	ERROR	pool-266-thread-3	io.trino.testing.BaseConnectorTest	Original exception during submitting alter column col2 for table test_add_columnks9wqhln9v:
Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
	at io.trino.plugin.jdbc.BaseJdbcClient.getColumns(BaseJdbcClient.java:347)
	... 32 more
2023-02-04T06:44:59.240-0600	WARN	pool-266-thread-3	io.trino.testing.BaseConnectorTest	Return fail value for column :col2 table test_add_columnks9wqhln9v


2023-02-04T06:44:59.241-0600	ERROR	pool-266-thread-5	io.trino.testing.BaseConnectorTest	Original exception during submitting alter column col4 for table test_add_columnks9wqhln9v:
Transaction (Process ID 52) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
	at io.trino.plugin.jdbc.BaseJdbcClient.getColumns(BaseJdbcClient.java:347)
	... 32 more
2023-02-04T06:44:59.241-0600	WARN	pool-266-thread-5	io.trino.testing.BaseConnectorTest	Return fail value for column :col4 table test_add_columnks9wqhln9v
  1. Now attempt to insert data in col2, which failed before to be added:
2023-02-04T06:44:59.261-0600	WARN	TestNGInvoker-testAddColumnConcurrently()	io.trino.testing.BaseConnectorTest	About to insert data for failed column :col2 table test_add_columnks9wqhln9v
2023-02-04T06:44:59.261-0600	INFO	dispatcher-query-0	io.trino.event.QueryMonitor	TIMELINE: Query 20230204_124457_00430_279z4 :: FINISHED :: elapsed 1618ms :: planning 0ms :: waiting 0ms :: scheduling 1618ms :: running 0ms :: finishing 1618ms :: begin 2023-02-04T06:44:57.642-06:00 :: end 2023-02-04T06:44:59.260-06:00
2023-02-04T06:44:59.339-0600	INFO	dispatcher-query-12	io.trino.event.QueryMonitor	TIMELINE: Query 20230204_124459_00431_279z4 :: FAILED (COLUMN_NOT_FOUND) :: elapsed 73ms :: planning 73ms :: waiting 0ms :: scheduling 0ms :: running 0ms :: finishing 0ms :: begin 2023-02-04T06:44:59.265-06:00 :: end 2023-02-04T06:44:59.338-06:00
2023-02-04T06:44:59.340-0600	ERROR	TestNGInvoker-testAddColumnConcurrently()	io.trino.testing.BaseConnectorTest	Exception during insert into column col2 for table test_add_columnks9wqhln9v:
io.trino.testing.QueryFailedException: Insert column name does not exist in target table: col2
  1. Now attempt to insert data in col4, which failed before to be added, but successfully added data:
2023-02-04T06:44:59.340-0600	WARN	TestNGInvoker-testAddColumnConcurrently()	io.trino.testing.BaseConnectorTest	About to insert data for failed column :col4 table test_add_columnks9wqhln9v
2023-02-04T06:44:59.846-0600	INFO	TestNGInvoker-testAddColumnConcurrently()	io.trino.testing.BaseConnectorTest	!!!!!!!!!!!! Data inserted for failed column col4 for table test_add_columnks9wqhln9v

@guiyanakuang
Copy link
Member

The default isolation level for sqlserver is READ COMMITTED. So concurrent alter and read metadata may result in deadlocks.

public void addColumn(Session session, TableHandle tableHandle, ColumnMetadata column)
{
CatalogHandle catalogHandle = tableHandle.getCatalogHandle();
ConnectorMetadata metadata = getMetadataForWrite(session, catalogHandle);
metadata.addColumn(session.toConnectorSession(catalogHandle), tableHandle.getConnectorHandle(), column);
CatalogMetadata catalogMetadata = getCatalogMetadataForWrite(session, catalogHandle);
if (catalogMetadata.getSecurityManagement() != CONNECTOR) {
TableMetadata tableMetadata = getTableMetadata(session, tableHandle);
CatalogSchemaTableName sourceTableName = new CatalogSchemaTableName(catalogHandle.getCatalogName(), tableMetadata.getTable());
systemSecurityMetadata.columnCreated(session, sourceTableName, column.getName());
}
}

From the source code, both lines 744 and 748 may throw deadlock exceptions, but line 748 throws an exception and alter actually succeeds.

Maybe add a more detailed judgment to BaseSqlServerConnectorTest#verifyConcurrentAddColumnFailurePermissible to fix it.

@guiyanakuang
Copy link
Member

guiyanakuang commented Feb 5, 2023

On further thought, the flaky failure may be directly related to #15225. The actual operation and notification to SystemSecurityMetadata is not atomic. Should all exceptions for notifications to SystemSecurityMetadata be caught?

@vlad-lyutenko
Copy link
Contributor

From the source code, both lines 744 and 748 may throw deadlock exceptions, but line 748 throws an exception and alter actually succeeds.

Maybe add a more detailed judgment to BaseSqlServerConnectorTest#verifyConcurrentAddColumnFailurePermissible to fix it.

@guiyanakuang Thanks for your comment.

Yep, it was my initial thought, but then I noticed, that we actually fails not during metadata.addColumn() call, but one step before in AddColumnTask during getting column handle

Map<String, ColumnHandle> columnHandles = plannerContext.getMetadata().getColumnHandles(session, tableHandle);

And this column is still added, so cannot understand how it could be. Maybe some retries on engine level.

Please take a look on stack trace:
Caused by: io.trino.spi.TrinoException: Transaction (Process ID 55) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. at io.trino.plugin.jdbc.BaseJdbcClient.getColumns(BaseJdbcClient.java:347) at io.trino.plugin.jdbc.ForwardingJdbcClient.getColumns(ForwardingJdbcClient.java:96) at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.lambda$getColumns$5(StatisticsAwareJdbcClient.java:116) at io.trino.plugin.jdbc.jmx.JdbcApiStats.wrap(JdbcApiStats.java:35) at io.trino.plugin.jdbc.jmx.StatisticsAwareJdbcClient.getColumns(StatisticsAwareJdbcClient.java:116) at io.trino.plugin.jdbc.CachingJdbcClient.lambda$getColumns$3(CachingJdbcClient.java:163) at io.trino.collect.cache.EmptyCache.get(EmptyCache.java:97) at io.trino.plugin.jdbc.CachingJdbcClient.get(CachingJdbcClient.java:666) at io.trino.plugin.jdbc.CachingJdbcClient.getColumns(CachingJdbcClient.java:163) at io.trino.plugin.jdbc.CachingJdbcClient.lambda$getColumns$3(CachingJdbcClient.java:163) at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4868) at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3533) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2282) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2159) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049) at com.google.common.cache.LocalCache.get(LocalCache.java:3966) at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4863) at io.trino.collect.cache.EvictableCache.get(EvictableCache.java:114) at io.trino.plugin.jdbc.CachingJdbcClient.get(CachingJdbcClient.java:666) at io.trino.plugin.jdbc.CachingJdbcClient.getColumns(CachingJdbcClient.java:163) at io.trino.plugin.jdbc.DefaultJdbcMetadata.getColumnHandles(DefaultJdbcMetadata.java:676) at io.trino.plugin.base.classloader.ClassLoaderSafeConnectorMetadata.getColumnHandles(ClassLoaderSafeConnectorMetadata.java:287) at io.trino.metadata.MetadataManager.getColumnHandles(MetadataManager.java:470) at io.trino.execution.AddColumnTask.execute(AddColumnTask.java:96) at io.trino.execution.AddColumnTask.execute(AddColumnTask.java:54) at io.trino.execution.DataDefinitionExecution.start(DataDefinitionExecution.java:145) at io.trino.execution.SqlQueryManager.createQuery(SqlQueryManager.java:249) at io.trino.dispatcher.LocalDispatchQuery.lambda$startExecution$7(LocalDispatchQuery.java:143) at io.trino.$gen.Trino_testversion____20230204_124241_141.run(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:833) Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Transaction (Process ID 55) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction. at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDatabaseError(SQLServerException.java:265) at com.microsoft.sqlserver.jdbc.SQLServerStatement.getNextResult(SQLServerStatement.java:1673) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.doExecutePreparedStatement(SQLServerPreparedStatement.java:620) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement$PrepStmtExecCmd.doExecute(SQLServerPreparedStatement.java:540) at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7627) at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:3912) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeCommand(SQLServerStatement.java:268) at com.microsoft.sqlserver.jdbc.SQLServerStatement.executeStatement(SQLServerStatement.java:242) at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeQuery(SQLServerPreparedStatement.java:459) at com.microsoft.sqlserver.jdbc.SQLServerDatabaseMetaData.getColumns(SQLServerDatabaseMetaData.java:665) at io.trino.plugin.jdbc.BaseJdbcClient.getColumns(BaseJdbcClient.java:365) at io.trino.plugin.jdbc.BaseJdbcClient.getColumns(BaseJdbcClient.java:299)

@guiyanakuang
Copy link
Member

guiyanakuang commented Feb 5, 2023

Yes, there are three places where a deadlock exception may be thrown.

  1. Map<String, ColumnHandle> columnHandles = plannerContext.getMetadata().getColumnHandles(session, tableHandle);
  2. metadata.addColumn(session.toConnectorSession(catalogHandle), tableHandle.getConnectorHandle(), column);
  3. TableMetadata tableMetadata = getTableMetadata(session, tableHandle);
No. Order of call Jdbc api Does the test fail when a deadlock exception is thrown (Y/N)
1 1 read No. No alter has been executed. This throws a deadlock, which is caught and returned Option.empty(). As expected.
2 2 alter No. Alter execution throws an exception and the test case returns Option.empty(). No fields have been added. As expected.
3 3 read Yes. Alter has been executed,. After this an exception is thrown and the test case returns Option.empty(). The field has actually been added. Not as expected.

And this column is still added, so cannot understand how it could be. Maybe some retries on engine level.

@vlad-lyutenko. So the test will actually only fail if the code deadlocks in the third place. Because in the second place we have successfully executed alter. Sqlserver does not retry.

@vlad-lyutenko
Copy link
Contributor

@guiyanakuang
Big thanks, now it's clear, I missed this 3rd call (read after alter), I thought that it's failing on first read and was wonder how column still could be added.

@vlad-lyutenko
Copy link
Contributor

vlad-lyutenko commented Feb 5, 2023

So what you think could be good fix for, this test.
Ideally would be catch and pass this exception in verifyConcurrentAddColumnFailurePermissible.
But not sure we are able to understand that exception occurs on 3rd read not on first.

Or check by performing force insert into column to understand that column actually added?

@guiyanakuang
Copy link
Member

On further thought, the flaky failure may be directly related to #15225. The actual operation and notification to SystemSecurityMetadata is not atomic. Should all exceptions for notifications to SystemSecurityMetadata be caught?

I now prefer to add exception catching to anything like the third place (#15225 added event notification code)

@findepi findepi changed the title Flaky test TestSqlServerConnectorTest testAddColumnConcurrently Adding column in SQL Server connector sometimes fails even though column has been added successfully Feb 6, 2023
@findepi
Copy link
Member

findepi commented Feb 6, 2023

I now prefer to add exception catching to anything like the third place (#15225 added event notification code)

Just note that this may or may not fix the issue.
I just realized that this issue was first reported long before #15225 was merged. Seems I hijacked an old thread. I hope the original problem is long gone.

@guiyanakuang
Copy link
Member

Just note that this may or may not fix the issue.

This issue was created on May 25, 2022.
It appears to have been fixed with #12695 (Jun 6, 2022), but the issue is not closed.

#15225 was merged into master a two weeks ago.
The bug reports for this issue have also started appearing frequently in the last two weeks. This also confirms it.

findepi added a commit that referenced this issue Feb 6, 2023
This reverts commit 33fdaa0.

The current implementation has the following problems:

- does not work for Hive connector with `hive.security=system`
  (#15225 (comment))
- notifications are propagated even if transaction doesn't end up committing
  the DDL changes (#15225 (comment))
- notification propagation may disrupt transaction finish even if changes already
  applied in the external system (#12535 (comment)),
  for any connector that's using SYSTEM security (Iceberg, Delta with system security, or any JDBC connector)
@kokosing
Copy link
Member

kokosing commented Feb 6, 2023

findepi added a commit that referenced this issue Feb 8, 2023
This reverts commit 33fdaa0.

The current implementation has the following problems:

- does not work for Hive connector with `hive.security=system`
  (#15225 (comment))
- notifications are propagated even if transaction doesn't end up committing
  the DDL changes (#15225 (comment))
- notification propagation may disrupt transaction finish even if changes already
  applied in the external system (#12535 (comment)),
  for any connector that's using SYSTEM security (Iceberg, Delta with system security, or any JDBC connector)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working correctness
Development

Successfully merging a pull request may close this issue.

8 participants