Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestHive3OnDataLake, TestHiveRoles flaky (SchemaAlreadyExists during test initialization) #10772

Closed
losipiuk opened this issue Jan 24, 2022 · 9 comments
Labels
bug Something isn't working test

Comments

@losipiuk
Copy link
Member

Example: https://github.com/trinodb/trino/runs/4904524264?check_suite_focus=true

2022-01-22T03:09:36.5153964Z [ERROR] io.trino.plugin.hive.TestHive3OnDataLake.init  Time elapsed: 37.987 s  <<< FAILURE!
2022-01-22T03:09:36.5154716Z io.trino.plugin.hive.SchemaAlreadyExistsException: Schema already exists: 'tpch'
2022-01-22T03:09:36.5155468Z 	at io.trino.plugin.hive.metastore.thrift.ThriftHiveMetastore.createDatabase(ThriftHiveMetastore.java:1003)
2022-01-22T03:09:36.5156337Z 	at io.trino.plugin.hive.metastore.thrift.BridgingHiveMetastore.createDatabase(BridgingHiveMetastore.java:164)
2022-01-22T03:09:36.5157069Z 	at io.trino.plugin.hive.HiveQueryRunner$Builder.populateData(HiveQueryRunner.java:211)
2022-01-22T03:09:36.5157651Z 	at io.trino.plugin.hive.HiveQueryRunner$Builder.build(HiveQueryRunner.java:197)
2022-01-22T03:09:36.5158225Z 	at io.trino.plugin.hive.s3.S3HiveQueryRunner$Builder.build(S3HiveQueryRunner.java:160)
2022-01-22T03:09:36.5158800Z 	at io.trino.plugin.hive.s3.S3HiveQueryRunner.create(S3HiveQueryRunner.java:51)
2022-01-22T03:09:36.5159472Z 	at io.trino.plugin.hive.BaseTestHiveOnDataLake.createQueryRunner(BaseTestHiveOnDataLake.java:89)
2022-01-22T03:09:36.5160182Z 	at io.trino.testing.AbstractTestQueryFramework.init(AbstractTestQueryFramework.java:83)
2022-01-22T03:09:36.5160830Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2022-01-22T03:09:36.5161505Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2022-01-22T03:09:36.5162269Z 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2022-01-22T03:09:36.5162888Z 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
2022-01-22T03:09:36.5163476Z 	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
2022-01-22T03:09:36.5164122Z 	at org.testng.internal.Invoker.invokeConfigurationMethod(Invoker.java:515)
2022-01-22T03:09:36.5164708Z 	at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:217)
2022-01-22T03:09:36.5165265Z 	at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:144)
2022-01-22T03:09:36.5165895Z 	at org.testng.internal.TestMethodWorker.invokeBeforeClassMethods(TestMethodWorker.java:169)
2022-01-22T03:09:36.5166529Z 	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:108)
2022-01-22T03:09:36.5167533Z 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2022-01-22T03:09:36.5168181Z 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2022-01-22T03:09:36.5168665Z 	at java.base/java.lang.Thread.run(Thread.java:829)
2022-01-22T03:09:36.5169158Z 
2022-01-22T03:09:36.5172073Z [ERROR] io.trino.plugin.hive.TestHiveRoles.afterMethod  Time elapsed: 0.001 s  <<< FAILURE!
2022-01-22T03:09:36.5172622Z java.lang.IllegalStateException: queryRunner not set
2022-01-22T03:09:36.5173272Z 	at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
2022-01-22T03:09:36.5217912Z 	at io.trino.testing.AbstractTestQueryFramework.getQueryRunner(AbstractTestQueryFramework.java:450)
2022-01-22T03:09:36.5220909Z 	at io.trino.plugin.hive.AbstractTestHiveRoles.executeFromAdmin(AbstractTestHiveRoles.java:503)
2022-01-22T03:09:36.5223796Z 	at io.trino.plugin.hive.AbstractTestHiveRoles.listRoles(AbstractTestHiveRoles.java:466)
2022-01-22T03:09:36.5226953Z 	at io.trino.plugin.hive.AbstractTestHiveRoles.afterMethod(AbstractTestHiveRoles.java:64)
2022-01-22T03:09:36.5229795Z 	at io.trino.plugin.hive.TestHiveRoles.afterMethod(TestHiveRoles.java:18)
2022-01-22T03:09:36.5231799Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2022-01-22T03:09:36.5233475Z 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2022-01-22T03:09:36.5235256Z 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2022-01-22T03:09:36.5236974Z 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
2022-01-22T03:09:36.5238347Z 	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
2022-01-22T03:09:36.5239659Z 	at org.testng.internal.Invoker.invokeConfigurationMethod(Invoker.java:515)
2022-01-22T03:09:36.5240450Z 	at org.testng.internal.Invoker.invokeConfigurations(Invoker.java:217)
2022-01-22T03:09:36.5241445Z 	at org.testng.internal.Invoker.invokeMethod(Invoker.java:712)
2022-01-22T03:09:36.5241998Z 	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
2022-01-22T03:09:36.5242539Z 	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
2022-01-22T03:09:36.5243131Z 	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
2022-01-22T03:09:36.5243850Z 	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
2022-01-22T03:09:36.5244466Z 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
2022-01-22T03:09:36.5245120Z 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
2022-01-22T03:09:36.5245603Z 	at java.base/java.lang.Thread.run(Thread.java:829)
@losipiuk losipiuk added bug Something isn't working test labels Jan 24, 2022
@losipiuk
Copy link
Member Author

cc: @aczajkowski (not for a very specific reason - just you happen to touch stuff around somewhat recently)

@aczajkowski
Copy link
Member

aczajkowski commented Jan 25, 2022

As for the first error:

            if (metastore.getDatabase(TPCH_SCHEMA).isEmpty()) {
                metastore.createDatabase(identity, createDatabaseMetastoreObject(TPCH_SCHEMA, initialSchemasLocationBase));
                Session session = initialTablesSessionMutator.apply(createSession(Optional.empty()));
                copyTpchTables(queryRunner, "tpch", TINY_SCHEMA_NAME, session, initialTables);
            }

It looks that for some reason we manage to get some trash cross test in MinIO.
As metastore.getDatabase(TPCH_SCHEMA).isEmpty() would filter out cases when HMS already has metadata.
So this look that there is no metadata. So we call HMS to create one.
But when we try to create resources in MinIO (effectively HMS tries to) we get error.
Looks similar to and old one some time ago which we never managed to narrow down. #9415

@findepi findepi changed the title TestHiveRoles flaky (SchemaAlreadyExists during test initialization) TestHive3OnDataLake, TestHiveRoles flaky (SchemaAlreadyExists during test initialization) Mar 3, 2022
@findepi
Copy link
Member

findepi commented Mar 3, 2022

@Praveen2112
Copy link
Member

@aczajkowski
Copy link
Member

I believe root cause of issue is here:

.run("createDatabase", stats.getCreateDatabase().wrap(() -> {

It's also similar problem in createTable seen here and identified by @findinpath #10894 (comment)

We retry on all TTransportException's and one of them is timeout.
org.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out
In such case seems Hive does what it was asked for, but we just don't know about it. We retry and boom schema or table is already there.

Similar issue was reported by @losipiuk in cleaning. But in this case we call delete, get timeout, retry and again no resource available as Hive made it's job in mean time.

On production systems timeout may be configured:


But in test it's hardcoded and set to 10s.
this.factory = new DefaultThriftMetastoreClientFactory(Optional.empty(), socksProxy, TIMEOUT, AUTHENTICATION, "localhost");

We could either:

  • handle read timeout on create, delete
  • allow configuring timeout in tests
  • or extend timeout for all tests.

I like last one as from error nature it seems we are on the edge of this timeout. As retry is pretty fast and we get already exists which suggest that Hive already did it's job.

CC: @losipiuk @findepi @findinpath

@losipiuk
Copy link
Member Author

Thanks @aczajkowski for looking into that!
Let's bump the timeout in tests 2x for now as you suggested.
Other than that we can also make a change to RetryDriver so we can configure that for some exception types logging is done on warn (by default retries are logged on debug), but I do not feel strongly.

@aczajkowski
Copy link
Member

@losipiuk PR: #11592

@aczajkowski
Copy link
Member

Haven see this error for a while. Any one hit that since changing HMS read timeout ?
CC: @losipiuk @findepi @Praveen2112

@losipiuk
Copy link
Member Author

I did not see it neither. Let's close. thx

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working test
Development

No branches or pull requests

4 participants