-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flaky test BaseFailureRecoveryTest.testParallel
: "There should be no remaining tmp_trino tables that are queryable"
#16277
Comments
TestBigQueryQueryFailureRecoveryTest.testParallel
BaseFailureRecoveryTest.testParallel
|
I suspect a query id hash collision, let me see if changing the temp table name generation helps. |
At the moment since I can't reproduce this at will I had to look at the code and make guesses. The one thing which caught my attention is that we use a cc: @mwd410 for any ideas |
Found the problem, MySQL doesn't have a strongly consistent information schema so sometimes tables appear to exist even after they've been dropped. I verified this by checking if the table that the test complains was not cleaned up actually exists or not. In none of the cases I was able to find the table which the test claims was not cleaned up. I'll check if an |
Well, MySQL's information_schema seems to act horribly under contention. Here's some logs. 1st CaseTable appears to exist as seen from Trino. Note that it can take multiple retries from Trino for table to be seen as deleted.
Case 2Table seen as exists on Trino.
Possible solutions
Reducing concurrency also doesn't seem to help with the failure. Alternative testsInstead of observing the database state we can try to instead check whether we issued DROP TABLE statements for all temporary tables. This needs us to implement a generic mechanism to record queries regardless of connector (BaseJdbc vs others). @mwd410 @losipiuk Do you have any ideas how to address this? This will be a problem with any system which doesn't have a strongly consistent information_schema like MySQL, SingleStore and maybe others. |
Also, maybe we can disable the validation by default but keep the code in there so we can test the cleanup manually when adding FTE to some new connector only. |
I think that since it is remote database specific, we need to handle this by some abstraction that will be implemented per each remote database (connector). And in some cases where information_schema is very unreliable we need to accept that we could consider skipping that assertions. Hopefully |
Yeah - I agree. Maybe we can just expose |
@mwd410 can you look into this one? |
We could, instead of simply relying on assertThatThrownBy(() -> getQueryRunner().execute("SELECT 1 FROM %s WHERE 1 = 0".formatted(tempTableName)))
.hasMessageContaining("%s does not exist", tempTableName); Thoughts? |
I have opened #16499 to address this. |
@ebyhr - i merged my pull request the other day - have se seen any of these errors since then? If not can we resolve this issue? |
@mwd410 Thanks! Let's close and reopen if needed. |
@mwd410 saw this today here https://github.com/starburstdata/starburst-trino-plugins/pull/156/files#diff-b1ef0b3848be78431e2c376255c12d7b98cde03f985e75f00b0a331b5bd836af. That version should have your fix right? |
https://github.com/trinodb/trino/actions/runs/4509499706/jobs/7939611131?pr=16707
|
@mwd410 Could you please take a look? |
Still encountering this after latest fix
|
@raunaqmorarka - OK, I spent all day trying to reproduce this locally, and then looked closer at the output and found this. Documentation suggests it means the table's lock is already held by a different query, but I have no idea what that might be - any ideas? This will require more investigation next week, but at least we have a thread to pull on. Note: I also see the same error in the previous run that re-opened this issue for the same reason, so I think my previously added
|
Ah, according to this, I believe the solution is to lock the table ahead of dropping it. Apparently, by default, oracle does not wait for other transactions on a table to complete before acquiring its lock to drop it. I suspect this is caused by a race condition between injected errors while some other thread is inserting into the table (e.g. a JdbcPageSink), then we try to rollbackCreateTable while that other transaction is still running. Note: the above trace does include rollbackCreateTable. |
sorry for the delay on this - there was some review on my PR which I just got in moments ago. |
https://github.com/trinodb/trino/actions/runs/5141998837/jobs/9255271329?pr=17722
|
@mwd410 what's the status of this? |
There was hope it would get fixed with: #17575. cc: @mwd410 Apparently did not help.
|
hmm. i have one thread to pull on. looking |
Oh the perils of forgetting about auto commit. 🤦 |
🤞 |
Let me reopen because the test is still failing on master.
https://github.com/trinodb/trino/actions/runs/5203676867/jobs/9388677568 |
i'm kind of at a loss here. is it possibly that we are
whereas perhaps we should be
I don't have any other ideas. This doesn't seem to be behaving as documented or how i understand it should behave. Am i missing something? |
IDK what is happening here but maybe we can try to add some extra loggin (e.g. dump locks information before executing |
https://github.com/trinodb/trino/actions/runs/4279764588/jobs/7451040205
The text was updated successfully, but these errors were encountered: