Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential regression in spilled ORDER BY #16406

Closed
findepi opened this issue Mar 7, 2023 · 10 comments · Fixed by #16431
Closed

Potential regression in spilled ORDER BY #16406

findepi opened this issue Mar 7, 2023 · 10 comments · Fixed by #16431
Labels
bug Something isn't working RELEASE-BLOCKER

Comments

@findepi
Copy link
Member

findepi commented Mar 7, 2023

example from #16400 build https://github.com/trinodb/trino/actions/runs/4351418003/jobs/7603093117

Error:  Tests run: 3957, Failures: 2, Errors: 0, Skipped: 318, Time elapsed: 1,918.51 s <<< FAILURE! - in TestSuite
Error:  io.trino.tests.TestSpilledOrderByQueries.testOrderBy  Time elapsed: 0.165 s  <<< FAILURE!
java.lang.AssertionError: Execution of 'actual' query failed: SELECT orderstatus FROM orders ORDER BY orderstatus
	at org.testng.Assert.fail(Assert.java:83)
	at io.trino.testing.QueryAssertions.assertQuery(QueryAssertions.java:150)
	at io.trino.testing.QueryAssertions.assertQuery(QueryAssertions.java:106)
	at io.trino.testing.AbstractTestQueryFramework.assertQueryOrdered(AbstractTestQueryFramework.java:381)
	at io.trino.testing.AbstractTestQueryFramework.assertQueryOrdered(AbstractTestQueryFramework.java:371)
	at io.trino.testing.AbstractTestQueryFramework.assertQueryOrdered(AbstractTestQueryFramework.java:366)
	at io.trino.testing.AbstractTestOrderByQueries.testOrderBy(AbstractTestOrderByQueries.java:30)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:104)
	at org.testng.internal.Invoker.invokeMethod(Invoker.java:645)
	at org.testng.internal.Invoker.invokeTestMethod(Invoker.java:851)
	at org.testng.internal.Invoker.invokeTestMethods(Invoker.java:1177)
	at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:129)
	at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:112)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: io.trino.testing.QueryFailedException: Index (0) is greater than or equal to list size (0)
	at io.trino.testing.AbstractTestingTrinoClient.execute(AbstractTestingTrinoClient.java:122)
	at io.trino.testing.DistributedQueryRunner.execute(DistributedQueryRunner.java:489)
	at io.trino.testing.QueryAssertions.assertQuery(QueryAssertions.java:147)
	... 18 more
	Suppressed: java.lang.Exception: SQL: SELECT orderstatus FROM orders ORDER BY orderstatus
		at io.trino.testing.DistributedQueryRunner.execute(DistributedQueryRunner.java:492)
		... 19 more
Caused by: java.lang.IndexOutOfBoundsException: Index (0) is greater than or equal to list size (0)
	at it.unimi.dsi.fastutil.longs.LongArrayList.getLong(LongArrayList.java:267)
	at io.trino.operator.PagesIndex.buildPage(PagesIndex.java:306)
	at io.trino.operator.PagesIndex$2.computeNext(PagesIndex.java:631)
	at io.trino.operator.PagesIndex$2.computeNext(PagesIndex.java:624)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
	at io.trino.operator.WorkProcessorUtils.lambda$fromIterator$0(WorkProcessorUtils.java:103)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.testing.QueryAssertions.assertQuery(QueryAssertions.java:147)
	... 18 more
	Suppressed: java.lang.Exception: SQL: SELECT orderkey, custkey, orderstatus FROM orders ORDER BY nullif(orderkey, 3) ASC NULLS LAST, custkey ASC
		at io.trino.testing.DistributedQueryRunner.execute(DistributedQueryRunner.java:492)
		... 19 more
Caused by: java.lang.IndexOutOfBoundsException: Index (0) is greater than or equal to list size (0)
	at it.unimi.dsi.fastutil.longs.LongArrayList.getLong(LongArrayList.java:267)
	at io.trino.operator.PagesIndex.buildPage(PagesIndex.java:306)
	at io.trino.operator.PagesIndex$2.computeNext(PagesIndex.java:631)
	at io.trino.operator.PagesIndex$2.computeNext(PagesIndex.java:624)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
	at io.trino.operator.WorkProcessorUtils.lambda$fromIterator$0(WorkProcessorUtils.java:103)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$2.process(WorkProcessorUtils.java:126)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:262)
	at io.trino.operator.WorkProcessorUtils$YieldingProcess.process(WorkProcessorUtils.java:182)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$YieldingIterator.computeNext(WorkProcessorUtils.java:81)
	at io.trino.operator.WorkProcessorUtils$YieldingIterator.computeNext(WorkProcessorUtils.java:67)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
	at io.trino.operator.OrderByOperator.getOutput(OrderByOperator.java:275)
	at io.trino.operator.Driver.processInternal(Driver.java:394)
	at io.trino.operator.Driver.lambda$process$8(Driver.java:297)
	at io.trino.operator.Driver.tryWithLock(Driver.java:689)
	at io.trino.operator.Driver.process(Driver.java:289)
	at io.trino.operator.Driver.processForDuration(Driver.java:260)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:773)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:523)
	at io.trino.$gen.Trino_testversion____20230307_070445_4328.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
Error:  Failures: 
Error:    TestSpilledOrderByQueries>AbstractTestOrderByQueries.testOrderBy:30->AbstractTestQueryFramework.assertQueryOrdered:366->AbstractTestQueryFramework.assertQueryOrdered:371->AbstractTestQueryFramework.assertQueryOrdered:381 Execution of 'actual' query failed: SELECT orderstatus FROM orders ORDER BY orderstatus
Error:    TestSpilledOrderByQueries>AbstractTestOrderByQueries.testOrderByWithNulls:149->AbstractTestQueryFramework.assertQueryOrdered:366->AbstractTestQueryFramework.assertQueryOrdered:371->AbstractTestQueryFramework.assertQueryOrdered:381 Execution of 'actual' query failed: SELECT orderkey, custkey, orderstatus FROM orders ORDER BY nullif(orderkey, 3) ASC NULLS LAST, custkey ASC
[INFO] 
@findepi findepi added the bug Something isn't working label Mar 7, 2023
@findepi
Copy link
Member Author

findepi commented Mar 7, 2023

Marked release-blocker, because it looks like a regression after 409.

@findepi
Copy link
Member Author

findepi commented Mar 7, 2023

https://nineinchnick.github.io/trino-cicd/reports/flaky/ currently doesn't include "spilled" (like in TestSpilledOrderByQueries)

cc @nineinchnick

@nineinchnick
Copy link
Member

The annotation processing job has been failing recently, and we're relying on these annotations to create that report. We do have all the test reports downloaded from GHA artifacts, but there are no XML functions in Trino to process them. We'd have to create another job to convert them to JSON.

I just opened ScaCap/action-surefire-report#138 to address failures in annotation processing.

@aczajkowski
Copy link
Member

This error can be seen in man test's and in different combinations:

  • TestDistributedSpilledQueries.testOrderByWithAgregation
  • TestDistributedSpilledQueries.testPredicate
  • TestSpilledJoinQueries.testSemiJoin
  • TestSpilledOrderByQueries.testOrderByWithOutputColumnReference

Root cause stack trace is always the same

Caused by: java.lang.IndexOutOfBoundsException: Index (0) is greater than or equal to list size (0)
	at it.unimi.dsi.fastutil.longs.LongArrayList.getLong(LongArrayList.java:267)
	at io.trino.operator.PagesIndex.buildPage(PagesIndex.java:306)
	at io.trino.operator.PagesIndex$2.computeNext(PagesIndex.java:631)
	at io.trino.operator.PagesIndex$2.computeNext(PagesIndex.java:624)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
	at io.trino.operator.WorkProcessorUtils.lambda$fromIterator$0(WorkProcessorUtils.java:103)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$2.process(WorkProcessorUtils.java:126)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils.getNextState(WorkProcessorUtils.java:262)
	at io.trino.operator.WorkProcessorUtils$YieldingProcess.process(WorkProcessorUtils.java:182)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$3.process(WorkProcessorUtils.java:347)
	at io.trino.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:413)
	at io.trino.operator.WorkProcessorUtils$YieldingIterator.computeNext(WorkProcessorUtils.java:81)
	at io.trino.operator.WorkProcessorUtils$YieldingIterator.computeNext(WorkProcessorUtils.java:67)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:146)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:141)
	at io.trino.operator.OrderByOperator.getOutput(OrderByOperator.java:275)
	at io.trino.operator.Driver.processInternal(Driver.java:394)
	at io.trino.operator.Driver.lambda$process$8(Driver.java:297)
	at io.trino.operator.Driver.tryWithLock(Driver.java:689)
	at io.trino.operator.Driver.process(Driver.java:289)
	at io.trino.operator.Driver.processForDuration(Driver.java:260)
	at io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:773)
	at io.trino.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
	at io.trino.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:523)
	at io.trino.$gen.Trino_testversion____20230305_044615_29138.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)

@findepi
Copy link
Member Author

findepi commented Mar 7, 2023

I added some debug prints into OrderByOperator and run TestSpilledOrderByQueries locally (it fails quite consistently locally, failing method varies)

Example failure gave me the log (i've filtered this down to a single operator instance based in identity hash code)

OrderByOperator.OrderByOperator created pagesIndex 1620083669
OrderByOperator.getOutput
OrderByOperator.addInput
OrderByOperator.updateMemoryUsage
OrderByOperator.getOutput
OrderByOperator.addInput
OrderByOperator.updateMemoryUsage
OrderByOperator.getOutput
OrderByOperator.getOutput
OrderByOperator.startMemoryRevoke
OrderByOperator.finishMemoryRevoke
OrderByOperator (finishMemoryRevoke) pagesIndex.clear 1620083669
OrderByOperator.updateMemoryUsage
OrderByOperator.addInput
OrderByOperator.updateMemoryUsage
OrderByOperator.getOutput
OrderByOperator.addInput
OrderByOperator.updateMemoryUsage
OrderByOperator.getOutput
OrderByOperator.startMemoryRevoke
OrderByOperator.finishMemoryRevoke
OrderByOperator (finishMemoryRevoke) pagesIndex.clear 1620083669
OrderByOperator.updateMemoryUsage
OrderByOperator.addInput
OrderByOperator.updateMemoryUsage
OrderByOperator.getOutput
OrderByOperator.startMemoryRevoke
OrderByOperator.finishMemoryRevoke
OrderByOperator (finishMemoryRevoke) pagesIndex.clear 1620083669
OrderByOperator.updateMemoryUsage
OrderByOperator.addInput
OrderByOperator.updateMemoryUsage
OrderByOperator.getOutput
OrderByOperator.addInput
OrderByOperator.updateMemoryUsage
OrderByOperator.getOutput
OrderByOperator.getOutput
OrderByOperator.startMemoryRevoke
OrderByOperator.finish: still spilling
OrderByOperator.finish: state NEEDS_INPUT
OrderByOperator.finish: pageIndex.getSortedPages
OrderByOperator.finishMemoryRevoke
OrderByOperator (finishMemoryRevoke) pagesIndex.clear 1620083669
OrderByOperator.updateMemoryUsage
OrderByOperator.finish: state HAS_OUTPUT
OrderByOperator.getOutput
OrderByOperator.getOutput failed: java.lang.IndexOutOfBoundsException: Index (0) is greater than or equal to list size (0)

It's clear the operator spilled multiple time, the last one between finish and getOutput

OrderByOperator.startMemoryRevoke
...
OrderByOperator.finish: pageIndex.getSortedPages
OrderByOperator.finishMemoryRevoke
OrderByOperator (finishMemoryRevoke) pagesIndex.clear 1620083669
OrderByOperator.updateMemoryUsage
OrderByOperator.finish: state HAS_OUTPUT
OrderByOperator.getOutput

in finish we create iterator over PagesIndex ("pageIndex.getSortedPages"), but later, in finishMemoryRevoke we clear pagesIndex, invalidating the iterator.

As described in #16409, before #15575 such iterator would not fail (would return 0 pages).
After #15575, it fails.

Question: why is finish called before we finish spilling (finishMemoryRevoke)?
I think after startMemoryRevoke the next operation expected by the operator should be finishMemoryRevoke?
did i remember wrong?
cc @sopel39 @losipiuk

@findinpath
Copy link
Contributor

@ebyhr
Copy link
Member

ebyhr commented Mar 8, 2023

@findepi
Copy link
Member Author

findepi commented Mar 8, 2023

This will be fixed by #16411 or #16431

@kasiafi
Copy link
Member

kasiafi commented Mar 8, 2023

@findepi is this a potential correctness issue? could getOutput() potentially be called before the PagesIndex is cleared by finishMemoryRevoke? If so, we would have duplicate page on output, both from the PagesIndex and spilled.

@findepi
Copy link
Member Author

findepi commented Mar 8, 2023

I don't think so. Per

* processing methods on it (isBlocked/needsInput/addInput/getOutput) until
getOutput() cannot be called before finishMemoryRevoke is run.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working RELEASE-BLOCKER
6 participants