Fix table function execution without partitioning (v2) #21558

findepi · 2024-04-15T15:47:50Z

Previously, when a table function did not declare partitioning, it would run single-threaded and first buffer all data in memory, like one big WINDOW. After the change, the local execution processes input pages in a streaming fashion.
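
To illustrate the difference between the two behaviors, here is a minimal sketch that does not use any Trino classes. A page is modeled as a plain `int[]`, and `StreamingSketch`, `bufferedPeakPages` and `streamingPeakPages` are hypothetical names made up for this example. It only shows why buffering holds all pages at once while streaming holds one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names are Trino classes.
final class StreamingSketch
{
    // Buffering strategy: hold every input page until the input ends, then emit.
    // Returns the largest number of pages held in memory at once.
    static int bufferedPeakPages(List<int[]> input, Consumer<int[]> output)
    {
        List<int[]> buffer = new ArrayList<>();
        int peak = 0;
        for (int[] page : input) {
            buffer.add(page);
            peak = Math.max(peak, buffer.size());
        }
        buffer.forEach(output);
        return peak;
    }

    // Streaming strategy: hand each page to the consumer as soon as it arrives.
    // Only the current page is held, so the peak is at most one.
    static int streamingPeakPages(List<int[]> input, Consumer<int[]> output)
    {
        int peak = 0;
        for (int[] page : input) {
            peak = Math.max(peak, 1);
            output.accept(page);
        }
        return peak;
    }

    public static void main(String[] args)
    {
        List<int[]> pages = List.of(new int[] {1, 2}, new int[] {3}, new int[] {4, 5, 6});
        List<int[]> sink = new ArrayList<>();
        System.out.println("buffered peak pages: " + bufferedPeakPages(pages, sink::add));
        System.out.println("streaming peak pages: " + streamingPeakPages(pages, sink::add));
    }
}
```

With the three input pages in `main`, the buffering variant holds all three before emitting anything, while the streaming variant never holds more than one.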

Fixes #20398

Alternative to #21378, with different TableFunctionDataProcessor lifecycle. The implementation creates one TableFunctionDataProcessor per operator for streaming processing.
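
The lifecycle described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `RowCounter` is a hypothetical stand-in for a data processor (it is not the Trino SPI interface), and `Operator` is a toy class that creates exactly one processor in its constructor and feeds every page to it.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names are Trino classes.
final class ProcessorLifecycleSketch
{
    // Stand-in for a per-operator data processor.
    interface RowCounter
    {
        void process(int[] page);

        long finish();
    }

    static final class Operator
    {
        private final RowCounter processor;

        Operator(Supplier<RowCounter> factory)
        {
            // One processor is created for the whole lifetime of the operator.
            this.processor = factory.get();
        }

        void addInput(int[] page)
        {
            // Pages are forwarded as they arrive; nothing is buffered here.
            processor.process(page);
        }

        long finish()
        {
            return processor.finish();
        }
    }

    // Returns {total rows seen, number of processors created}.
    static long[] run(List<int[]> pages)
    {
        AtomicInteger created = new AtomicInteger();
        Operator operator = new Operator(() -> {
            created.incrementAndGet();
            return new RowCounter() {
                private long rows;

                @Override
                public void process(int[] page)
                {
                    rows += page.length;
                }

                @Override
                public long finish()
                {
                    return rows;
                }
            };
        });
        for (int[] page : pages) {
            operator.addInput(page);
        }
        return new long[] {operator.finish(), created.get()};
    }

    public static void main(String[] args)
    {
        long[] result = run(List.of(new int[] {1, 2}, new int[] {3}, new int[] {4, 5, 6}));
        System.out.println("rows=" + result[0] + ", processors created=" + result[1]);
    }
}
```

The point of the sketch is only that the processor's state (here, a row count) spans all pages seen by one operator, regardless of how many pages arrive.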

findepi · 2024-04-15T15:47:54Z

thanks @hovaesco @aalbu @mdesmet for pointing out the lifecycle aspect

findepi · 2024-04-16T19:13:09Z

failure is related.

hovaesco · 2024-04-23T10:37:37Z

I've tested the fix and it helps with #20398. However, when writing the data using a table function, it still does not distribute the data evenly, and there is no way to control the number of files being written.

findepi · 2024-04-23T11:08:24Z

> it still does not distribute the data evenly

between the nodes?

there is no "writer scaling" equivalent for table functions, right?

> there is no way to control the number of files being written.

do you mean control number of TF data processor instances?

hovaesco · 2024-04-23T11:54:16Z

> between the nodes?

between files. When using setSemantics(), data is distributed evenly.

> there is no "writer scaling" equivalent for table functions, right?

correct

> do you mean control number of TF data processor instances?

control the number of files. For setSemantics(), it could be done using the hive.target-max-file-size property.

findepi · 2024-04-23T11:59:23Z

( rebased to resolve conflicts, no other changes )

mdesmet · 2024-04-23T14:36:12Z

> do you mean control number of TF data processor instances?

This is indeed where the writers are created. But I guess this depends on the shape of the plan right before going into the table function operator?

findepi · 2024-04-25T12:06:51Z

Planned changes moved out to #21710

github-actions · 2024-05-16T17:04:57Z

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

Previously, when a table function did not declare partitioning, it would be globally distributed, but on a worker node it would run single-threaded and first buffer all data in memory, like one big WINDOW. After the change, the local execution processes input pages in a streaming fashion. This commit also fixes property derivations for the case where a table function is partitioned on an empty list of symbols (global grouping).

tbaeg · 2024-05-30T20:51:48Z

core/trino-main/src/main/java/io/trino/operator/function/StreamTableFunctionInput.java

+                    pagesIndex,
+                    0,
+                    pagesIndex.getPositionCount(),
+                    new TableFunctionDataProcessor()


Do we need to wrap in a new anonymous class?

Locally, I directly passed the tableFunction which also resolved the TestJsonTable failures for me.

findepi requested review from martint and kasiafi April 15, 2024 15:47

cla-bot bot added the cla-signed label Apr 15, 2024

findepi force-pushed the findepi/exclude-columns-streaming-with-lifecycle branch from 5b36e66 to a31bbfe Compare April 15, 2024 15:48

This was referenced Apr 15, 2024

Move table function operator to package #21559

Merged

Fix table function execution without partitioning #21378

Draft

findepi force-pushed the findepi/exclude-columns-streaming-with-lifecycle branch from a31bbfe to b358662 Compare April 23, 2024 11:58

findepi mentioned this pull request Apr 25, 2024

Improve planning for table function without partitioning #21710

Draft

findepi force-pushed the findepi/exclude-columns-streaming-with-lifecycle branch 3 times, most recently from e1bf212 to d762983 Compare April 25, 2024 12:14

github-actions bot added the stale label May 16, 2024

findepi added the stale-ignore label and removed the stale label May 17, 2024

findepi force-pushed the findepi/exclude-columns-streaming-with-lifecycle branch from d762983 to fa9fdf0 Compare May 23, 2024 14:23

findepi added 3 commits May 23, 2024 16:52

Run tests with more parallelism

d563421

During the migration from TestNG to JUnit, the `@TestInstance(PER_CLASS)` annotation was added, but it implies single-threaded execution. Restore previous parallelism: either add `@Execution(CONCURRENT)` or inherit it from the base class.

Rename interface to accommodate non-partitioned data

7b912be

findepi force-pushed the findepi/exclude-columns-streaming-with-lifecycle branch from fa9fdf0 to a892820 Compare May 24, 2024 20:14

github-actions bot added the iceberg and delta-lake labels May 24, 2024

github-actions bot added the hive label May 24, 2024

tbaeg reviewed May 30, 2024

tbaeg mentioned this pull request Jul 27, 2024

Allow default parallelism for table functions #22847

Closed
