Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement predicate pushdown for table functions #17928

Conversation

homar
Copy link
Member

@homar homar commented Jun 16, 2023

Description

Implement predicate pushdown for table functions - it is necessary to improve performance of queries using table_changes on delta lake

Additional context and related issues

Release notes

(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jun 16, 2023
@github-actions github-actions bot added delta-lake Delta Lake connector hudi Hudi connector labels Jun 16, 2023
@homar homar force-pushed the homar/add_possiblity_to_implement_predicate_pushdown_in_ptf branch 5 times, most recently from b08f7f2 to dc0bbf0 Compare June 19, 2023 08:25
@homar homar changed the title initial Implement predicate pushdown for table functions Jun 19, 2023
@homar homar marked this pull request as ready for review June 26, 2023 09:56
@homar homar requested a review from alexjo2144 June 26, 2023 12:35
@homar homar force-pushed the homar/add_possiblity_to_implement_predicate_pushdown_in_ptf branch from e76dc84 to 50f910d Compare July 3, 2023 10:48
@homar
Copy link
Member Author

homar commented Jul 13, 2023

Failure is not related: #17436

@findepi
Copy link
Member

findepi commented Jul 18, 2023

Failure is not related: #17436

thanks, restarted

@ebyhr
Copy link
Member

ebyhr commented Jul 31, 2023

Could you rebase on master to resolve conflicts? CountingAccessMetadata was removed in the recent commit.

@homar homar force-pushed the homar/add_possiblity_to_implement_predicate_pushdown_in_ptf branch 2 times, most recently from ccdf68c to 287854c Compare August 1, 2023 09:40
Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some high level design comments

@@ -454,6 +455,7 @@ public PlanOptimizers(
new PruneOrderByInAggregation(metadata),
new RewriteSpatialPartitioningAggregation(plannerContext),
new SimplifyCountOverConstant(plannerContext),
new PushFilterIntoTableFunction(plannerContext, typeAnalyzer),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add it here? and not somewhere else?

for example new PushPredicateIntoTableScan occurs 7 times in this file and intuitiviely it's not clear why table functions should be treated differently from table scans. In fact, one could model table scans as table functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it is the only set that contains ImplementTableFunctionSource so I thought this is the only place when this is needed. Is this incorrect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we follow the analogy between table functions and table scans, in how many places are table scans added into the plan?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea how to check this :|


/**
* An area to store all information necessary to execute the table function, gathered at analysis time
*/
@Experimental(eta = "2022-10-31")
public interface ConnectorTableFunctionHandle
{
default Map<String, ColumnHandle> getColumnHandles()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess it's not without a design thought that Connector*Handle classes are generally marker interfaces (have no methods)

also, the relation described by the function is known to the engine (TableFunctionAnalysis.returnedType), to the engine doesn't need to ask for it again

please remove this method

return Map.of();
}

default boolean supportsPredicatePushdown()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would this be needed?

ConnectorMetadata.applyFilter(ConnectorSession, ConnectorTableHandle, Constraint) is invoked without asking first whether "pushdown is supported" and this lets the implementation consider whether this particular pushdown is supported. Of course this results in a bit of redundant computation. Redundant computation can be addressed, but for regular table scans is definitely much more important than for table functions (based solely on usage frequency). I don't see a reason to do differently (presumably smarter) for table functions.

please remove this method

this.enforcedConstraint = null;
}

public TableFunctionProcessorNode(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this project we have a rule that classes have a single constructor responsible for initialization.
I.e. all other constructors should delegate to one chosen constructor.

(i see that TableScanNode deviates from the rule and should be fixed)

@@ -211,6 +268,14 @@ public List<Symbol> getOutputSymbols()
return symbols.build();
}

@Nullable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see checkState(enforcedConstraint != null, inside the method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there were 2 constructors, one assumed this will never be null and second just it to null, i will change it so there is one constructor and the field can be null

Comment on lines 157 to 154
Set<Symbol> partitionBy = specification
.map(DataOrganizationSpecification::getPartitionBy)
.map(ImmutableSet::copyOf)
.orElse(ImmutableSet.of());
checkArgument(partitionBy.containsAll(prePartitioned), "all pre-partitioned symbols must be contained in the partitioning list");
this.preSorted = preSorted;
checkArgument(
specification
.flatMap(DataOrganizationSpecification::getOrderingScheme)
.map(OrderingScheme::getOrderBy)
.map(List::size)
.orElse(0) >= preSorted,
"the number of pre-sorted symbols cannot be greater than the number of all ordering symbols");
checkArgument(preSorted == 0 || partitionBy.equals(prePartitioned), "to specify pre-sorted symbols, it is required that all partitioning symbols are pre-partitioned");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think i would want to maintain two separate copies of this logic in two independent constructors.

@@ -70,6 +76,9 @@ public class TableFunctionProcessorNode

private final TableFunctionHandle handle;

@Nullable // null on workers
private final TupleDomain<ColumnHandle> enforcedConstraint;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am not convinced ColumnHandle should be the TupleDomain key here.

  • TableFunctionAnalysis.returnedType describes the relation in terms of Fields with optional names so they would be positionally discernable (TupleDomain<Integer> ...)
  • properOutputs field captures the symbols produced by the function, and we could use that ( TupleDomain<Symbol>)

i think this should be TupleDomain<Symbol>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be TupleDomain<Integer>

@homar homar force-pushed the homar/add_possiblity_to_implement_predicate_pushdown_in_ptf branch from 287854c to 3dd2d90 Compare August 5, 2023 09:13
@github-actions github-actions bot added tests:hive iceberg Iceberg connector hive Hive connector labels Aug 5, 2023
@github-actions github-actions bot added bigquery BigQuery connector mongodb MongoDB connector labels Aug 5, 2023
@homar homar force-pushed the homar/add_possiblity_to_implement_predicate_pushdown_in_ptf branch from 3dd2d90 to aeb2a98 Compare August 5, 2023 19:29
@homar homar requested a review from findepi August 5, 2023 21:00
@homar homar force-pushed the homar/add_possiblity_to_implement_predicate_pushdown_in_ptf branch from aeb2a98 to 8af4978 Compare August 6, 2023 21:32
Copy link
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't want to introduce infrastructure to support pushing down arbitrary operations into table functions. Such an approach will force us into a matrix of pushdown capabilities and require implementations of all the applyXXX for TableHandle and for TableFunctionHandle, and corresponding optimizers.

We already have a mechanism for this. A table function that's at the leaf of the plan (i.e., takes no inputs) and wants to behave as a table and participate in pushdown can implement the applyTableFunction call and then rely on applyXXX for TableHandle.

@findepi
Copy link
Member

findepi commented Aug 10, 2023

A table function that's at the leaf of the plan (i.e., takes no inputs) and wants to behave as a table and participate in pushdown can implement the applyTableFunction call and then rely on applyXXX for TableHandle.

We definitely didn't want to squeeze everything into TableHandle abstraction (and that have a matrix of pushdown-related ifs and in page sources), that's why initial CDF PTF implementation waited for the PTF execution support. But we definitely can go that route.
That makes leaf TF pretty limited though. Even as simple table functions as sequence() could eventually want to leverage predicate pushdown (eg if join sequence in my query with something and engine can derive some predicates from that). Would that mean practically no leaf TF should be implemented using leaf TF execution support?

@homar
Copy link
Member Author

homar commented Aug 10, 2023

We don't want to introduce infrastructure to support pushing down arbitrary operations into table functions. Such an approach will force us into a matrix of pushdown capabilities and require implementations of all the applyXXX for TableHandle and for TableFunctionHandle, and corresponding optimizers.

We already have a mechanism for this. A table function that's at the leaf of the plan (i.e., takes no inputs) and wants to behave as a table and participate in pushdown can implement the applyTableFunction call and then rely on applyXXX for TableHandle.

@martint my approach was suggested by @kasiafi who implemented most of the table functions infrastructure - also it is described in a Polymorphic table functions's follow up section here and no concerns were raised there.

I understand your concern, it is not great to have all those applyXXX functions but I am afraid table functions won't be useful if they aren't performant.
If you have another idea how to improve performance of table functions please share it.

Copy link

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jan 15, 2024
@mosabua
Copy link
Member

mosabua commented Jan 15, 2024

@homar @findepi @martint is this still in progress and discussion?

@homar
Copy link
Member Author

homar commented Jan 16, 2024

@homar @findepi @martint is this still in progress and discussion?

I am afraid that @martint decided that this is not a way to go so there is nothing I can do here. I am closing it

@homar homar closed this Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bigquery BigQuery connector cla-signed delta-lake Delta Lake connector hive Hive connector hudi Hudi connector iceberg Iceberg connector mongodb MongoDB connector stale
Development

Successfully merging this pull request may close these issues.

5 participants