Implement predicate pushdown for table functions #17928
Conversation
Force-pushed b08f7f2 to dc0bbf0
Force-pushed e76dc84 to 50f910d
Failure is not related: #17436
thanks, restarted
Could you rebase on master to resolve conflicts?
Force-pushed ccdf68c to 287854c
some high-level design comments
@@ -454,6 +455,7 @@ public PlanOptimizers(
        new PruneOrderByInAggregation(metadata),
        new RewriteSpatialPartitioningAggregation(plannerContext),
        new SimplifyCountOverConstant(plannerContext),
        new PushFilterIntoTableFunction(plannerContext, typeAnalyzer),
Why add it here, and not somewhere else? For example, new PushPredicateIntoTableScan occurs 7 times in this file, and intuitively it's not clear why table functions should be treated differently from table scans. In fact, one could model table scans as table functions.
Because it is the only set that contains ImplementTableFunctionSource, so I thought this is the only place where this is needed. Is this incorrect?
if we follow the analogy between table functions and table scans, in how many places are table scans added into the plan?
I have no idea how to check this :|
/**
 * An area to store all information necessary to execute the table function, gathered at analysis time
 */
@Experimental(eta = "2022-10-31")
public interface ConnectorTableFunctionHandle
{
    default Map<String, ColumnHandle> getColumnHandles()
I guess it's not without a design thought that Connector*Handle classes are generally marker interfaces (they have no methods). Also, the relation described by the function is already known to the engine (TableFunctionAnalysis.returnedType), so the engine doesn't need to ask for it again.
Please remove this method.
        return Map.of();
    }
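The marker-interface convention the reviewer refers to can be illustrated with a small self-contained sketch (the type names below are hypothetical stand-ins, not Trino's real SPI classes): the handle interface declares no methods, and each connector implementation simply carries its own state, which the engine treats as an opaque token.

```java
// Marker interface in the style of Trino's Connector*Handle types: no methods.
interface HandleSketch {}

// A connector-specific implementation just carries its own state; the engine
// never interrogates the handle, it only passes it back to the same connector.
// (Names here are illustrative, not the actual Delta Lake classes.)
record DeltaTableChangesHandle(String tableName, long sinceVersion) implements HandleSketch {}
```

The engine can then hold a `HandleSketch` without depending on any connector internals, which is exactly why adding default methods to such an interface goes against the convention.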
    default boolean supportsPredicatePushdown()
Why would this be needed? ConnectorMetadata.applyFilter(ConnectorSession, ConnectorTableHandle, Constraint) is invoked without first asking whether "pushdown is supported", and this lets the implementation decide whether this particular pushdown is supported. Of course this results in a bit of redundant computation. Redundant computation can be addressed, but for regular table scans it is definitely much more important than for table functions (based solely on usage frequency). I don't see a reason to do things differently (presumably smarter) for table functions.
please remove this method
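The applyFilter convention the reviewer describes can be sketched with simplified stand-in types (not Trino's actual SPI): the engine calls the method unconditionally, and "pushdown not supported" is expressed by returning Optional.empty() rather than by a separate capability method.

```java
import java.util.Optional;

// Simplified stand-ins for the SPI types; names are illustrative, not Trino's real classes.
class FilterSketch
{
    record TableFunctionHandle(String name, String pushedPredicate) {}

    // The engine invokes applyFilter without asking first whether pushdown is
    // supported; declining is signalled by Optional.empty(), not by a
    // supportsPredicatePushdown() method.
    static Optional<TableFunctionHandle> applyFilter(TableFunctionHandle handle, String predicate)
    {
        if (!handle.name().equals("table_changes")) {
            // This function cannot use the predicate; decline the pushdown.
            return Optional.empty();
        }
        // Accept the pushdown by returning a new handle that captures the predicate.
        return Optional.of(new TableFunctionHandle(handle.name(), predicate));
    }
}
```

The cost of this pattern is one no-op optimizer call for functions that never push anything down, which is the "bit of redundant computation" the comment mentions.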
        this.enforcedConstraint = null;
    }
    public TableFunctionProcessorNode(
In this project we have a rule that classes have a single constructor responsible for initialization, i.e. all other constructors should delegate to one chosen constructor. (I see that TableScanNode deviates from the rule and should be fixed.)
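The single-constructor rule can be shown with a minimal sketch (illustrative class, not the actual TableFunctionProcessorNode): every convenience constructor delegates via this(...), so validation lives in exactly one place.

```java
// Sketch of the "one canonical constructor" rule: all other constructors
// delegate, so argument checks are never duplicated.
class NodeSketch
{
    private final String id;
    private final String constraint; // may be null, e.g. on workers

    // The single constructor responsible for initialization and validation.
    NodeSketch(String id, String constraint)
    {
        if (id == null) {
            throw new IllegalArgumentException("id is null");
        }
        this.id = id;
        this.constraint = constraint;
    }

    // Convenience constructor: delegates instead of repeating the checks.
    NodeSketch(String id)
    {
        this(id, null);
    }

    String constraint()
    {
        return constraint;
    }
}
```

With delegation, fixing or extending the validation (as discussed below for the pre-partitioned/pre-sorted checks) only ever touches one constructor body.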
@@ -211,6 +268,14 @@ public List<Symbol> getOutputSymbols()
        return symbols.build();
    }
@Nullable |
I see checkState(enforcedConstraint != null, ...) inside the method.
There were two constructors: one assumed this would never be null, and the second just set it to null. I will change it so there is one constructor and the field can be null.
        Set<Symbol> partitionBy = specification
                .map(DataOrganizationSpecification::getPartitionBy)
                .map(ImmutableSet::copyOf)
                .orElse(ImmutableSet.of());
        checkArgument(partitionBy.containsAll(prePartitioned), "all pre-partitioned symbols must be contained in the partitioning list");
        this.preSorted = preSorted;
        checkArgument(
                specification
                        .flatMap(DataOrganizationSpecification::getOrderingScheme)
                        .map(OrderingScheme::getOrderBy)
                        .map(List::size)
                        .orElse(0) >= preSorted,
                "the number of pre-sorted symbols cannot be greater than the number of all ordering symbols");
        checkArgument(preSorted == 0 || partitionBy.equals(prePartitioned), "to specify pre-sorted symbols, it is required that all partitioning symbols are pre-partitioned");
I don't think I would want to maintain two separate copies of this logic in two independent constructors.
@@ -70,6 +76,9 @@ public class TableFunctionProcessorNode

    private final TableFunctionHandle handle;

    @Nullable // null on workers
    private final TupleDomain<ColumnHandle> enforcedConstraint;
I am not convinced ColumnHandle should be the TupleDomain key here. TableFunctionAnalysis.returnedType describes the relation in terms of Fields with optional names, so they would be positionally discernible (TupleDomain<Integer>...). The properOutputs field captures the symbols produced by the function, and we could use that (TupleDomain<Symbol>).
I think this should be TupleDomain<Symbol>.
It will be TupleDomain<Integer>.
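The positional-key idea can be sketched with a drastically simplified stand-in for TupleDomain (not the real Trino class): because the function's returned type lists its fields in order, an Integer column index is enough to identify each column, with an absent key meaning "unconstrained".

```java
import java.util.Map;

// Simplified sketch of keying a constraint by column position (Integer) rather
// than by ColumnHandle. Not Trino's TupleDomain: domains here are just
// inclusive long ranges, which is enough to show the keying scheme.
class ConstraintSketch
{
    record Range(long low, long high) {}

    // enforced: column index -> allowed range; an absent index is unconstrained.
    static boolean matches(Map<Integer, Range> enforced, long[] row)
    {
        return enforced.entrySet().stream()
                .allMatch(entry -> row[entry.getKey()] >= entry.getValue().low()
                        && row[entry.getKey()] <= entry.getValue().high());
    }
}
```

Keying by position sidesteps the question of whether a table function even has ColumnHandles, which is the reviewer's underlying concern.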
Force-pushed 287854c to 3dd2d90
Force-pushed 3dd2d90 to aeb2a98
Force-pushed aeb2a98 to 8af4978
We don't want to introduce infrastructure to support pushing down arbitrary operations into table functions. Such an approach would force us into a matrix of pushdown capabilities and require implementations of all the applyXXX methods for TableHandle and for TableFunctionHandle, plus corresponding optimizers.
We already have a mechanism for this. A table function that's at the leaf of the plan (i.e., takes no inputs) and wants to behave as a table and participate in pushdown can implement the applyTableFunction call and then rely on applyXXX for TableHandle.
We definitely didn't want to squeeze everything into the TableHandle abstraction (and then have a matrix of pushdown-related ifs in page sources); that's why the initial CDF PTF implementation waited for the PTF execution support. But we definitely can go that route.
@martint my approach was suggested by @kasiafi, who implemented most of the table functions infrastructure; it is also described in a … I understand your concern: it is not great to have all those applyXXX functions, but I am afraid table functions won't be useful if they aren't performant.
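The applyTableFunction route described above can be sketched with simplified stand-in types (illustrative only, not Trino's SPI): a leaf table function is first rewritten into an ordinary table handle, after which the existing table pushdown path (applyFilter etc.) applies unchanged.

```java
import java.util.Optional;

// Self-contained sketch of the two-step mechanism: a leaf table function opts
// into pushdown by becoming a table, then reuses the regular table pushdown.
// Type and function names are illustrative, not the real SPI.
class LeafFunctionSketch
{
    record TableFunctionHandle(String functionName) {}
    record TableHandle(String tableName, String predicate) {}

    // Step 1: applyTableFunction rewrites a leaf (no-input) function into a table.
    static Optional<TableHandle> applyTableFunction(TableFunctionHandle handle)
    {
        if (handle.functionName().equals("table_changes")) {
            return Optional.of(new TableHandle("cdf$table_changes", null));
        }
        // Functions that take inputs cannot act as a table; decline the rewrite.
        return Optional.empty();
    }

    // Step 2: the ordinary table pushdown path applies to the resulting handle.
    static Optional<TableHandle> applyFilter(TableHandle handle, String predicate)
    {
        return Optional.of(new TableHandle(handle.tableName(), predicate));
    }
}
```

The point of this design is that no new applyXXX matrix is needed for TableFunctionHandle: once step 1 succeeds, every existing TableHandle optimization works for free.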
This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua
Description
Implement predicate pushdown for table functions; it is necessary to improve the performance of queries using table_changes on Delta Lake.
Additional context and related issues
Release notes
(x) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text: