Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hide table layouts from engine #363

Merged
merged 15 commits into from
Mar 9, 2019
Merged

Hide table layouts from engine #363

merged 15 commits into from
Mar 9, 2019

Conversation

martint
Copy link
Member

@martint martint commented Mar 2, 2019

In preparation for https://github.com/prestosql/presto/wiki/Pushdown-of-complex-operations, this change removes support for multiple layouts. It's a feature that's not used by any known connector, complicates the work of the optimizer and just gets in the way. In the future, we may add this functionality again but in a different form.

@cla-bot cla-bot bot added the cla-signed label Mar 2, 2019
@martint martint changed the title [WIP] Hide table layouts from engine Hide table layouts from engine Mar 2, 2019
Copy link
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment, but otherwise looks good.

// Used during predicate refinement over multiple passes of predicate pushdown
// TODO: think about how to get rid of this in new planner
private final TupleDomain<ColumnHandle> currentConstraint;

private final TupleDomain<ColumnHandle> enforcedConstraint;

public static TableScanNode newInstance(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a strange design, which I think predates this change. Maybe add a comment on why this factory is different from the constructor with the same signature.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for pointing it out. I need to see if I can get rid of it. Without it, two of the constructors had the same shape (but ended up doing different things). I may be able to consolidate them now that we don’t have table layouts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out I can't. I tried making the static factory method the one that's used to deserialized from json, but it needs to set some fields to null, which is disallowed by the full constructor. Oh well.. I added a comment to clarify.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we used to have such constructor and it used to work. What was wrong with it?

Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please refresh my memory and explain why TableLayout* were introduced in the first place and why it no longer holds?


import static java.util.Objects.requireNonNull;

public final class TableHandle
{
private final ConnectorId connectorId;
private final ConnectorTableHandle connectorHandle;
private final ConnectorTransactionHandle transaction;
private final Optional<ConnectorTableLayoutHandle> layout;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it looks like to much of information to be stored in TableHandle. To me TableHandle represents an handle to a table, same way like a column handle. It is expected to be lightweight. What you have here is designed for very specific use case, hence maybe a new entity should be created for that.

Notice that you are planning to put even more information here (aggregations, joins, projections), the more you add here the less "table"-like it becomes, instead it becomes a more generic "relation" (no longer a table). Also notice that TableHandle is used for insert (perfect fit usage) and now with your changes it becomes questionable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it looks like to much of information to be stored in TableHandle. To me TableHandle represents an handle to a table, same way like a column handle. It is expected to be lightweight. What you have here is designed for very specific use case, hence maybe a new entity should be created for that.

Storing the TableLayout is only temporary. I plan to eventually get rid of them entirely, but didn't do so yet because it requires backward incompatible changes to the SPI.

But at the end of the day, yes, the plan is for TableHandle to potentially represent more than raw tables. The object doesn't need to be heavyweight -- it's up to the connector to decide what it holds on to internally (it could be just an id to an stored definition, etc). As to whether it becomes more "relation"-like, that's not much of a concern. In fact, the SQL spec treats everything as some form of a table:

A table is a collection of zero or more rows where each row is a sequence of one or more column values. 

[...] 

A table is either a base table, a derived table, or a transient table.

[...]

A derived table is a table derived directly or indirectly from one or more other tables by the evaluation of an expression, 
such as a <joined table>, <data change delta table>, <query expression>, or <table expression>. A <query expression> 
can contain an optional <order by clause>

Also notice that TableHandle is used for insert (perfect fit usage) and now with your changes it becomes questionable.

That's ok, because whether a table can be inserted into is decided at analysis time. So you'd never end up with a TableHandle to insert into that represents something that can't be inserted into. BTW, did you know the SQL spec allows for inserting into things that are more complex that simple tables? For example, there's an entire section describing "updatable views".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Anyway, I still think that TableHandle was a simple wrapper around ConnectorTableHandle and there are still places where it is used as such. Now, they will get a complex type, so they would need to at least check if the layout is empty there.

Storing the TableLayout is only temporary. I plan to eventually get rid of them entirely, but didn't do so yet because it requires backward incompatible changes to the SPI.

But there will be something else that will replace TableLayout. So TableHandle won't become any simpler.

So you'd never end up with a TableHandle to insert into that represents something that can't be inserted into.

Having separate a type would make expressive in the code that such case is impossible (without knowing the actual flow).

BTW, did you know the SQL spec allows for inserting into things that are more complex that simple tables?

Mind blowing. Anyway, I don't think that TableHandle that is suited for read will match that.

@martint
Copy link
Member Author

martint commented Mar 4, 2019

Can you please refresh my memory and explain why TableLayout* were introduced in the first place and why it no longer holds?

It's a feature we added a few years ago for a very specific use case at FB to allow connectors to provide multiple physical organizations of a table for the engine to use during optimization. No one else is using it. Since it makes it hard to introduce new features like complex operation pushdown, we're removing it for now. We may reintroduce it later in some other shape (materialized query tables/materialized views, etc).

@kokosing
Copy link
Member

kokosing commented Mar 4, 2019

Would you mind to sort commits, that way it will be easier to review.

@martint
Copy link
Member Author

martint commented Mar 4, 2019

Would you mind to sort commits, that way it will be easier to review.

done

Copy link
Member

@kokosing kokosing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for all commits but Hide TableLayouts from engine


import static java.util.Objects.requireNonNull;

public final class TableHandle
{
private final ConnectorId connectorId;
private final ConnectorTableHandle connectorHandle;
private final ConnectorTransactionHandle transaction;
private final Optional<ConnectorTableLayoutHandle> layout;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Anyway, I still think that TableHandle was a simple wrapper around ConnectorTableHandle and there are still places where it is used as such. Now, they will get a complex type, so they would need to at least check if the layout is empty there.

Storing the TableLayout is only temporary. I plan to eventually get rid of them entirely, but didn't do so yet because it requires backward incompatible changes to the SPI.

But there will be something else that will replace TableLayout. So TableHandle won't become any simpler.

So you'd never end up with a TableHandle to insert into that represents something that can't be inserted into.

Having separate a type would make expressive in the code that such case is impossible (without knowing the actual flow).

BTW, did you know the SQL spec allows for inserting into things that are more complex that simple tables?

Mind blowing. Anyway, I don't think that TableHandle that is suited for read will match that.

@@ -29,15 +29,22 @@

public class TableLayoutResult
{
private final TableHandle newTableHandle;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why newTableHanlde and not just tableHandle?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just to indicate that this is the new table handle to be used to refer to the result of pushing the predicates into the table scan. It's named that way for clarity, but it doesn't matter in the medium term since I'm planning to remove this class altogether once the replacement from https://github.com/prestosql/presto/wiki/Pushdown-of-complex-operations is in place and we give connectors some time to migrate.

{
this.newTableHandle = newTable;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rnn

@@ -90,20 +90,14 @@ public PickTableLayout(Metadata metadata, SqlParser parser)
{
return ImmutableSet.of(
checkRulesAreFiredBeforeAddExchangesRule(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this still required? I would replace with comment in PlanOptimizers. There is no reason to prevent this class to be invoked later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not. I'll remove it.

// Used during predicate refinement over multiple passes of predicate pushdown
// TODO: think about how to get rid of this in new planner
private final TupleDomain<ColumnHandle> currentConstraint;

private final TupleDomain<ColumnHandle> enforcedConstraint;

public static TableScanNode newInstance(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we used to have such constructor and it used to work. What was wrong with it?

@martint
Copy link
Member Author

martint commented Mar 5, 2019

But we used to have such constructor and it used to work. What was wrong with it?

Before, we had two constructors:

@JsonCreator
public TableScanNode(
        @JsonProperty("id") PlanNodeId id,
        @JsonProperty("table") TableHandle table,
        @JsonProperty("outputSymbols") List<Symbol> outputs,
        @JsonProperty("assignments") Map<Symbol, ColumnHandle> assignments,
        @JsonProperty("layout") Optional<TableLayoutHandle> tableLayout)

and

public TableScanNode(
        PlanNodeId id,
        TableHandle table,
        List<Symbol> outputs,
        Map<Symbol, ColumnHandle> assignments)

After removing table layouts, both constructors have the same signature, but do something different.

@martint
Copy link
Member Author

martint commented Mar 5, 2019

I see your point. Anyway, I still think that TableHandle was a simple wrapper around ConnectorTableHandle and there are still places where it is used as such. Now, they will get a complex type, so they would need to at least check if the layout is empty there.

It will still be a simple wrapper around ConnectorTableHandle once we finish removing table layouts from the SPI. But we can'd do that quite yet -- we need to 1) add the new APIs and mechanism for predicate pushdown 2) have a grace period for connectors to adapt (or do it automatically, if possible)

But there will be something else that will replace TableLayout. So TableHandle won't become any simpler.

No, there won't. It will be just TableHandles wrapping ConnectorTableHandles. It's up to connectors to decide what they need to track in a ConnectorTableHandle based on what can be pushed down into the connector.

Having separate a type would make expressive in the code that such case is impossible (without knowing the actual flow).

That's really hard to do in general. It forces entire parallel hierarchies and APIs to represent similar objects with different constraints (e.g., we'd need separate APIs to resolve a table during analysis for each possible use so we can get a table handle for insert, a table handle for delete, etc.). Also, where do we stop? Would we have one of each expression kind for each SQL type to prevent misuse? It's a fine balance, and I think the current concepts provide good separation between action (expression/plan node type), object (columns, tables, arguments) and attributes (physical or logical properties) without requiring a NxMxR explosion of valid combinations.

@martint martint force-pushed the tl-wip3 branch 2 times, most recently from 858edcb to 6773bb3 Compare March 7, 2019 03:54
@martint martint mentioned this pull request Mar 7, 2019
Copy link
Member

@sopel39 sopel39 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well done!

@@ -25,15 +25,22 @@

public class TableLayoutResult
{
private final TableHandle newTableHandle;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this field since we still have layout?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Layout doesn't have a reference to the ConnectorTableHandle that's needed to construct the new TableHandle. I could add that, but it seemed conceptually cleaner to do it this way.

martint added 2 commits March 7, 2019 09:47
More accurately, constructs a set of alternate plans with the
filter pushed into the table scan based on available table layouts.
@@ -144,12 +145,14 @@ public static RowExpression translate(
FunctionRegistry functionRegistry,
TypeManager typeManager,
Session session,
boolean optimize)
boolean optimize,
Map<Symbol, Integer> layout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put this below type

}
});

Block block = Utils.nativeValueToBlock(expectedType, result);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment // convert result from stack type to Type ObjectValue

martint added 5 commits March 7, 2019 10:59
This test relies on a layout being picked during planning. For
LocalQueryRunner this only happens because of a synthetic rule
(PickLayout w/o predicate) that attaches a layout to the table scan.
So, in a sense, it's just a coincidence that it works.

In distributed execution, the job is done by AddExchanges, so
we want to make sure we're testing that behavior.
It's just selecting a layout for the raw table scan, so
no need to go through the logic for pushing a predicate,
etc.
@martint martint force-pushed the tl-wip3 branch 2 times, most recently from 2eebec6 to f10d5f2 Compare March 8, 2019 21:24
martint added 8 commits March 8, 2019 19:51
This change hides table layouts from the engine as a first-class
concept. We keep the SPI as is for backward compatibility for now.

When predicates are pushed into a table scan by PickLayout (now
PushPredicateIntoTableScan) or AddExchanges, we now replace the
table handle associated with the table scan with a new one that
contains the reference to the ConnectorTableLayoutHandle under
the covers.
It now contains a single rule, so no point in
having it return a rule set.
In order to translate expression to row expressions, the
code was first replacing all symbol references with field
references for the corresponding ordinal inputs.

This is unnecessary, as the translation can be done on demand
as the expression is translated to a row expression.
The inferred type of the former expression is INTEGER, which
doesn't match the signature of combineHash function call.
They were only being used in tests. The engine no longer
relies on them for query execution.
There's no longer a conflict with analyzeExpressionsWithInputs so
simplify the name
There's only one caller, so no need for an extra indirection.
@martint martint merged commit 2a23e10 into trinodb:master Mar 9, 2019
@@ -252,10 +251,10 @@ public Object evaluate()
return visitor.process(expression, new NoPagePositionContext());
}

public Object evaluate(int position, Page page)
public Object evaluate(SymbolResolver inputs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this evaluate and the optimize now? If you provide SymbolResolver, using optimize would be just fine, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants