Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ALTER TABLE EXECUTE + Hive OPTIMIZE (explicit TableExecuteNode) #9665

Merged
merged 8 commits into from
Oct 25, 2021

Conversation

losipiuk
Copy link
Member

@losipiuk losipiuk commented Oct 16, 2021

PR adds SPI and execution support for new table procedures.

New syntax extends ALTER TABLE family allowing for proper semantic
analysis of target table (unlike what we have with CALL when table
schema and table name are passed as string literals via table
arguments).

New syntax example:

  • ALTER TABLE <table> EXECUTE procedure
  • ALTER TABLE <table> EXECUTE procedure(value1, value2, ...)
  • ALTER TABLE <table> EXECUTE procedure(param1 => value1, param2 => value2, ...) WHERE ...

New table procedures allow for rewriting table data which makes them
feasible for implementing data cleansing routines like:

  • compacting small files into larger ones for HIVE table
  • changing files sorting or bucketing scheme for a table
  • Iceberg OPTIMIZE

Currently exectuion flow which does not rewrite table data is not
implemented. It will be implemented as a followup. Then current
procedures available via CALL will be migrated to new mechanism.

Procedures are exposed via connectors to engine via a set of new SPI
methods and classes:

  • io.trino.spi.connector.TableProcedureMetadata
  • io.trino.spi.connector.TableProcedureExecutionMode
  • io.trino.spi.connector.ConnectorTableExecuteHandle
  • io.trino.spi.connector.ConnectorMetadata#getTableHandleForExecute
  • io.trino.spi.connector.ConnectorMetadata#getLayoutForTableExecute
  • io.trino.spi.connector.ConnectorMetadata#beginTableExecute
  • io.trino.spi.connector.ConnectorMetadata#finishTableExecute

@cla-bot cla-bot bot added the cla-signed label Oct 16, 2021
@losipiuk losipiuk force-pushed the lo/distributed-dml-3 branch 4 times, most recently from d2629fe to 2851668 Compare October 17, 2021 08:58
@losipiuk losipiuk changed the title Lo/distributed dml 3 Draft: ALTER TABLE ANALYZE + Hive OPTIMIZE (explicit TableExecuteNode) Oct 17, 2021
@losipiuk losipiuk force-pushed the lo/distributed-dml-3 branch from 2851668 to 39a35ed Compare October 17, 2021 09:37
@losipiuk losipiuk marked this pull request as ready for review October 19, 2021 08:01
@losipiuk losipiuk changed the title Draft: ALTER TABLE ANALYZE + Hive OPTIMIZE (explicit TableExecuteNode) ALTER TABLE EXECUTE + Hive OPTIMIZE (explicit TableExecuteNode) Oct 19, 2021
Copy link
Member

@findepi findepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lots of editorials, nothing material. good job

Comment on lines -310 to -315
PlanNode source = rewriteModifyTableScan(((SemiJoinNode) node).getSource(), handle);
return replaceChildren(node, ImmutableList.of(source, ((SemiJoinNode) node).getFilteringSource()));
}
if (node instanceof JoinNode) {
PlanNode source = rewriteModifyTableScan(((JoinNode) node).getLeft(), handle);
return replaceChildren(node, ImmutableList.of(source, ((JoinNode) node).getRight()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we required that updated TS is on the left side of Join / SemiJoin.
I think this is an important property. I don't imagine how we could get a wrong SemiJoin, but getting flipped Join is certainly possible. cc @kasiafi
Is it still validated somewhere? or are we going to hit

checkState(source.isPresent(), "UpdatablePageSource not set");
?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that is still enforced by logic in findTableScanHandleForDeleteOrUpdate


import static java.util.Objects.requireNonNull;

public final class TableExecuteHandle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this handle identify, exactly? It's not clear from the name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated javadoc a bit. Hope it is better now.

@@ -21,6 +21,7 @@
import io.trino.operator.OperationTimer.OperationTiming;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit should be folded into whichever commit introduce execution for ALTER TABLE EXECUTE. Otherwise, that other commit would be "broken".

In fact, I would squash all the commits related to adding the new feature into a single one before it gets merged. You can leave the unrelated changes separate (as long as they have value and stand on their own)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, I would squash all the commits related to adding the new feature into a single one before it gets merged

Yeah - that is the plan.

@jhlodin
Copy link
Contributor

jhlodin commented Oct 19, 2021

Docs PR here, can make additional updates as needed but otherwise waiting until this code is merged. #9682

@losipiuk losipiuk force-pushed the lo/distributed-dml-3 branch 3 times, most recently from f42d8f4 to fadee63 Compare October 21, 2021 10:21
@losipiuk
Copy link
Member Author

AC. I will squash commits that are not self-contained later today or tomorrow. @martint please let me know if you want to dig deeper into this one.

@losipiuk losipiuk force-pushed the lo/distributed-dml-3 branch 2 times, most recently from de2bbc2 to aed3887 Compare October 22, 2021 14:30
@losipiuk
Copy link
Member Author

Squashed implementation commmits and rebased on top of current master

@losipiuk losipiuk force-pushed the lo/distributed-dml-3 branch from aed3887 to 8469bd4 Compare October 22, 2021 14:32
Allow using different key than catalog name for storing properites
metadata. This is preparatory work for introducing table procedures,
where each table procedure, even from single catalog, may allow
different set of properties.
Commit adds SPI and execution support for new table procedures.

New syntax extends ALTER TABLE family allowing for proper semantic
analysis of target table (unlike what we have with CALL when table
schema and table name are passed as string literals via table
arguments).

New syntax example:
 * ALTER TABLE <table> EXECUTE
 * ALTER TABLE <table> EXECUTE(value1, value2, ...)
 * ALTER TABLE <table> EXECUTE(param1 => value1, param2 => value2, ...) WHERE ...

New table procedures allow for rewriting table data which makes them
feasible for implementing data cleansing routines like:
 * compacting small files into larger ones for HIVE table
 * changing files sorting or bucketing scheme for a table
 * Iceberg OPTIMIZE

Currently exectuion flow which _does not_ rewrite table data is not
implemented. It will be implemented as a followup. Then current
procedures available via `CALL` will be migrated to new mechanism.

Procedures are exposed via connectors to engine via a set of new SPI
methods and classes:
 * io.trino.spi.connector.TableProcedureMetadata
 * io.trino.spi.connector.TableProcedureExecutionMode
 * io.trino.spi.connector.ConnectorTableExecuteHandle
 * io.trino.spi.connector.ConnectorMetadata#getTableHandleForExecute
 * io.trino.spi.connector.ConnectorMetadata#getLayoutForTableExecute
 * io.trino.spi.connector.ConnectorMetadata#beginTableExecute
 * io.trino.spi.connector.ConnectorMetadata#finishTableExecute
Add support for compacting small files for non-transactional,
non-bucketed Hive tables.

ALTER TABLE xxxxx EXECUTE OPTIMIZE WITH(file_size_threshold = ...)
@losipiuk losipiuk force-pushed the lo/distributed-dml-3 branch from 8469bd4 to 57e48d2 Compare October 25, 2021 08:56
@losipiuk losipiuk merged commit 2fe7084 into trinodb:master Oct 25, 2021
@github-actions github-actions bot added this to the 364 milestone Oct 25, 2021
@losipiuk losipiuk mentioned this pull request Oct 26, 2021
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants