Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document OPTIMIZE for Iceberg #10790

Merged
merged 4 commits into from
Feb 23, 2022
Merged

Conversation

findinpath
Copy link
Contributor

No description provided.

@cla-bot cla-bot bot added the cla-signed label Jan 25, 2022
@findinpath findinpath force-pushed the docs/iceberg-optimize branch 3 times, most recently from da9659c to cbbedbc Compare January 25, 2022 14:41
Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me have a look again after you updated.

docs/src/main/sphinx/connector/iceberg.rst Show resolved Hide resolved
ALTER TABLE EXECUTE
^^^^^^^^^^^^^^^^^^^

The connector supports the following commands for use with
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would simplify .. something like this:

The connector supports collapsing files in a table into fewer larger files with identical data. This operation improves performance and reduces disk usage.

All files with a size below the optional file_size_threshold parameter are collapsed. The default value for the threshold is 100MB:

.. code-block:: sql

ALTER TABLE test_table EXECUTE OPTIMIZE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did reword the paragraph with your suggestion, but kept the reference towards partitioning. I think it is important to mention that the collapsing takes place per partition in case of partitioned tables.

docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
@findinpath findinpath force-pushed the docs/iceberg-optimize branch from cbbedbc to 2f5b181 Compare January 26, 2022 06:39
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The connector offers :ref:`ALTER TABLE EXECUTE <alter-table-execute>`
``OPTIMIZE`` command for collapsing files which correspond to the
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi please do provide feedback in case that I'm wrong in the assumption that OPTIMIZE touches only the files which correspond to the current snapshot of the table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes it reads off of current snapshot
  2. talking about snapshots here is not necessary. Do you specify that UPDATE or SELECT operate on current snapshot? worse, it can be misunderstood, as if OPTIMIZE was replacing current snapshot, or rewriting the files in situ

^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The connector offers :ref:`ALTER TABLE EXECUTE <alter-table-execute>`
``OPTIMIZE`` command for collapsing files which correspond to the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. yes it reads off of current snapshot
  2. talking about snapshots here is not necessary. Do you specify that UPDATE or SELECT operate on current snapshot? worse, it can be misunderstood, as if OPTIMIZE was replacing current snapshot, or rewriting the files in situ

docs/src/main/sphinx/connector/iceberg.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/sql/alter-table.rst Outdated Show resolved Hide resolved
docs/src/main/sphinx/sql/alter-table.rst Outdated Show resolved Hide resolved
@findinpath findinpath force-pushed the docs/iceberg-optimize branch 2 times, most recently from 988ee44 to 4d24677 Compare January 27, 2022 13:20
@findinpath
Copy link
Contributor Author

talking about snapshots here is not necessary. Do you specify that UPDATE or SELECT operate on current snapshot?

@findepi think that because this command acts only on the current snapshot, it is worth mentioning this aspect. Not mentioning it could imply for a user not knowing the internals of the Iceberg connector that all the existing snapshots of the table are compacted.

@findinpath findinpath force-pushed the docs/iceberg-optimize branch from 4d24677 to c5b5734 Compare January 27, 2022 13:33
@findepi
Copy link
Member

findepi commented Jan 27, 2022

talking about snapshots here is not necessary. Do you specify that UPDATE or SELECT operate on current snapshot?

@findepi think that because this command acts only on the current snapshot, it is worth mentioning this aspect. Not mentioning it could imply for a user not knowing the internals of the Iceberg connector that all the existing snapshots of the table are compacted.

User not knowing internals of iceberg won't know & won't care about snapshots.
We should keep snapshot conversation as focused & isolated as possible.

@findinpath findinpath force-pushed the docs/iceberg-optimize branch from c5b5734 to 147a867 Compare January 31, 2022 06:57
@findinpath
Copy link
Contributor Author

User not knowing internals of iceberg won't know & won't care about snapshots.
We should keep snapshot conversation as focused & isolated as possible.

This is the current state that I have:

The connector offers :ref:ALTER TABLE EXECUTE <alter-table-execute>
optimize command for creating a new snapshot of the table
which contains identical data to the current snapshot with the
content merged into potentially fewer larger files.

@findepi I don't know how can I exclude the snapshot reference out of the description above without degrading the description of the command.

@findepi
Copy link
Member

findepi commented Jan 31, 2022

@findepi I don't know how can I exclude the snapshot reference out of the description above without degrading the description of the command.

How do we talk about inserts, updates and deletes?

@findinpath findinpath force-pushed the docs/iceberg-optimize branch from 147a867 to dfe48ea Compare February 3, 2022 15:29
@findinpath findinpath requested review from findepi and mosabua February 3, 2022 15:30
@findinpath
Copy link
Contributor Author

@findepi I took out "snapshot" from the documentation. Please do have another look on the PR.

@findepi
Copy link
Member

findepi commented Feb 7, 2022

@findinpath please rebase, there is a conflict.

Comment on lines 132 to 139
ALTER TABLE EXECUTE optimize
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The connector offers :ref:`ALTER TABLE EXECUTE <alter-table-execute>`
``optimize`` command for rewriting the current content of the specified
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At risk of overriding your work so far, we have a similar section that documents ALTER TABLE EXECUTE for the Hive connector: https://trino.io/docs/current/connector/hive.html#alter-table-execute

For the sake of consistency, we should make these as similar as possible. Are there any differences in implementation on ALTER TABLE EXECUTE on Hive vs on Iceberg?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optimize procedure is disabled by default, and can be enabled for a catalog with the <catalog name>.non_transactional_optimize_enabled session property.

This property is not required for the Iceberg connector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any differences in implementation on ALTER TABLE EXECUTE on Hive vs on Iceberg?

@findepi can you please answer this question? I am not familiar enough with the Hive implementation of OPTIMIZE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any differences in implementation on ALTER TABLE EXECUTE on Hive vs on Iceberg?

They are independent. Hive's is a dangerous, non-transactional operation that can, in case of failure, lead to corrupted table state.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any differences in implementation on ALTER TABLE EXECUTE on Hive vs on Iceberg?

They are independent. Hive's is a dangerous, non-transactional operation that can, in case of failure, lead to corrupted table state.

Would the content in https://trino.io/docs/current/connector/hive.html#alter-table-execute make sense here then, just without the warning note about it being a non-transactional operation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhlodin mostly, yes, except for minor wording #10790 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findinpath Can you update the contents of this commit to reuse the Hive documentation? Minus the note as mentioned, and with whatever other feedback there is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have aligned both Hive and Iceberg documentation regarding optimize command to follow the same structure.
I've taken the liberty of adding along the way a few more usage examples for the command in the Hive documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

@findinpath findinpath force-pushed the docs/iceberg-optimize branch from dfe48ea to 23c4665 Compare February 8, 2022 07:40
@findinpath findinpath force-pushed the docs/iceberg-optimize branch from 23c4665 to cc14ad0 Compare February 8, 2022 17:11
Copy link
Contributor

@jhlodin jhlodin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending comments

Comment on lines 144 to 153
The ``optimize`` command is used for rewriting the active content
of the specified table so that it is merged into potentially
fewer larger files.
In case that the table is partitioned, the data compaction
acts separately on each partition selected for optimization.
This operation improves read performance and reduces disk usage.

All files with a size below the optional ``file_size_threshold``
parameter (default value for the threshold is ``100MB``) are
merged:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this introduction is better than what we had before in Hive, can you apply this there as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I slightly did a rewording for Hive because the optimize operation applies only for non-transactional tables.

docs/src/main/sphinx/sql/alter-table.rst Outdated Show resolved Hide resolved
@findinpath findinpath force-pushed the docs/iceberg-optimize branch 2 times, most recently from 485714d to 7d61f68 Compare February 8, 2022 17:27
@jhlodin
Copy link
Contributor

jhlodin commented Feb 14, 2022

Hey all, are we ready to merge this PR?

@findinpath findinpath requested a review from findepi February 14, 2022 18:11
@findinpath findinpath force-pushed the docs/iceberg-optimize branch from 7d61f68 to 3f90332 Compare February 15, 2022 11:21
@findinpath findinpath requested a review from losipiuk February 15, 2022 11:21
@findinpath findinpath force-pushed the docs/iceberg-optimize branch from 3f90332 to db59b49 Compare February 15, 2022 11:27
@losipiuk losipiuk merged commit 1d74483 into trinodb:master Feb 23, 2022
@github-actions github-actions bot added this to the 372 milestone Feb 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants