From e5b789180906300816711a71c0aaeb7439917348 Mon Sep 17 00:00:00 2001
From: Terry Blessing
Date: Thu, 23 Jun 2022 17:02:06 -0500
Subject: [PATCH 1/8] domain-compaction-threshold description

---
 docs/src/main/sphinx/connector/delta-lake.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index 68121cd4a2b..5c48f5e881e 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -136,10 +136,10 @@ connector.
      - Description
      - Default
    * - ``delta.domain-compaction-threshold``
-     - Sets the number of transactions to act as threshold. Once reached the
-       connector initiates compaction of the underlying files and the delta
-       files. A higher compaction threshold means reading less data from the
-       underlying data source, but a higher memory and network consumption.
+     - Sets the number of transactions to act as a threshold. After reaching
+       the threshold, the connector initiates compacting a large IN or OR
+       clause into a min-max range predicate for pushdown into an ORC or
+       Parquet reader.
      - 100
    * - ``delta.max-outstanding-splits``
      - The target number of buffered splits for each table scan in a query,

From 106c2c8acad7ab6250452d688cdf71db6099c715 Mon Sep 17 00:00:00 2001
From: Terry Blessing
Date: Fri, 24 Jun 2022 11:30:44 -0500
Subject: [PATCH 2/8] review feedback

---
 docs/src/main/sphinx/connector/delta-lake.rst | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index 5c48f5e881e..3360eb018e0 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -136,10 +136,11 @@ connector.
      - Description
      - Default
    * - ``delta.domain-compaction-threshold``
-     - Sets the number of transactions to act as a threshold. After reaching
-       the threshold, the connector initiates compacting a large IN or OR
-       clause into a min-max range predicate for pushdown into an ORC or
-       Parquet reader.
+     - Minimum size of query predicates above which Trino starts compaction.
+       Some databases perform poorly when a large list of predicates is pushed
+       down to the data source. For optimization in that situation, Trino can
+       compact the large predicates. When necessary, adjust the threshold to
+       ensure a balance between performance and pushdown.
      - 100
    * - ``delta.max-outstanding-splits``
      - The target number of buffered splits for each table scan in a query,

From 8168f5de0971b5d5ac11b758578f0cf06c389421 Mon Sep 17 00:00:00 2001
From: Terry Blessing
Date: Mon, 27 Jun 2022 12:49:30 -0500
Subject: [PATCH 3/8] Manfred feedback, do not refer to some DBs

---
 docs/src/main/sphinx/connector/delta-lake.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index 3360eb018e0..816ae3ef138 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -137,9 +137,9 @@ connector.
      - Default
    * - ``delta.domain-compaction-threshold``
      - Minimum size of query predicates above which Trino starts compaction.
-       Some databases perform poorly when a large list of predicates is pushed
-       down to the data source. For optimization in that situation, Trino can
-       compact the large predicates. When necessary, adjust the threshold to
+       Pushing a large list of predicates down to the data source can
+       compromise performance. For optimization in that situation, Trino can
+       compact the large predicates. If necessary, adjust the threshold to
        ensure a balance between performance and pushdown.
      - 100
    * - ``delta.max-outstanding-splits``

From 252c72c8beb1d69c71dc81fcc7a03bc9bfbfc826 Mon Sep 17 00:00:00 2001
From: Terry Blessing
Date: Mon, 27 Jun 2022 14:21:55 -0500
Subject: [PATCH 4/8] Format option consistency

---
 docs/src/main/sphinx/connector/delta-lake.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index 816ae3ef138..4c9cfa89799 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -167,8 +167,8 @@ connector.
      - ``32MB``
    * - ``delta.max-split-size``
      - Sets the largest :ref:`prop-type-data-size` for a single read section
-       assigned to a worker after max-initial-splits have been processed. You
-       can also use the corresponding catalog session property
+       assigned to a worker after ``max-initial-splits`` have been processed.
+       You can also use the corresponding catalog session property
        ``.max_split_size``.
      - ``64MB``

From 6a05b47c3b608a3576326a2c0ea10d060ac601ee Mon Sep 17 00:00:00 2001
From: Terry Blessing
Date: Mon, 27 Jun 2022 14:33:33 -0500
Subject: [PATCH 5/8] fix punctuation

---
 docs/src/main/sphinx/connector/delta-lake.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index 4c9cfa89799..d544a93e81b 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -401,7 +401,7 @@ to register them::
    )
 
 Columns listed in the DDL, such as ``dummy`` in the preceeding example, are
-ignored. The table schema is read from the transaction log, instead. If the
+ignored. The table schema is read from the transaction log instead. If the
 schema is changed by an external system, Trino automatically uses the new
 schema.
@@ -459,7 +459,7 @@ Write operations are supported for tables stored on the following systems:
 
 Writes to :doc:`Amazon S3 ` and S3-compatible storage must be
 enabled with the ``delta.enable-non-concurrent-writes`` property. Writes to S3 can
-safely be made from multiple Trino clusters, however write collisions are not
+safely be made from multiple Trino clusters; however, write collisions are not
 detected when writing concurrently from other Delta Lake engines. You need to
 make sure that no concurrent data modifications are run to avoid data
 corruption.
@@ -479,7 +479,7 @@ Table statistics
 
 You can use :doc:`/sql/analyze` statements in Trino to populate the table
 statistics in Delta Lake. Number of distinct values (NDV)
-statistics are supported, while Minimum value, maximum value, and null value
+statistics are supported; while Minimum value, maximum value, and null value
 count statistics are not supported. The :doc:`cost-based optimizer
 ` then uses these statistics to improve
 query performance.
@@ -538,7 +538,7 @@ disable it for a session, with the :doc:`catalog session property
 ` ``extended_statistics_enabled`` set to ``false``.
 
 If a table is changed with many delete and update operation, calling ``ANALYZE``
-does not result in accurate statistics. To correct the statistics you have to
+does not result in accurate statistics. To correct the statistics, you have to
 drop the extended stats and analyze table again.
 
 Use the ``system.drop_extended_stats`` procedure in the catalog to drop the
@@ -590,7 +590,7 @@ this property is ``0s``. There is a minimum retention session property as well,
 
 Memory monitoring
 """""""""""""""""
-When using the Delta Lake connector you need to monitor memory usage on the
+When using the Delta Lake connector, you need to monitor memory usage on the
 coordinator. Specifically monitor JVM heap utilization using standard tools as
 part of routine operation of the cluster.
@@ -617,5 +617,5 @@ Following is an example result:
 node | trino-master
 object_name | io.trino.plugin.deltalake.transactionlog:type=TransactionLogAccess,name=delta
 
-In a healthy system both ``datafilemetadatacachestats.hitrate`` and
+In a healthy system, both ``datafilemetadatacachestats.hitrate`` and
 ``metadatacachestats.hitrate`` are close to ``1.0``.

From 9f8867157537285588984397b38d32c4457f7463 Mon Sep 17 00:00:00 2001
From: Terry Blessing
Date: Mon, 27 Jun 2022 14:37:16 -0500
Subject: [PATCH 6/8] minor edit

---
 docs/src/main/sphinx/connector/delta-lake.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index d544a93e81b..1f718ad57a0 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -140,7 +140,7 @@ connector.
        Pushing a large list of predicates down to the data source can
        compromise performance. For optimization in that situation, Trino can
        compact the large predicates. If necessary, adjust the threshold to
-       ensure a balance between performance and pushdown.
+       ensure a balance between performance and predicate pushdown.
      - 100
    * - ``delta.max-outstanding-splits``
      - The target number of buffered splits for each table scan in a query,

From 6a38289cd42890d42ae2babee26fb57ffd502f42 Mon Sep 17 00:00:00 2001
From: Terry Blessing
Date: Mon, 27 Jun 2022 15:58:19 -0500
Subject: [PATCH 7/8] minor edits

---
 docs/src/main/sphinx/connector/delta-lake.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index 1f718ad57a0..f394b8042a7 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -32,7 +32,7 @@ metastore configuration properties as the :doc:`Hive connector
 `. At a minimum, ``hive.metastore.uri`` must be configured.
 
 The connector recognizes Delta tables created in the metastore by the Databricks
-runtime. If non-Delta tables are present in the metastore, as well, they are not
+runtime. If non-Delta tables are present in the metastore as well, they are not
 visible to the connector.
 
 To configure the Delta Lake connector, create a catalog properties file, for
@@ -479,7 +479,7 @@ Table statistics
 
 You can use :doc:`/sql/analyze` statements in Trino to populate the table
 statistics in Delta Lake. Number of distinct values (NDV)
-statistics are supported; while Minimum value, maximum value, and null value
+statistics are supported; whereas minimum value, maximum value, and null value
 count statistics are not supported. The :doc:`cost-based optimizer
 ` then uses these statistics to improve
 query performance.

From 829881f7b2f5c09cd21954af7080dd63fc921c4b Mon Sep 17 00:00:00 2001
From: "Terry L. Blessing"
Date: Mon, 27 Jun 2022 17:15:29 -0500
Subject: [PATCH 8/8] Update delta-lake.rst

fine-tune merge conflict
---
 docs/src/main/sphinx/connector/delta-lake.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index 8ee2f710112..8c091b12560 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -519,7 +519,7 @@ Table statistics
 ^^^^^^^^^^^^^^^^
 
 You can use :doc:`/sql/analyze` statements in Trino to populate the table
-statistics in Delta Lake. Number of distinct values (NDV)
+statistics in Delta Lake. Data size and number of distinct values (NDV)
 statistics are supported; whereas minimum value, maximum value, and null value
 count statistics are not supported. The :doc:`cost-based optimizer
 ` then uses these statistics to improve