diff --git a/docs/src/main/sphinx/connector/delta-lake.rst b/docs/src/main/sphinx/connector/delta-lake.rst
index a01bf234267..8c091b12560 100644
--- a/docs/src/main/sphinx/connector/delta-lake.rst
+++ b/docs/src/main/sphinx/connector/delta-lake.rst
@@ -40,7 +40,7 @@ metastore configuration properties as the :doc:`Hive connector
 `. At a minimum, ``hive.metastore.uri`` must be configured.
 
 The connector recognizes Delta tables created in the metastore by the Databricks
-runtime. If non-Delta tables are present in the metastore, as well, they are not
+runtime. If non-Delta tables are present in the metastore as well, they are not
 visible to the connector.
 
 To configure the Delta Lake connector, create a catalog properties file, for
@@ -173,10 +173,11 @@ connector.
     - Description
     - Default
   * - ``delta.domain-compaction-threshold``
-    - Sets the number of transactions to act as threshold. Once reached the
-      connector initiates compaction of the underlying files and the delta
-      files. A higher compaction threshold means reading less data from the
-      underlying data source, but a higher memory and network consumption.
+    - Minimum size of query predicates above which Trino starts compaction.
+      Pushing a large list of predicates down to the data source can
+      compromise performance. For optimization in that situation, Trino can
+      compact the large predicates. If necessary, adjust the threshold to
+      ensure a balance between performance and predicate pushdown.
     - 100
   * - ``delta.max-outstanding-splits``
     - The target number of buffered splits for each table scan in a query,
@@ -203,8 +204,8 @@ connector.
     - ``32MB``
   * - ``delta.max-split-size``
     - Sets the largest :ref:`prop-type-data-size` for a single read section
-      assigned to a worker after max-initial-splits have been processed. You
-      can also use the corresponding catalog session property
+      assigned to a worker after ``max-initial-splits`` have been processed.
+      You can also use the corresponding catalog session property
       ``.max_split_size``.
     - ``64MB``
   * - ``delta.minimum-assigned-split-weight``
@@ -441,7 +442,7 @@ to register them::
    )
 
 Columns listed in the DDL, such as ``dummy`` in the preceeding example, are
-ignored. The table schema is read from the transaction log, instead. If the
+ignored. The table schema is read from the transaction log instead. If the
 schema is changed by an external system, Trino automatically uses the new
 schema.
 
@@ -499,7 +500,7 @@ Write operations are supported for tables stored on the following systems:
 
   Writes to :doc:`Amazon S3 ` and S3-compatible storage must be enabled
  with the ``delta.enable-non-concurrent-writes`` property. Writes to S3 can
-  safely be made from multiple Trino clusters, however write collisions are not
+  safely be made from multiple Trino clusters; however, write collisions are not
  detected when writing concurrently from other Delta Lake engines. You need to
  make sure that no concurrent data modifications are run to avoid data
  corruption.
@@ -519,7 +520,7 @@ Table statistics
 
 You can use :doc:`/sql/analyze` statements in Trino to populate the table
 statistics in Delta Lake. Data size and number of distinct values (NDV)
-statistics are supported, while Minimum value, maximum value, and null value
+statistics are supported, whereas minimum value, maximum value, and null value
 count statistics are not supported. The :doc:`cost-based optimizer
 ` then uses these statistics to improve query
 performance.
@@ -578,7 +579,7 @@ disable it for a session, with the :doc:`catalog session property
 ` ``extended_statistics_enabled`` set to ``false``.
 
 If a table is changed with many delete and update operation, calling ``ANALYZE``
-does not result in accurate statistics. To correct the statistics you have to
+does not result in accurate statistics. To correct the statistics, you have to
 drop the extended stats and analyze table again.
 
 Use the ``system.drop_extended_stats`` procedure in the catalog to drop the
@@ -630,7 +631,7 @@ this property is ``0s``. There is a minimum retention session property as well,
 
 Memory monitoring
 """""""""""""""""
 
-When using the Delta Lake connector you need to monitor memory usage on the
+When using the Delta Lake connector, you need to monitor memory usage on the
 coordinator. Specifically monitor JVM heap utilization using standard tools as
 part of routine operation of the cluster.
@@ -657,5 +658,5 @@ Following is an example result:
    node        | trino-master
    object_name | io.trino.plugin.deltalake.transactionlog:type=TransactionLogAccess,name=delta
 
-In a healthy system both ``datafilemetadatacachestats.hitrate`` and
+In a healthy system, both ``datafilemetadatacachestats.hitrate`` and
 ``metadatacachestats.hitrate`` are close to ``1.0``.
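
The statistics hunk above references the ``system.drop_extended_stats``
procedure. A minimal sketch of the correction workflow it describes, assuming a
hypothetical catalog named ``example`` and a table
``example_schema.example_table``, could look like the following Trino SQL::

    -- drop the stale extended statistics collected for the table
    CALL example.system.drop_extended_stats('example_schema', 'example_table');

    -- recollect statistics so the cost-based optimizer sees current data
    ANALYZE example.example_schema.example_table;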
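
The cache hit rates mentioned in the last hunk are exposed as JMX metrics.
Assuming the JMX connector is mounted as a catalog named ``jmx``, a query along
these lines retrieves the bean shown in the example result::

    -- inspect the Delta Lake transaction log cache statistics on each node
    SELECT *
    FROM jmx.current."io.trino.plugin.deltalake.transactionlog:type=transactionlogaccess,name=delta";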