diff --git a/presto-docs/src/main/sphinx/admin.rst b/presto-docs/src/main/sphinx/admin.rst index eab299a0a392..0af2c6e6a3c0 100644 --- a/presto-docs/src/main/sphinx/admin.rst +++ b/presto-docs/src/main/sphinx/admin.rst @@ -7,5 +7,6 @@ Administration admin/web-interface admin/tuning + admin/properties admin/queue admin/resource-groups diff --git a/presto-docs/src/main/sphinx/admin/properties.rst b/presto-docs/src/main/sphinx/admin/properties.rst new file mode 100644 index 000000000000..29ac6c4abbed --- /dev/null +++ b/presto-docs/src/main/sphinx/admin/properties.rst @@ -0,0 +1,493 @@ +================= +Presto properties +================= + +This is a list and description of most important presto properties that may be used to tune Presto or alter it behavior when required. + + +.. _tuning-pref-general: + +General properties +------------------ + + +``distributed-joins-enabled`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``true`` + * **Description:** Use hash distributed joins instead of broadcast joins. Distributed joins require redistributing both tables using a hash of the join key. This can be slower (sometimes substantially) than broadcast joins but allows much larger joins. Broadcast joins require that the tables on the right side of the join after filtering fit in memory on each node whereas distributed joins only need to fit in distributed memory across all nodes. This can also be specified on a per-query basis using the ``distributed_join`` session property. + + +``redistribute-writes`` +^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``true`` + * **Description:** Turning on this property causes additional rehashing of data before writing them to connector. This eliminates performance impact of data skewness when writing to disk by distributing data before write (usually I/O operation). It can be disabled when it's known that data set is not skewed in order to save time on rehashing operation. This can also be specified on a per-query basis using the ``redistribute_writes`` session property. + + +``resources.reserved-system-memory`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``JVM max memory`` * ``0.4`` + * **Description:** Maximum amount of memory available to each Presto node. Reaching this limit will cause the server to drop operations. Higher value may increase Presto's stability, but may cause problems if physical server is used for other purposes. If too much memory is allocated to Presto, the operating system may terminate the process. + + +``sink.max-buffer-size`` +^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``32 MB`` + * **Description:** Buffer size for IO writes while collecting pipeline results from cluster node. Increasing this value may improve the speed of IO operations, but will take memory away from other functions. Buffered data will be lost if the node crashes, so using a large value is not recommended when the environment is unstable. + + +.. _tuning-pref-exchange: + +Exchange properties +------------------- + +The Exchange service is responsible for transferring data between Presto nodes. Adjusting these properties may help to resolve inter-node communication issues or improve network utilization. + +``exchange.client-threads`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``25`` + * **Description:** Number of threads that the exchange server can spawn to handle clients. Higher value will increase concurrency but excessively high values may cause a drop in performance due to context switches and additional memory usage. + + +``exchange.concurrent-request-multiplier`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``3`` + * **Description:** Multiplier determining how many clients of the exchange server may be spawned relative to available buffer memory. The number of possible clients is determined by heuristic as the number of clients that can fit into available buffer space based on average buffer usage per request times this multiplier. For example with the ``exchange.max-buffer-size`` of ``32 MB`` and ``20 MB`` already used, and average bytes per request being ``2MB`` up to ``exchange.concurrent-request-multipier`` * ((``32MB`` - ``20MB``) / ``2MB``) = ``exchange.concurrent-request-multiplier`` * ``6`` may be spawned. Tuning this value adjusts the heuristic, which may increase concurrency and improve network utilization. + + +``exchange.max-buffer-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``32 MB`` + * **Description:** Size of memory block reserved for the client buffer in exchange server. Lower value may increase processing time under heavy load. Increasing this value may improve network utilization, but will reduce the amount of memory available for other activities. + + +``exchange.max-response-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size, at least ``1 MB``) + * **Default value:** ``16 MB`` + * **Description:** Max size of messages sent through the exchange server. The size of message headers is included in this value, so the amount of data sent per message will be a little lower. Increasing this value may improve network utilization if the network is stable. In an unstable network environment, making this value smaller may improve stability. + + +.. _tuning-pref-node: + +Node scheduler properties +------------------------- + +``node-scheduler.max-pending-splits-per-node-per-stage`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``10`` + * **Description:** Must be smaller than ``node-scheduler.max-splits-per-node``. This property describes how many splits can be queued to each worker node. Having this value higher will allow more jobs to be queued but will cause resources to be used for that. Using a higher value is recommended if queries are submitted in large batches, (eg. running a large group of reports periodically). Increasing this value may help to avoid query drops and decrease the risk of short query starvation. Too high value may drastically increase processing wall time if node distribution of query work will be skew. This is especially important if nodes do have important differences in performance. The best value for that is enough to provide at least one split always waiting to be process but not higher. + + +``node-scheduler.max-splits-per-node`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``100`` + * **Description:** This property limits the number of splits that can be scheduled for each node. Increasing this value will allow the cluster to process more queries or reduce visibility of problems connected to data skew. Excessively high values may result in poor performance due to context switching and higher memory reservation for cluster metadata. + + +``node-scheduler.min-candidates`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``10`` + * **Description:** The minimal number of node candidates check by scheduler when looking for a node to schedule a split. Having this value to low may increase skew of work distribution between nodes. Too high value may increase latency of query and CPU load. The value should be aligned with number of nodes in cluster. + + +``node-scheduler.multiple-tasks-per-node-enabled`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** Allow nodes to be selected multiple times by the node scheduler in a single stage. With this property set to ``false`` the ``hash_partition_count`` is capped at number of nodes in system. Having this set to ``true`` may allow better scheduling and concurrency, which would reduce the number of outliers and speed up computations. It may also improve reliability in unstable network conditions. The drawbacks are that some optimization may work less efficiently on smaller partitions. Also slight hardware efficiency drop is expected in heavy loaded system. + +.. _node-scheduler-network-topology: + +``node-scheduler.network-topology`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (``legacy`` or ``flat``) + * **Default value:** ``legacy`` + * **Description:** Sets the network topology to use when scheduling splits. ``legacy`` will ignore the topology when scheduling splits. ``flat`` will try to schedule splits on the host where the data is located by reserving 50% of the work queue for local splits. It is recommended to use ``flat`` for clusters where distributed storage runs on the same nodes as Presto workers. + + +.. _tuning-pref-optimizer: + +Optimizer properties +-------------------- + +``optimizer.processing-optimization`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (``disabled``, ``columnar`` or ``columnar_dictionary``) + * **Default value:** ``disabled`` + * **Description:** Setting this property changes how filtering and projection operators are processed. Setting it to ``columnar`` allows Presto to use columnar processing instead of row by row. Setting ``columnar_dictionary`` adds additional dictionary to simplify columnar scan. Setting this to a value other than ``disabled`` may improve performance for data containing large rows often filtered by a simple key. This can also be specified on a per-query basis using the ``processing_optimization`` session property. + +``optimizer.dictionary-aggregation`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** Enables optimization for aggregations on dictionaries. This can also be specified on a per-query basis using the ``dictionary_aggregation`` session property. + + +``optimizer.optimize-hash-generation`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``true`` + * **Description:** Compute hash codes for distribution, joins, and aggregations early in the query plan allowing result to be shared between operations later in the plan. While this will increase the preprocessing time, it may allow the optimizer to drop some computations later in query processing. In most cases it will decrease overall query processing time. This can also be specified on a per-query basis using the ``optimize_hash_generation`` session property. + + +``optimizer.optimize-metadata-queries`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** Setting this property to ``true`` enables optimization of some aggregations by using values that are kept in metadata. This allows Presto to execute some simple queries in ``O(1)`` time. Currently this optimization applies to ``max``, ``min`` and ``approx_distinct`` of partition keys and other aggregation insensitive to the cardinality of the input (including ``DISTINCT`` aggregates). Using this may speed some queries significantly, though it may have a negative effect when used with very small data sets. Also it may cause incorrect/not accurate/invalid results in some backend db, especially in Hive when there are partition without any rows. + + +``optimizer.optimize-single-distinct`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``true`` + * **Description:** Enables the single distinct optimization. This optimization will try to replace multiple DISTINCT clauses with a single GROUP BY clause. Enabling this optimization will speed up some specific SELECT queries, but analyzing all queries to check if they qualify for this optimization may be a slight overhead. + + +``optimizer.push-table-write-through-union`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``true`` + * **Description:** Parallelize writes when using UNION ALL in queries that write data. This improves the speed of writing output tables in UNION ALL queries because these writes do not require additional synchronization when collecting results. Enabling this optimization can improve UNION ALL speed when write speed is not yet saturated. However it may slow down queries in an already heavily loaded system. This can also be specified on a per-query basis using the ``push_table_write_through_union`` session property. + + +.. _tuning-pref-query: + +Query execution properties +-------------------------- + + +``query.execution-policy`` +^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (``all-at-once`` or ``phased``) + * **Default value:** ``all-at-once`` + * **Description:** Setting this value to ``phased`` will allow the query scheduler to split a single query execution between different time slots. This will allow Presto to switch context more often and possibly stage the partially executed query in order to increase robustness. Average time to execute a query may slightly increase after setting this to ``phased``, but query execution time will be more consistent. This can also be specified on a per-query basis using the ``execution_policy`` session property. + + +``query.initial-hash-partitions`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``100`` + * **Description:** This value is used to determine how many nodes may share the same query when fixed partitioning is chosen by Presto. Manipulating this value will affect the distribution of work between nodes. A value lower then the number of Presto nodes may lower the utilization of the cluster in a low traffic environment. An excessively high value will cause multiple partitions of the same query to be assigned to a single node, or Presto may ignore the setting if ``node-scheduler.multiple-tasks-per-node-enabled`` is set to false - the value is internally capped at the number of available worker nodes in such scenario. This can also be specified on a per-query basis using the ``hash_partition_count`` session property. + + +``query.low-memory-killer.delay`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (duration, at least ``5s``) + * **Default value:** ``5 m`` + * **Description:** Delay between a cluster running low on memory and invoking a query killer. A lower value may cause more queries to fail fast, but fewer queries to fail in an unexpected way. + + +``query.low-memory-killer.enabled`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** This property controls whether a query killer should be triggered when a cluster is running out of memory. The killer will drop the largest queries first so enabling this option may cause problems with executing large queries in a highly loaded cluster, but should increase stability of smaller queries. + + +``query.manager-executor-pool-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``5`` + * **Description:** Size of the thread pool used for garbage collecting after queries. Threads from this pool are used to free resources from canceled queries, as well as enforce memory limits, queries timeouts etc. More threads will allow for more efficient memory management, and so may help avoid out of memory exceptions in some scenarios. However, having more threads may also increase CPU usage for garbage collecting and will have an additional constant memory cost even if the threads have nothing to do. + + +``query.min-expire-age`` +^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (duration) + * **Default value:** ``15 m`` + * **Description:** This property describes the minimum time after which the query metadata may be removed from the server. If the value is too low, the client may not be able to receive information about query completion. The value describes minimum time, but if there is space available in the history queue the query data will be kept longer. The size of the history queue is defined by the ``query.max-history property``. + + +``query.max-concurrent-queries`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``1000`` + * **Description:** **Deprecated** Describes how many queries can be processed simultaneously in a single cluster node. In new configurations, the ``query.queue-config-file`` should be used instead. + + +.. _query-max-memory: + +``query.max-memory`` +^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``20 GB`` + * **Description:** Serves as the default value for the ``query_max_memory`` session property. This property also describes the strict limit of total memory that may be used to process a single query. A query is dropped if the limit is reached unless the ``resource_overcommit`` session property is set. This property helps ensure that a single query cannot use all resources in a cluster. It should be set higher than what is expected to be needed for a typical query in the system. It is important to set this to higher than the default if Presto will be running complex queries on large datasets. It is possible to decrease the query memory limit for a session by setting ``query_max_memory`` to a smaller value. Setting ``query_max_memory`` to a greater value than ``query.max-memory`` will not have any effect. + + +``query.max-memory-per-node`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``JVM max memory`` * ``0.1`` + * **Description:** The purpose of that is same as of :ref:`query.max-memory` but the memory is not counted cluster-wise but node-wise instead. This should not be any lower than ``query.max-memory / number of nodes``. It may be required to increase this value if data are skewed. + + +``query.max-queued-queries`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``5000`` + * **Description:** **Deprecated** Describes how many queries may wait in Presto coordinator queue. If the limit is reached the server will drop all new incoming queries. Setting this value high may allow to order a lot of queries at once with the cost of additional memory needed to keep informations about tasks to process. Lowering this value will decrease system capacity but will allow to utilize memory for real processing of data instead of queuing. It shouldn't be used in new configuration, the ``query.queue-config-file`` can be used instead. + + +``query.max-run-time`` +^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (duration) + * **Default value:** ``100 d`` + * **Description:** Used as default for session property ``query_max_run_time``. If the Presto works in environment where there are mostly very long queries (over 100 days) than it may be a good idea to increase this value to avoid dropping clients that didn't set their session property correctly. On the other hand in the Presto works in environment where they are only very short queries this value set to small value may be used to detect user errors in queries. It may also be decreased in poor Presto cluster configuration with mostly short queries to increase garbage collection efficiency and by that lowering memory usage in cluster. + + +``query.queue-config-file`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` + * **Default value:** + * **Description:** The path to the queue config file. Queues are used to manage the number of concurrent queries across the system. More information on queues and how to configure them can be found in :doc:/admin/queue. + + +``query.remote-task.max-callback-threads`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``1000`` + * **Description:** This value describes the maximum size of the thread pool used to handle responses to HTTP requests for each task. Increasing this value will cause more resources to be used for handling HTTP communication itself, but may also improve response time when Presto is distributed across many hosts or there are a lot of small queries being run. + + +``query.remote-task.min-error-duration`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (duration, at least ``1s``) + * **Default value:** ``2 m`` + * **Description:** The minimal time that HTTP worker must be unavailable before the coordinator assumes the worker crashed. A higher value may be recommended in unstable connection conditions. This value is only a bottom line so there is no guarantee that a node will be considered dead after the ``query.remote-task.min-error-duration``. In order to consider a node dead, the defined time must pass between two failed attempts of HTTP communication, with no successful communication in between. + + +``query.schedule-split-batch-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``1000`` + * **Description:** The size of single data chunk expressed in split that will be processed in a single stage. Higher value may be used if system works in reliable environment and the responsiveness is less important then average answer time, it will require more memory reserve though. Decreasing this value may have a positive effect if there are lots of nodes in system and calculations are relatively heavy for each of splits. + + +.. _tuning-pref-task: + +Tasks managment properties +-------------------------- + + +.. _task-concurrency: + +``task.concurrency`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``1`` + * **Description:** Default local concurrency for parallel operators. Serves as the default value for the ``task_concurrency`` session property. Increasing this value is strongly recommended when any of CPU, IO or memory is not saturated on a regular basis. It will allow queries to utilize as many resources as possible. Setting this value too high will cause queries to slow down. Slow down may happen even if none of the resources is saturated as there are cases in which increasing parallelism is not possible due to algorithms limitations. + +``task.info-refresh-max-wait`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + * **Type:** ``String`` (duration) + * **Default value:** ``1s`` + * **Description:** Controls staleness of task information, which is used in scheduling. Increasing this value can reduce coordinator CPU load, but may result in suboptimal split scheduling. + + +``task.http-response-threads`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``100`` + * **Description:** Max number of threads that may be created to handle http responses. Threads are created on demand and they end when there is no response to be sent. That means that there is no overhead if there are only a small number of requests handled by the system, even if this value is big. On the other hand increasing this value may increase utilization of CPU in multicore environment (with the cost of memory usage). Also in systems having a lot of requests, the response time distribution may be manipulated using this property. A higher value may be used to prevent outliers from increasing average response time. + + +``task.http-timeout-threads`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``3`` + * **Description:** Number of threads spawned for handling timeouts of http requests. Presto server sends update of query status whenever it is different then the one that client knows about. However in order to ensure client that connection is still alive, server sends this data after delay declared internally in HTTP headers (by default ``200 ms``). This property tells how many threads are designated to handle this delay. If the property turn out to low it's possible that the update time will increase even significantly when comparing to requested value (``200ms``). Increasing this value may solve the problem, but it generate a cost of additional memory even if threads are not used all the time. If there is no problem with updating status of query this value should not be manipulated. + + +``task.info-update-interval`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (duration) + * **Default value:** ``200 ms`` + * **Description:** Controls staleness of task information which is used in scheduling. Increasing this value can reduce coordinator CPU load but may result in suboptimal split scheduling. + + +``task.max-partial-aggregation-memory`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``16 MB`` + * **Description:** Max size of partial aggregation result (if it is splitable). Increasing this value will decrease the fragmentation of the result which may improve query run times and CPU utilization with the cost of additional memory usage. Also a high value may cause a drop in performance in unstable cluster conditions. + + + +``task.max-worker-threads`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``Node CPUs`` * ``2`` + * **Description:** Sets the number of threads used by workers to process splits. Increasing this number can improve throughput if worker CPU utilization is low and all the threads are in use, but will cause increased heap space usage. Too high value may cause drop in performance due to a context switching. The number of active threads is available via the ``com.facebook.presto.execution.TaskExecutor.RunningSplits`` JMX stat. + + +``task.min-drivers`` +^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``Node CPUs`` * ``4`` + * **Description:** This describes how many drivers are kept on a worker at any time. A lower value may cause better responsiveness for new tasks, but decrease CPU utilization. A higher value makes context switching faster, but uses additional memory. In general, if it is possible to assign a split to a driver, it is assigned if: there are fewer than ``3`` drivers assigned to the given task OR there are fewer drivers on the worker than ``task.min-drivers`` OR the task has been enqueued with the ``force start`` property. + + +``task.operator-pre-allocated-memory`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``16 MB`` + * **Description:** Memory preallocated for each driver in query execution. Increasing this value may cause less efficient memory usage but will fail fast in a low memory environment more frequently. + + +``task.writer-count`` +^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``1`` + * **Description:** The number of concurrent writer threads per worker per query. Serves as the default for the session property ``task_writer_count``. Increasing this value may increase write speed, especially when a query is NOT I/O bounded and could use more CPU cores for parallel writes. However, in many cases increasing this value will visibly increase computation time while writing. + + + +.. _tuning-pref-session: + +Session properties +------------------ + +``processing_optimization`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (``disabled``, ``columnar`` or ``columnar_dictionary``) + * **Default value:** ``optimizer.processing-optimization`` (``false``) + * **Description:** See :ref:`optimizer.processing-optimization`. + + +``execution_policy`` +^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (``all-at-once`` or ``phased``) + * **Default value:** ``query.execution-policy`` (``all-at-once``) + * **Description:** See :ref:`query.execution-policy `. + + +``hash_partition_count`` +^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``query.initial-hash-partitions`` (``100``) + * **Description:** See :ref:`query.initial-hash-partitions `. + + +``optimize_hash_generation`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``optimizer.optimize-hash-generation`` (``true``) + * **Description:** See :ref:`optimizer.optimize-hash-generation `. + + +``plan_with_table_node_partitioning`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``true`` + * **Description:** **Experimental.** Adapt plan to use backend partitioning. When this is set, presto will try to partition data for workers such that each worker gets a chunk of data from a single backend partition. This enables workers to take advantage of the I/O distribution optimization in table partitioning. Note that this property is only used if a given projection uses all columns used for table partitioning inside connector. + + + +``push_table_write_through_union`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``optimizer.push-table-write-through-union`` (``true``) + * **Description:** See :ref:`optimizer.push-table-writethrough-union `. + + +``query_max_memory`` +^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``query.max-memory`` (``20 GB``) + * **Description:** This property can be use to be nice to the cluster if a particular query is not as important as the usual cluster routines. Setting this value to less than the server property ``query.max-memory`` will cause Presto to drop the query in the session if it will require more then ``query_max_memory`` memory. Setting this value to higher than ``query.max-memory`` will not have any effect. + + + +``query_max_run_time`` +^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (duration) + * **Default value:** ``query.max-run-time`` (``100 d``) + * **Description:** If the expected query processing time is higher than ``query.max-run-time``, it is crucial to set this session property to prevent results of long running queries being dropped after ``query.max-run-time``. A session may also set this value to less than ``query.max-run-time`` in order to crosscheck for bugs in the query. Setting this value less than ``query.max-run-time`` may be particularly useful for a session with a very large number of short-running queries. It is important to set this value to much higher than the average query time to avoid problems with outliers (some queries may randomly take much longer due to cluster load and other circumstances). As the query timed out by this limit immediately returns all used resources this may be particularly useful in query management systems to force user limits. + + +``resource_overcommit`` +^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** Use resources that are not guaranteed to be available to a query. This property allows you to exceed the limits of memory available per query and session. It may allow resources to be used more efficiently, but may also cause non-deterministic query drops due to insufficient memory on machine. It can be particularly useful for performing more demanding queries. + + +``task_concurrency`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (power of 2). + * **Default value:** ``task.concurrency`` (``1``) + * **Description:** Default number of local parallel aggregation jobs per worker. Unlike `task.concurrency` this property must be power of two. See :ref:`task.concurrency`. + + +``task_writer_count`` +^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``task.writer-count`` (``1``) + * **Description:** See :ref:`task.writer-count `. + + diff --git a/presto-docs/src/main/sphinx/admin/tuning.rst b/presto-docs/src/main/sphinx/admin/tuning.rst index 05440137aba7..69a1ed24c59b 100644 --- a/presto-docs/src/main/sphinx/admin/tuning.rst +++ b/presto-docs/src/main/sphinx/admin/tuning.rst @@ -5,30 +5,23 @@ Tuning Presto The default Presto settings should work well for most workloads. The following information may help you if your cluster is facing a specific performance problem. -Config Properties ------------------ - -These configuration options may require tuning in specific situations: - -* ``task.max-worker-threads``: - Sets the number of threads used by workers to process splits. Increasing this number - can improve throughput if worker CPU utilization is low and all the threads are in use, - but will cause increased heap space usage. The number of active threads is available via - the ``com.facebook.presto.execution.TaskExecutor.RunningSplits`` JMX stat. - -* ``distributed-joins-enabled``: - Use hash distributed joins instead of broadcast joins. Distributed joins - require redistributing both tables using a hash of the join key. This can - be slower (sometimes substantially) than broadcast joins, but allows much - larger joins. Broadcast joins require that the tables on the right side of - the join fit in memory on each machine, whereas with distributed joins the - tables on the right side have to fit in distributed memory. This can also be - specified on a per-query basis using the ``distributed_join`` session property. - -* ``node-scheduler.network-topology``: - Sets the network topology to use when scheduling splits. "legacy" will ignore - the topology when scheduling splits. "flat" will try to schedule splits on the same - host as the data is located by reserving 50% of the work queue for local splits. +When setting up a cluster for a specific workload it may be necessary to adjust the +following properties to ensure optimal performance: + + * :ref:`join-distribution-type ` + * :ref:`query.max-memory ` + * :ref:`query.max-memory-per-node ` + * :ref:`query.initial-hash-partitions ` + * :ref:`task.concurrency ` + +Those and other Presto properties are described in :doc:`properties article`. + +As an example, on a 11-node cluster dedicated to Presto (1 Coordinator + 10 Workers) with 8-core CPU and 128GB of RAM per node, you might want to start tuning with following values: + + * `query.max-memory = 200GB` + * `query.max-memory-per-node = 32GB` + * `query.initial-hash-partitions = 10` + * `task.concurrency = 8` JVM Settings ------------ diff --git a/presto-docs/src/main/sphinx/connector/hive.rst b/presto-docs/src/main/sphinx/connector/hive.rst index f1e376603019..9b15ca8aee02 100644 --- a/presto-docs/src/main/sphinx/connector/hive.rst +++ b/presto-docs/src/main/sphinx/connector/hive.rst @@ -132,10 +132,7 @@ Property Name Description ``hive.compression-codec`` The compression codec to use when writing files. ``GZIP`` -``hive.force-local-scheduling`` Force splits to be scheduled on the same node as the Hadoop ``false`` - DataNode process serving the split data. This is useful for - installations where Presto is collocated with every - DataNode. +``hive.force-local-scheduling`` See :ref:`tuning section` ``false`` ``hive.respect-table-format`` Should new partitions be written using the existing table ``true`` format or the default Presto format? @@ -360,6 +357,212 @@ This table can then be queried in Presto:: SELECT * FROM hive.web.page_view; + +.. _tuning-pref-hive: + +Tuning +------- + +The following configuration properties may have an impact on connector performance: + +``hive.assume-canonical-partition-keys`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** Enable optimized metastore partition fetching for non-string partition keys. Setting this property allows to filter non-string partition keys while reading them from hive, based on the assumption that they are stored in canonical (java) format. This is disabled by default as hive allows to use non-canonical format as well (eg. boolean value ``false`` may be represented as ``0``, ``false``, ``False`` and more). Used correctly this property may drastically improve read time by reducing number of partition loaded from hive. Setting this property for non-canonical data format may cause erratic behavior. + + +``hive.domain-compaction-threshold`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``100`` + * **Description:** Maximum number of ranges/values allowed while reading hive data without compacting it. A higher value will cause more data fragmentation but allow the use of the row skipping feature when reading ORC data. Increasing this value may have a large impact on ``IN`` and ``OR`` clause performance in scenarios making use of row skipping. + + +.. _force-local-scheduling: + +``hive.force-local-scheduling`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** Force splits to be scheduled on the same node (ignoring normal node selection procedures) as the Hadoop DataNode process serving the split data. This is useful for installations where Presto is collocated with every DataNode and may increase queries time significantly. The drawback may be that if some data are accessed more often, the utilization of some nodes may be low even if the whole system is heavy loaded. See also :ref:`node-scheduler.network-topology` if less strict constrain is preferred - especially if some nodes are overloaded and other are not fully utilized. + + +``hive.max-initial-split-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``hive.max-split-size`` / ``2`` (``32 MB``) + * **Description:** This property describes the maximum size of the first ``hive.max-initial-splits`` splits created for a query. the logic behind initial splits is described in ``hive.max-initial-splits``. Lower values will increase concurrency for small queries. This property represents the maximum size, as the real size may be lower when the amount of data to read is less than ``hive.max-initial-split-size`` (e.g. at the end of a block on a DataNode). + + +``hive.max-initial-splits`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` + * **Default value:** ``200`` + * **Description:** This property describes how many splits may be initially created for a single query using ``hive.max-initial-split-size`` instead of ``hive.max-split-size``. A higher value will force more splits to have a smaller size (``hive.max-initial-splits`` is expected to be smaller than ``hive.max-split-size``), effectively increasing the definition of what is considered a "small query". The purpose of the smaller split size for the initial splits is to increase concurrency for smaller queries. + + +``hive.max-outstanding-splits`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``1000`` + * **Description:** Limit on the nubmer of splits waiting to be served by a split source. After reaching this limit, writers will stop writing new splits until some of hteme are used by workers. Higher values will increase memory usage, but allow IO to be concentrated at one time, which may be faster and increase resource utilization. + + +``hive.max-partitions-per-writers`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``100`` + * **Description:** Maximum number of partitions per writer. A query will fail if it requires more partitions per writer than allowed by this property. It can be helpful to have queries beyond the expected maximum partitions to fail to help with error detection. Also it may allow to preactivly avoid out of memory problem. + + +``hive.max-split-iterator-threads`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``1000`` + * **Description:** This property describes how many threads may be used to iterate through splits when loading them to the worker nodes. A higher value may increase parallelism, but increased concurrency may cause too much time to be spent on context switching. + + +``hive.max-split-size`` +^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``64 MB`` + * **Description:** The maximum size of splits created after the initial splits. The logic for initial splits is described in ``hive.max-initial-splits``. A higher value will reduce parallelism. This may be desirable for very large queries and a stable cluster because it allows for more efficient processing of local data without the context switching, synchronization and data collection that result from parallelization. The optimal value should be aligned with the average query size in the system. + + +``hive.metastore.partition-batch-size.max`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``100`` + * **Description:** This together with ``hive.metastore.partition-batch-size.min`` defines the range of partition sizes read from Hive. The first partition is always of size ``hive.metastore.partition-batch-size.min`` and each following partition is two times bigger than previous up to ``hive.mestastore.partition-batch-size.max`` (the formula for partition size ``n`` is min(``hive.metastore.partition-batch-size.max``, (``2``^``n``) * ``hive.metastore.partition-batch-size.min``)). This algorithm allows for live adjustment of partition size according to the processing requirements. If the queries in the system will differ significantly from each other in size, then this range should be extended to better adjust to processing requirements. If the queries in the system will mostly be of the same size, then setting both values to the same maximally tuned value may give a slight edge in processing time. + + +``hive.metastore.partition-batch-size.min`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``10`` + * **Description:** See ``hive.metastore.partition-batch-size.max``. + + +``hive.orc.max-buffer-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``8 MB`` + * **Description:** Serves as default value for ``orc_max_buffer_size`` session properties defining max size of ORC read operators. Higher value will allow bigger chunks to be processed but will decrease concurrency level. + + +``hive.orc.max-merge-distance`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``1 MB`` + * **Description:** Serves as the default value for the ``orc_max_merge_distance`` session property. Two reads from an ORC file may be merged into a single read if the distance between the requested data ranges in the data source is less than or equal to this value. + + +``hive.orc.stream-buffer-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``8 MB`` + * **Description:** Serves as the default value for the ``orc_max_buffer_size`` session property. It defines the maximum size of ORC read operators. A higher value will allow bigger chunks to be processed, but will decrease concurrency. + + +``hive.orc.use-column-names`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** Access ORC columns using names from the file. By default, Hive access columns in ORC files using the order recoded in the Hive metastore. Setting this property allows to use columns names recorded in the ORC file instead. + + +.. _parquet-optimized-reader: + +``hive.parquet-optimized-reader.enabled`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** *Deprecated* Serves as default value for ``parquet_optimized_reader_enabled`` session property. Enables number of reader improvements introduced by alternative parquet implementation. The new reader supports vectorized reads, lazy loading, and predicate push down, all of which make the reader more efficient and typically reduces wall clock time for a query. However as the code has changed significantly it may or may not introduce some minor issues, so it can be disabled if some problems with environment are noticed. This property enables/disables all optimizations except predicate push down as it is managed by ``hive.parquet-predicate-pushdown.enabled`` property. + + +``hive.parquet-predicate-pushdown.enabled`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** *Deprecated* Serves as default value for ``parquet_predicate_pushdown_enabled`` sesssion property. See :ref:`hive.parquet-optimized-reader.enabled`. + + +``hive.parquet.use-column-names`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Boolean`` + * **Default value:** ``false`` + * **Description:** Access Parquet columns using names from the file. By default, columns in Parquet files are accessed by their ordinal position in the Hive metastore. Setting this property allows access by column name recorded in the Parquet file instead. + + +``hive.s3.max-connections`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``Integer`` (at least ``1``) + * **Default value:** ``500`` + * **Description:** The maximum number of connections to S3 that may be open at a time by the S3 driver. A higher value may increase network utilization when a cluster is used on a high speed network. However, a higher values relies more on S3 servers being well configured for high parallelism. + + +``hive.s3.multipart.min-file-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size, at least ``16 MB``) + * **Default value:** ``16 MB`` + * **Description:** This property describes how big a file must be to be uploaded to an S3 cluster using the multipart upload feature. Amazon recommends using ``100 MB``, but a lower value may increase upload parallelism and decrease the ``data lost``/``data sent`` ratio in unstable network conditions. + + +``hive.s3.multipart.min-part-size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size, at least ``5 MB``) + * **Default value:** ``5 MB`` + * **Description:** Defines the minimum part size for upload parts. Decreasing the minimum part size causes multipart uploads to be split into a larger number of smaller parts. Setting this value too low has a negative effect on transfer speeds, causing extra latency and network communication for each part. + + +There are also following session properties allowing to control connector behavior on single query basis: + +``orc_max_buffer_size`` +^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``hive.orc.max-buffer-size`` (``8 MB``) + * **Description:** See :ref:`hive.orc.max-buffer-size `. + + +``orc_max_merge_distance`` +^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``hive.orc.max-merge-distance`` (``1 MB``) + * **Description:** See :ref:`hive.orc.max-merge-distance `. + + +``orc_stream_buffer_size`` +^^^^^^^^^^^^^^^^^^^^^^^^^^ + + * **Type:** ``String`` (data size) + * **Default value:** ``hive.orc.max-buffer-size`` (``8 MB``) + * **Description:** See :ref:`hive.orc.max-buffer-size `. + + Hive Connector Limitations --------------------------