From bc860b41b7605109a152e5ba9fbc90db375fd29d Mon Sep 17 00:00:00 2001
From: Jonathan Buttner
Date: Mon, 6 May 2024 11:54:06 -0400
Subject: [PATCH] Finish settings

---
 .../settings/inference-settings.asciidoc | 284 +++---------------
 1 file changed, 50 insertions(+), 234 deletions(-)

diff --git a/docs/reference/settings/inference-settings.asciidoc b/docs/reference/settings/inference-settings.asciidoc
index e0b4d048e0a98..2cccd1e771b20 100644
--- a/docs/reference/settings/inference-settings.asciidoc
+++ b/docs/reference/settings/inference-settings.asciidoc
@@ -1,7 +1,7 @@
 [role="xpack"]
 [[inference-settings]]
-=== Machine learning settings in Elasticsearch
+=== Inference API settings in Elasticsearch
 ++++
 Inference settings
 ++++
@@ -9,16 +9,32 @@
 [[inference-settings-description]]
 // tag::inference-settings-description-tag[]
 You do not need to configure any settings to use the {infer} APIs. Each setting has a default set.
-
 // end::inference-settings-description-tag[]

 [discrete]
-[[general-inference-settings]]
-==== General inference API settings
+[[xpack-inference-logging]]
+// tag::inference-logging[]
+==== Inference API logging settings
+
+When certain failures occur, a log message is emitted. In the case of a
+recurring failure, the logging throttler restricts repeated messages from being logged.
+
+`xpack.inference.logging.reset_interval`::
+(<>) Specifies the interval at which a cleanup thread clears the internal
+cache of previously logged messages. Defaults to one day (`1d`).
+
+`xpack.inference.logging.wait_duration`::
+(<>) Specifies the amount of time to wait after logging a message before that
+message can be logged again. Defaults to one hour (`1h`).
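+
+As a minimal sketch (assuming these are set as node-level settings in
+`elasticsearch.yml`; the values shown simply restate the defaults above):
+
+```
+# Illustrative only: these values restate the documented defaults.
+xpack.inference.logging.reset_interval: 1d
+xpack.inference.logging.wait_duration: 1h
+```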
+// end::inference-logging[]
+
+[[xpack-inference-http-settings]]
+// tag::inference-http-settings[]
+==== Inference API HTTP settings

 `xpack.inference.http.max_response_size`::
-(<>) Specifies the maximum size an HTTP response is allowed to have, defaults to
-`10mb`, the maximum configurable value is `50mb`.
+(<>) Specifies the maximum size in bytes an HTTP response is allowed to have.
+Defaults to `10mb`; the maximum configurable value is `50mb`.

 `xpack.inference.http.max_total_connections`::
 (<>) Specifies the maximum number of connections the internal connection pool can
@@ -26,7 +42,7 @@ lease. Defaults to `50`.

 `xpack.inference.http.max_route_connections`::
 (<>) Specifies the maximum number of connections a single route can lease from
-the internal connection pool. If this setting is set to a value equal or greater than
+the internal connection pool. If this setting is set to a value equal to or greater than
 `xpack.inference.http.max_total_connections`, then a single third party service could
 lease all available connections and other third party services would be unable to lease connections.
 Defaults to `20`.
@@ -39,238 +55,38 @@ multiple third party service are contending for the available connections in the
 (<>) Specifies the maximum duration a connection can be unused before it is marked as
 idle and can be closed and removed from the shared connection pool. Defaults to one minute (`1m`).

+`xpack.inference.http.request_executor.queue_capacity`::
+(<>) Specifies the size of the internal queue for requests waiting to be sent. If
+the queue is full and a request is sent to the inference API, the request is rejected. Defaults to `2000`.
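+
+As an illustrative sketch only (assuming node-level settings in `elasticsearch.yml`;
+the values are arbitrary examples, not tuning advice), the pool and queue limits
+above might be raised like this:
+
+```
+# Example values only, not recommendations.
+xpack.inference.http.max_response_size: 20mb
+xpack.inference.http.max_total_connections: 100
+# Keep this below max_total_connections so one service cannot lease every connection.
+xpack.inference.http.max_route_connections: 25
+xpack.inference.http.request_executor.queue_capacity: 5000
+```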
+
+[[xpack-inference-http-retry-settings]]
+==== Inference API HTTP retry settings

-`node.roles: [ ml ]`::
-(<>) Set `node.roles` to contain `ml` to identify
-the node as a _{ml} node_. If you want to run {ml} jobs, there must be at least
-one {ml} node in your cluster.
-+
-If you set `node.roles`, you must explicitly specify all the required roles for
-the node. To learn more, refer to <>.
-+
-[IMPORTANT]
-====
-* On dedicated coordinating nodes or dedicated master nodes, do not set
-the `ml` role.
-* It is strongly recommended that dedicated {ml} nodes also have the
-`remote_cluster_client` role; otherwise, {ccs} fails when used in {ml} jobs or
-{dfeeds}. See <>.
-====
-
-`xpack.ml.enabled`::
-(<>) The default value (`true`) enables {ml} APIs
-on the node.
-+
-IMPORTANT: If you want to use {ml-features} in your cluster, it is recommended
-that you use the default value for this setting on all nodes.
-+
-If set to `false`, the {ml} APIs are disabled on the node. For example, the node
-cannot open jobs, start {dfeeds}, receive transport (internal) communication
-requests, or requests from clients (including {kib}) related to {ml} APIs. If
-`xpack.ml.enabled` is not set uniformly across all nodes in your cluster then you
-are likely to experience problems with {ml} functionality not fully working.
-+
-You must not use any {ml} functionality from ingest pipelines if `xpack.ml.enabled`
-is `false` on any node. Before setting `xpack.ml.enabled` to `false` on a node,
-consider whether you really meant to just exclude `ml` from the `node.roles`.
-Excluding `ml` from the <> will stop the node from
-running {ml} jobs and NLP models, but it will still be aware that {ml} functionality
-exists. Setting `xpack.ml.enabled` to `false` should be reserved for situations
-where you cannot use {ml} functionality at all in your cluster due to hardware
-limitations as described <>.
+
+When a third party service returns a transient failure code (for example, a `429`
+status code), the request is retried by the inference API. These settings govern
+the retry behavior. When a request is retried, exponential backoff is used.

-`xpack.ml.inference_model.cache_size`::
-(<>) The maximum inference cache size allowed.
-The inference cache exists in the JVM heap on each ingest node. The cache
-affords faster processing times for the `inference` processor. The value can be
-a static byte sized value (such as `2gb`) or a percentage of total allocated
-heap. Defaults to `40%`. See also <>.
+`xpack.inference.http.retry.initial_delay`::
+(<>) Specifies the initial delay before retrying a request. Defaults to one second
+(`1s`).

-[[xpack-interference-model-ttl]]
-// tag::interference-model-ttl-tag[]
-`xpack.ml.inference_model.time_to_live` {ess-icon}::
-(<>) The time to live (TTL) for trained models in
-the inference model cache. The TTL is calculated from last access. Users of the
-cache (such as the inference processor or inference aggregator) cache a model on
-its first use and reset the TTL on every use. If a cached model is not accessed
-for the duration of the TTL, it is flagged for eviction from the cache. If a
-document is processed later, the model is again loaded into the cache. To update
-this setting in {ess}, see
-{cloud}/ec-add-user-settings.html[Add {es} user settings]. Defaults to `5m`.
-// end::interference-model-ttl-tag[]
+`xpack.inference.http.retry.max_delay_bound`::
+(<>) Specifies the maximum delay between retries. Defaults to five seconds (`5s`).

-`xpack.ml.max_inference_processors`::
-(<>) The total number of `inference` type
-processors allowed across all ingest pipelines. Once the limit is reached,
-adding an `inference` processor to a pipeline is disallowed. Defaults to `50`.
+`xpack.inference.http.retry.timeout`::
+(<>) Specifies the maximum amount of time a request can be retried.
+Once this time is exceeded, the request is no longer retried and a failure is returned.
+Defaults to 30 seconds (`30s`).
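+
+For illustration, with the defaults above and assuming the backoff delay doubles
+on each attempt (the backoff factor is not documented here), a transient `429`
+would be retried after roughly 1s, 2s, 4s, then 5s (capped by `max_delay_bound`)
+until the 30s `timeout` is exhausted. A sketch of overriding these values
+(illustrative only, assuming node-level settings in `elasticsearch.yml`):
+
+```
+# Illustrative values; none of these is a recommendation.
+xpack.inference.http.retry.initial_delay: 2s
+xpack.inference.http.retry.max_delay_bound: 10s
+xpack.inference.http.retry.timeout: 60s
+```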
+// end::inference-http-settings[]

-`xpack.ml.max_machine_memory_percent`::
-(<>) The maximum percentage of the machine's
-memory that {ml} may use for running analytics processes. These processes are
-separate to the {es} JVM. The limit is based on the total memory of the machine,
-not current free memory. Jobs are not allocated to a node if doing so would
-cause the estimated memory use of {ml} jobs to exceed the limit. When the
-{operator-feature} is enabled, this setting can be updated only by operator
-users. The minimum value is `5`; the maximum value is `200`. Defaults to `30`.
-+
---
-TIP: Do not configure this setting to a value higher than the amount of memory
-left over after running the {es} JVM unless you have enough swap space to
-accommodate it and have determined this is an appropriate configuration for a
-specialist use case. The maximum setting value is for the special case where it
-has been determined that using swap space for {ml} jobs is acceptable. The
-general best practice is to not use swap on {es} nodes.
+
+[[xpack-inference-input-text]]
+// tag::inference-input-text[]
+==== Inference API input text settings
---
-
-`xpack.ml.max_model_memory_limit`::
-(<>) The maximum `model_memory_limit` property
-value that can be set for any {ml} jobs in this cluster. If you try to create a
-job with a `model_memory_limit` property value that is greater than this setting
-value, an error occurs. Existing jobs are not affected when you update this
-setting. If this setting is `0` or unset, there is no maximum
-`model_memory_limit` value. If there are no nodes that meet the memory
-requirements for a job, this lack of a maximum memory limit means it's possible
-to create jobs that cannot be assigned to any available nodes. For more
-information about the `model_memory_limit` property, see
-<> or <>. Defaults to `0` if
-`xpack.ml.use_auto_machine_memory_percent` is `false`. If
-`xpack.ml.use_auto_machine_memory_percent` is `true` and
-`xpack.ml.max_model_memory_limit` is not explicitly set then it will default to
-the largest `model_memory_limit` that could be assigned in the cluster.
-
-[[xpack.ml.max_open_jobs]]
-`xpack.ml.max_open_jobs`::
-(<>) The maximum number of jobs that can run
-simultaneously on a node. In this context, jobs include both {anomaly-jobs} and
-{dfanalytics-jobs}. The maximum number of jobs is also constrained by memory
-usage. Thus if the estimated memory usage of the jobs would be higher than
-allowed, fewer jobs will run on a node. Prior to version 7.1, this setting was a
-per-node non-dynamic setting. It became a cluster-wide dynamic setting in
-version 7.1. As a result, changes to its value after node startup are used only
-after every node in the cluster is running version 7.1 or higher. The minimum
-value is `1`; the maximum value is `512`. Defaults to `512`.
-
-`xpack.ml.nightly_maintenance_requests_per_second`::
-(<>) The rate at which the nightly maintenance
-task deletes expired model snapshots and results. The setting is a proxy to the
-<> parameter used in the
-delete by query requests and controls throttling. When the {operator-feature} is
-enabled, this setting can be updated only by operator users. Valid values must
-be greater than `0.0` or equal to `-1.0`, where `-1.0` means a default value is
-used. Defaults to `-1.0`
-
-`xpack.ml.node_concurrent_job_allocations`::
-(<>) The maximum number of jobs that can
-concurrently be in the `opening` state on each node. Typically, jobs spend a
-small amount of time in this state before they move to `open` state. Jobs that
-must restore large models when they are opening spend more time in the `opening`
-state. When the {operator-feature} is enabled, this setting can be updated only
-by operator users. Defaults to `2`.
-
-[discrete]
-[[advanced-inference-settings]]
-==== Advanced machine learning settings
-
-These settings are for advanced use cases; the default values are generally
-sufficient:
-
-`xpack.ml.enable_config_migration`::
-(<>) Reserved. When the {operator-feature} is
-enabled, this setting can be updated only by operator users.
-
-`xpack.ml.max_anomaly_records`::
-(<>) The maximum number of records that are
-output per bucket. Defaults to `500`.
-
-`xpack.ml.max_lazy_ml_nodes`::
-(<>) The number of lazily spun up {ml} nodes.
-Useful in situations where {ml} nodes are not desired until the first {ml} job
-opens. If the current number of {ml} nodes is greater than or equal to this
-setting, it is assumed that there are no more lazy nodes available as the
-desired number of nodes have already been provisioned. If a job is opened and
-this setting has a value greater than zero and there are no nodes that can
-accept the job, the job stays in the `OPENING` state until a new {ml} node is
-added to the cluster and the job is assigned to run on that node. When the
-{operator-feature} is enabled, this setting can be updated only by operator
-users. Defaults to `0`.
-+
-IMPORTANT: This setting assumes some external process is capable of adding {ml}
-nodes to the cluster. This setting is only useful when used in conjunction with
-such an external process.
-
-`xpack.ml.max_ml_node_size`::
-(<>)
-The maximum node size for {ml} nodes in a deployment that supports automatic
-cluster scaling. If you set it to the maximum possible size of future {ml} nodes,
-when a {ml} job is assigned to a lazy node it can check (and fail quickly) when
-scaling cannot support the size of the job. When the {operator-feature} is
-enabled, this setting can be updated only by operator users. Defaults to `0b`,
-which means it will be assumed that automatic cluster scaling can add
-arbitrarily large nodes to the cluster.
-
-[[xpack.ml.model_repository]]
-`xpack.ml.model_repository`::
-(<>)
-The location of the {ml} model repository where the model artifact files are
-available in case of a model installation in a restricted or closed network.
-`xpack.ml.model_repository` can be a string of a file location or an HTTP/HTTPS
-server. Example values are:
-+
---
-```
-xpack.ml.model_repository: file://${path.home}/config/models/
-```
-or
-```
-xpack.ml.model_repository: https://my-custom-backend
-```
-If `xpack.ml.model_repository` is a file location, it must point to a
-subdirectory of the `config` directory of {es}.
---
-
-`xpack.ml.persist_results_max_retries`::
-(<>) The maximum number of times to retry bulk
-indexing requests that fail while processing {ml} results. If the limit is
-reached, the {ml} job stops processing data and its status is `failed`. When the
-{operator-feature} is enabled, this setting can be updated only by operator
-users. The minimum value is `0`; the maximum value is `50`. Defaults to `20`.
-
-`xpack.ml.process_connect_timeout`::
-(<>) The connection timeout for {ml} processes
-that run separately from the {es} JVM. When such processes are started they must
-connect to the {es} JVM. If the process does not connect within the time period
-specified by this setting then the process is assumed to have failed. When the
-{operator-feature} is enabled, this setting can be updated only by operator
-users. The minimum value is `5s`. Defaults to `10s`.
-
-`xpack.ml.use_auto_machine_memory_percent`::
-(<>) If this setting is `true`, the
-`xpack.ml.max_machine_memory_percent` setting is ignored. Instead, the maximum
-percentage of the machine's memory that can be used for running {ml} analytics
-processes is calculated automatically and takes into account the total node size
-and the size of the JVM on the node. When the {operator-feature} is enabled, this
-setting can be updated only by operator users. The default value is `false`.
-+
---
-[IMPORTANT]
-====
-* If you do not have dedicated {ml} nodes (that is to say, the node has
-multiple roles), do not enable this setting. Its calculations assume that {ml}
-analytics are the main purpose of the node.
-* The calculation assumes that dedicated {ml} nodes have at least
-`256MB` memory reserved outside of the JVM. If you have tiny {ml}
-nodes in your cluster, you shouldn't use this setting.
-====
---
-+
-If this setting is `true` it also affects the default value for
-`xpack.ml.max_model_memory_limit`. In this case `xpack.ml.max_model_memory_limit`
-defaults to the largest size that could be assigned in the current cluster.
-
-[discrete]
-[[model-inference-circuit-breaker]]
-==== {ml-cap} circuit breaker settings
+For certain third party service integrations, when the service returns an error
+indicating that the request input was too large, the input is truncated and the
+request is retried. These settings govern how the truncation is performed.

-The relevant circuit breaker settings can be found in the <>.
+`xpack.inference.truncator.reduction_percentage`::
+(<>) Specifies the percentage by which to reduce the input text if the third party
+service responds with an error indicating that it is too long. Defaults to 50 percent (`0.5`).
+// end::inference-input-text[]
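+
+For example, with the default of `0.5`, an input that the service rejects as too
+long is cut to roughly half its original length before the retry. A sketch of
+overriding the setting (illustrative only, assuming a node-level setting in
+`elasticsearch.yml`):
+
+```
+# Illustrative value: reduce by 40%, i.e. retry with about 60% of the original input.
+xpack.inference.truncator.reduction_percentage: 0.4
+```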