Merge branch 'main' of github.com:elastic/elasticsearch into ml-simil…

…arity
jonathan-buttner · Mar 22, 2024 · commit 6429598
2 parents 8294ad7 + 35fcc9a
Showing 157 changed files with 3,116 additions and 1,427 deletions.
diff --git a/.github/workflows/gradle-wrapper-validation.yml b/.github/workflows/gradle-wrapper-validation.yml
@@ -10,4 +10,4 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
-      - uses: gradle/wrapper-validation-action@v1
+      - uses: gradle/wrapper-validation-action@699bb18358f12c5b78b37bb0111d3a0e2276e0e2 # Release v2.1.1
diff --git a/docs/changelog/105860.yaml b/docs/changelog/105860.yaml
@@ -0,0 +1,5 @@
+pr: 105860
+summary: "ESQL: Re-enable logical dependency check"
+area: ES|QL
+type: bug
+issues: []
diff --git a/docs/changelog/106306.yaml b/docs/changelog/106306.yaml
@@ -0,0 +1,6 @@
+pr: 99961
+summary: "added fix for inconsistent text trimming in Unified Highlighter"
+area: Highlighting
+type: bug
+issues:
+ - 101803
diff --git a/docs/changelog/106511.yaml b/docs/changelog/106511.yaml
@@ -0,0 +1,5 @@
+pr: 106511
+summary: Wait indefinitely for http connections on shutdown by default
+area: Infra/Node Lifecycle
+type: bug
+issues: []
diff --git a/docs/changelog/106654.yaml b/docs/changelog/106654.yaml
@@ -0,0 +1,6 @@
+pr: 106654
+summary: "ES|QL: Fix usage of IN operator with TEXT fields"
+area: ES|QL
+type: bug
+issues:
+ - 105379
diff --git a/docs/changelog/106655.yaml b/docs/changelog/106655.yaml
@@ -0,0 +1,5 @@
+pr: 106655
+summary: Fix Array out of bounds exception in the XLM Roberta tokenizer
+area: Machine Learning
+type: bug
+issues: []
diff --git a/docs/internal/DistributedArchitectureGuide.md b/docs/internal/DistributedArchitectureGuide.md
@@ -10,20 +10,70 @@
 
 ### ActionListener
 
-`ActionListener`s are a means off injecting logic into lower layers of the code. They encapsulate a block of code that takes a response
-value -- the `onResponse()` method --, and then that block of code (the `ActionListener`) is passed into a function that will eventually
-execute the code (call `onResponse()`) when a response value is available. `ActionListener`s are used to pass code down to act on a result,
-rather than lower layers returning a result back up to be acted upon by the caller. One of three things can happen to a listener: it can be
-executed in the same thread — e.g. `ActionListener.run()` --; it can be passed off to another thread to be executed; or it can be added to
-a list someplace, to eventually be executed by some service. `ActionListener`s also define `onFailure()` logic, in case an error is
-encountered before a result can be formed.
+Callbacks are used extensively throughout Elasticsearch because they enable us to write asynchronous and nonblocking code, i.e. code which
+doesn't necessarily compute a result straight away but also doesn't block the calling thread waiting for the result to become available.
+They support several useful control flows:
+
+- They can be completed immediately on the calling thread.
+- They can be completed concurrently on a different thread.
+- They can be stored in a data structure and completed later on when the system reaches a particular state.
+- Most commonly, they can be passed on to other methods that themselves require a callback.
+- They can be wrapped in another callback which modifies the behaviour of the original callback, perhaps adding some extra code to run
+  before or after completion, before passing them on.
+
+`ActionListener` is a general-purpose callback interface that is used extensively across the Elasticsearch codebase. `ActionListener` is
+used pretty much everywhere that needs to perform some asynchronous and nonblocking computation. The uniformity makes it easier to compose
+parts of the system together without needing to build adapters to convert back and forth between different kinds of callback. It also makes
+it easier to develop the skills needed to read and understand all the asynchronous code, although this definitely takes practice and is
+certainly not easy in an absolute sense. Finally, it has allowed us to build a rich library for working with `ActionListener` instances
+themselves, creating new instances out of existing ones and completing them in interesting ways. See for instance:
+
+- all the static methods on [ActionListener](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/ActionListener.java) itself
+- [`ThreadedActionListener`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/ThreadedActionListener.java) for forking work elsewhere
+- [`RefCountingListener`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/RefCountingListener.java) for running work in parallel
+- [`SubscribableListener`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java) for constructing flexible workflows
+
+Callback-based asynchronous code can easily call regular synchronous code, but synchronous code cannot run callback-based asynchronous code
+without blocking the calling thread until the callback is called back. This blocking is at best undesirable (threads are too expensive to
+waste with unnecessary blocking) and at worst outright broken (the blocking can lead to deadlock). Unfortunately this means that most of our
+code ends up having to be written with callbacks, simply because it's ultimately calling into some other code that takes a callback. The
+entry points for all Elasticsearch APIs are callback-based (e.g. REST APIs all start at
+[`org.elasticsearch.rest.BaseRestHandler#prepareRequest`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/rest/BaseRestHandler.java#L158-L171),
+and transport APIs all start at
+[`org.elasticsearch.action.support.TransportAction#doExecute`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/TransportAction.java#L65))
+and the whole system fundamentally works in terms of an event loop (a `io.netty.channel.EventLoop`) which processes network events via
+callbacks.
+
+`ActionListener` is not an _ad-hoc_ invention. Formally speaking, it is our implementation of the general concept of a continuation in the
+sense of [_continuation-passing style_](https://en.wikipedia.org/wiki/Continuation-passing_style) (CPS): an extra argument to a function
+which defines how to continue the computation when the result is available. This is in contrast to _direct style_ which is the more usual
+style of calling methods that return values directly back to the caller so they can continue executing as normal. There are essentially two
+ways that computation can continue in Java (it can return a value or it can throw an exception) which is why `ActionListener` has both an
+`onResponse()` and an `onFailure()` method.
+
+CPS is strictly more expressive than direct style: direct code can be mechanically translated into continuation-passing style, but CPS also
+enables all sorts of other useful control structures such as forking work onto separate threads, possibly to be executed in parallel,
+perhaps even across multiple nodes, or possibly collecting a list of continuations all waiting for the same condition to be satisfied before
+proceeding (e.g.
+[`SubscribableListener`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java)
+amongst many others). Some languages have first-class support for continuations (e.g. the `async` and `await` primitives in C#) allowing the
+programmer to write code in direct style away from those exotic control structures, but Java does not. That's why we have to manipulate all
+the callbacks ourselves.
+
+Strictly speaking, CPS requires that a computation _only_ continues by calling the continuation. In Elasticsearch, this means that
+asynchronous methods must have `void` return type and may not throw any exceptions. This is mostly the case in our code as written today,
+and is a good guiding principle, but we don't enforce void exceptionless methods and there are some deviations from this rule. In
+particular, it's not uncommon to permit some methods to throw an exception, using things like
+[`ActionListener#run`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/ActionListener.java#L381-L390)
+(or an equivalent `try ... catch ...` block) further up the stack to handle it. Some methods also take (and may complete) an
+`ActionListener` parameter, but still return a value separately for other local synchronous work.
 
 This pattern is often used in the transport action layer with the use of the
-[ChannelActionListener]([url](https://github.com/elastic/elasticsearch/blob/8.12/server/src/main/java/org/elasticsearch/action/support/ChannelActionListener.java))
+[ChannelActionListener](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/ChannelActionListener.java)
 class, which wraps a `TransportChannel` produced by the transport layer. `TransportChannel` implementations can hold a reference to a Netty
-channel with which to pass the response back to the network caller. Netty has a many-to-one association of network callers to channels, so
-a call taking a long time generally won't hog resources: it's cheap. A transport action can take hours to respond and that's alright,
-barring caller timeouts.
+channel with which to pass the response back to the network caller. Netty has a many-to-one association of network callers to channels, so a
+call taking a long time generally won't hog resources: it's cheap. A transport action can take hours to respond and that's alright, barring
+caller timeouts.
 
 (TODO: add useful starter references and explanations for a range of Listener classes. Reference the Netty section.)
 

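Reviewer note on the `ActionListener` hunk above: the contrast it draws between direct style and continuation-passing style can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not Elasticsearch code: `Listener`, `ListenerSketch`, `parseAsync`, `map`, and `run` are hypothetical names invented here, and the real `ActionListener` offers a much richer set of helpers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical two-method callback; a stand-in for illustration, not the real ActionListener. */
interface Listener<T> {
    void onResponse(T value);

    void onFailure(Exception e);
}

final class ListenerSketch {

    /** Direct style: the result (or an exception) goes straight back to the caller. */
    static int parseDirect(String s) {
        return Integer.parseInt(s);
    }

    /** Continuation-passing style: void return, and the outcome goes to the listener. */
    static void parseAsync(String s, Listener<Integer> listener) {
        int value;
        try {
            value = parseDirect(s);
        } catch (Exception e) {
            listener.onFailure(e);
            return;
        }
        // Completed outside the try block so that an exception thrown by the
        // listener itself is not fed back into onFailure (no double completion).
        listener.onResponse(value);
    }

    /** Wraps a listener so that the response is transformed before being passed on. */
    static <T, R> Listener<T> map(Listener<R> delegate, Function<T, R> fn) {
        return new Listener<T>() {
            @Override
            public void onResponse(T value) {
                R mapped;
                try {
                    mapped = fn.apply(value);
                } catch (Exception e) {
                    delegate.onFailure(e);
                    return;
                }
                delegate.onResponse(mapped);
            }

            @Override
            public void onFailure(Exception e) {
                delegate.onFailure(e);
            }
        };
    }

    /** Runs the pipeline on the calling thread and records which continuation was taken. */
    static List<String> run(String input) {
        List<String> events = new ArrayList<>();
        Listener<String> terminal = new Listener<String>() {
            @Override
            public void onResponse(String value) {
                events.add("response:" + value);
            }

            @Override
            public void onFailure(Exception e) {
                events.add("failure:" + e.getClass().getSimpleName());
            }
        };
        parseAsync(input, map(terminal, (Integer n) -> "doubled=" + (n * 2)));
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run("21"));   // [response:doubled=42]
        System.out.println(run("oops")); // [failure:NumberFormatException]
    }
}
```

Here everything completes immediately on the calling thread, which is only the first of the control flows the new text lists. The same `Listener` could equally be handed to another thread or stored and completed later without changing `parseAsync`'s signature, which is the uniformity argument the guide makes.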
diff --git a/docs/reference/connector/apis/update-connector-filtering-api.asciidoc b/docs/reference/connector/apis/update-connector-filtering-api.asciidoc
@@ -55,32 +55,32 @@ Contains the set of rules that are actively used for sync jobs. The `active` obj
     The value to be used in conjunction with the rule for matching the contents of the document's field.
     ** `order` (Required, number) +
     The order in which the rules are applied. The first rule to match has its policy applied.
-    ** `created_at` (Optional, datetime) +
+    ** `created_at` (Required, datetime) +
     The timestamp when the rule was added.
-    ** `updated_at` (Optional, datetime) +
+    ** `updated_at` (Required, datetime) +
     The timestamp when the rule was last edited.
 
-  * `advanced_snippet` (Optional, object) +
+  * `advanced_snippet` (Required, object) +
   Used for {enterprise-search-ref}/sync-rules.html#sync-rules-advanced[advanced filtering] at query time, with the following sub-attributes:
     ** `value` (Required, object) +
     A JSON object passed directly to the connector for advanced filtering.
-    ** `created_at` (Optional, datetime) +
+    ** `created_at` (Required, datetime) +
     The timestamp when this JSON object was created.
-    ** `updated_at` (Optional, datetime) +
+    ** `updated_at` (Required, datetime) +
     The timestamp when this JSON object was last edited.
 
-  * `validation` (Optional, object) +
+  * `validation` (Required, object) +
   Provides validation status for the rules, including:
     ** `state` (Required, string) +
     Indicates the validation state: "edited", "valid", or "invalid".
-    ** `errors` (Optional, object) +
+    ** `errors` (Required, object) +
     Contains details about any validation errors, with sub-attributes:
       *** `ids` (Required, string) +
       The ID(s) of any rules deemed invalid.
       *** `messages` (Required, string) +
       Messages explaining what is invalid about the rules.
 
-- `draft` (Optional, object) +
+- `draft` (Required, object) +
 An object identical in structure to the `active` object, but used for drafting and editing filtering rules before they become active.
 
 

diff --git a/docs/reference/esql/functions/README.md b/docs/reference/esql/functions/README.md
@@ -0,0 +1,21 @@
+The files in these subdirectories are generated by ESQL's test suite:
+* `description` - description of each function scraped from `@FunctionInfo#description`
+* `examples` - examples of each function scraped from `@FunctionInfo#examples`
+* `parameters` - description of each function's parameters scraped from `@Param`
+* `signature` - railroad diagram of the syntax to invoke each function
+* `types` - a table of each combination of support type for each parameter. These are generated from tests.
+* `layout` - a fully generated description for each function
+
+Most functions can use the docs generated in the `layout` directory.
+If we need something more custom for the function we can make a file in this
+directory that can `include::` any parts of the files above.
+
+To regenerate the files for a function run its tests using gradle:
+```
+./gradlew :x-pack:plugin:esql:tests -Dtests.class='*SinTests'
+```
+
+To regenerate the files for all functions run all of ESQL's tests using gradle:
+```
+./gradlew :x-pack:plugin:esql:tests
+```
diff --git a/docs/reference/modules/discovery/discovery.asciidoc b/docs/reference/modules/discovery/discovery.asciidoc
@@ -115,7 +115,7 @@ supplied in `unicast_hosts.txt`.
 
 The `unicast_hosts.txt` file contains one node entry per line. Each node entry
 consists of the host (host name or IP address) and an optional transport port
-number. If the port number is specified, is must come immediately after the
+number. If the port number is specified, it must come immediately after the
 host (on the same line) separated by a `:`. If the port number is not
 specified, {es} will implicitly use the first port in the port range given by
 `transport.profiles.default.port`, or by `transport.port` if

diff --git a/docs/reference/modules/node.asciidoc b/docs/reference/modules/node.asciidoc
@@ -68,8 +68,8 @@ A node that has the `master` role, which makes it eligible to be
 
 <<data-node,Data node>>::
 
-A node that has the `data` role. Data nodes hold data and perform data
-related operations such as CRUD, search, and aggregations. A node with the `data` role can fill any of the specialised data node roles.
+A node that has one of several data roles. Data nodes hold data and perform data
+related operations such as CRUD, search, and aggregations. A node with a generic `data` role can fill any of the specialized data node roles.
 
 <<node-ingest-node,Ingest node>>::
 
@@ -220,7 +220,7 @@ therefore ensure that the storage and networking available to the nodes in your
 cluster are good enough to meet your performance goals.
 
 [[data-node]]
-==== Data node
+==== Data nodes
 
 Data nodes hold the shards that contain the documents you have indexed. Data
 nodes handle data related operations like CRUD, search, and aggregations.
@@ -230,20 +230,27 @@ monitor these resources and to add more data nodes if they are overloaded.
 The main benefit of having dedicated data nodes is the separation of the master
 and data roles.
 
-To create a dedicated data node, set:
+In a multi-tier deployment architecture, you use specialized data roles to
+assign data nodes to specific tiers: `data_content`, `data_hot`, `data_warm`,
+`data_cold`, or `data_frozen`. A node can belong to multiple tiers. 
+
+If you want to include a node in all tiers, or if your cluster does not use multiple tiers, then you can use the generic `data` role.
+
+WARNING: If you assign a node to a specific tier using a specialized data role, then you shouldn't also assign it the generic `data` role. The generic `data` role takes precedence over specialized data roles.
+
+[[generic-data-node]]
+===== Generic data node
+
+Generic data nodes are included in all content tiers. 
+
+To create a dedicated generic data node, set:
 [source,yaml]
 ----
 node.roles: [ data ]
 ----
 
-In a multi-tier deployment architecture, you use specialized data roles to
-assign data nodes to specific tiers: `data_content`,`data_hot`, `data_warm`,
-`data_cold`, or `data_frozen`. A node can belong to multiple tiers, but a node
-that has one of the specialized data roles cannot have the generic `data` role.
-
-[role="xpack"]
 [[data-content-node]]
-==== Content data node
+===== Content data node
 
 Content data nodes are part of the content tier.
 include::{es-repo-dir}/datatiers.asciidoc[tag=content-tier]
@@ -254,9 +261,8 @@ To create a dedicated content node, set:
 node.roles: [ data_content ]
 ----
 
-[role="xpack"]
 [[data-hot-node]]
-==== Hot data node
+===== Hot data node
 
 Hot data nodes are part of the hot tier.
 include::{es-repo-dir}/datatiers.asciidoc[tag=hot-tier]
@@ -267,9 +273,8 @@ To create a dedicated hot node, set:
 node.roles: [ data_hot ]
 ----
 
-[role="xpack"]
 [[data-warm-node]]
-==== Warm data node
+===== Warm data node
 
 Warm data nodes are part of the warm tier.
 include::{es-repo-dir}/datatiers.asciidoc[tag=warm-tier]
@@ -280,9 +285,8 @@ To create a dedicated warm node, set:
 node.roles: [ data_warm ]
 ----
 
-[role="xpack"]
 [[data-cold-node]]
-==== Cold data node
+===== Cold data node
 
 Cold data nodes are part of the cold tier.
 include::{es-repo-dir}/datatiers.asciidoc[tag=cold-tier]
@@ -293,9 +297,8 @@ To create a dedicated cold node, set:
 node.roles: [ data_cold ]
 ----
 
-[role="xpack"]
 [[data-frozen-node]]
-==== Frozen data node
+===== Frozen data node
 
 Frozen data nodes are part of the frozen tier.
 include::{es-repo-dir}/datatiers.asciidoc[tag=frozen-tier]