Merge branch 'main' of github.com:elastic/elasticsearch into ml-similarity
jonathan-buttner committed Mar 22, 2024
2 parents 8294ad7 + 35fcc9a commit 6429598
Showing 157 changed files with 3,116 additions and 1,427 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/gradle-wrapper-validation.yml
@@ -10,4 +10,4 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: gradle/wrapper-validation-action@v1
- uses: gradle/wrapper-validation-action@699bb18358f12c5b78b37bb0111d3a0e2276e0e2 # Release v2.1.1
5 changes: 5 additions & 0 deletions docs/changelog/105860.yaml
@@ -0,0 +1,5 @@
pr: 105860
summary: "ESQL: Re-enable logical dependency check"
area: ES|QL
type: bug
issues: []
6 changes: 6 additions & 0 deletions docs/changelog/106306.yaml
@@ -0,0 +1,6 @@
pr: 99961
summary: "added fix for inconsistent text trimming in Unified Highlighter"
area: Highlighting
type: bug
issues:
- 101803
5 changes: 5 additions & 0 deletions docs/changelog/106511.yaml
@@ -0,0 +1,5 @@
pr: 106511
summary: Wait indefinitely for http connections on shutdown by default
area: Infra/Node Lifecycle
type: bug
issues: []
6 changes: 6 additions & 0 deletions docs/changelog/106654.yaml
@@ -0,0 +1,6 @@
pr: 106654
summary: "ES|QL: Fix usage of IN operator with TEXT fields"
area: ES|QL
type: bug
issues:
- 105379
5 changes: 5 additions & 0 deletions docs/changelog/106655.yaml
@@ -0,0 +1,5 @@
pr: 106655
summary: Fix Array out of bounds exception in the XLM Roberta tokenizer
area: Machine Learning
type: bug
issues: []
72 changes: 61 additions & 11 deletions docs/internal/DistributedArchitectureGuide.md
@@ -10,20 +10,70 @@

### ActionListener

`ActionListener`s are a means of injecting logic into lower layers of the code. They encapsulate a block of code that takes a response
value (the `onResponse()` method), and then that block of code (the `ActionListener`) is passed into a function that will eventually
execute the code (call `onResponse()`) when a response value is available. `ActionListener`s are used to pass code down to act on a result,
rather than lower layers returning a result back up to be acted upon by the caller. One of three things can happen to a listener: it can be
executed in the same thread (e.g. `ActionListener.run()`); it can be passed off to another thread to be executed; or it can be added to
a list someplace, to eventually be executed by some service. `ActionListener`s also define `onFailure()` logic, in case an error is
encountered before a result can be formed.
Callbacks are used extensively throughout Elasticsearch because they enable us to write asynchronous and nonblocking code, i.e. code which
doesn't necessarily compute a result straight away but also doesn't block the calling thread waiting for the result to become available.
They support several useful control flows:

- They can be completed immediately on the calling thread.
- They can be completed concurrently on a different thread.
- They can be stored in a data structure and completed later on when the system reaches a particular state.
- Most commonly, they can be passed on to other methods that themselves require a callback.
- They can be wrapped in another callback which modifies the behaviour of the original callback, perhaps adding some extra code to run
before or after completion, before passing them on.
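
The control flows listed above can be sketched in plain Java. This is only an illustration: `Callback`, `CallbackFlows` and every method on them are hypothetical names invented for this example, standing in for a generic callback rather than reproducing any Elasticsearch class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a callback type; not Elasticsearch's ActionListener.
interface Callback<T> {
    void onResponse(T value);
    void onFailure(Exception e);
}

class CallbackFlows {
    // Callbacks parked here until the (made-up) "ready" state is reached.
    private final List<Callback<String>> waiting = new ArrayList<>();

    // Completed immediately on the calling thread.
    static void completeNow(Callback<String> callback) {
        callback.onResponse("now");
    }

    // Completed concurrently on a different thread.
    static void completeElsewhere(ExecutorService executor, Callback<String> callback) {
        executor.execute(() -> callback.onResponse("forked"));
    }

    // Stored in a data structure and completed later.
    synchronized void completeWhenReady(Callback<String> callback) {
        waiting.add(callback);
    }

    synchronized void markReady() {
        List<Callback<String>> parked = new ArrayList<>(waiting);
        waiting.clear();
        parked.forEach(callback -> callback.onResponse("ready"));
    }

    // Passed on to another method that itself requires a callback.
    static void delegate(Callback<String> callback) {
        completeNow(callback);
    }

    // Wrapped in another callback that modifies the result before passing it on.
    static Callback<String> upperCasing(Callback<String> inner) {
        return new Callback<String>() {
            @Override public void onResponse(String value) { inner.onResponse(value.toUpperCase()); }
            @Override public void onFailure(Exception e) { inner.onFailure(e); }
        };
    }

    // Runs each flow once and returns the values in completion order.
    static List<String> demo() {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        Callback<String> collector = new Callback<String>() {
            @Override public void onResponse(String value) { seen.add(value); }
            @Override public void onFailure(Exception e) { seen.add("failed: " + e.getMessage()); }
        };
        completeNow(collector);
        delegate(upperCasing(collector));
        CallbackFlows flows = new CallbackFlows();
        flows.completeWhenReady(collector);
        flows.markReady();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        completeElsewhere(executor, collector);
        executor.shutdown();
        try {
            // Wait for the forked completion so the demo is deterministic.
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that the caller of each method cannot tell which of these flows will be used, which is exactly what makes a single callback shape easy to compose.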

`ActionListener` is a general-purpose callback interface that is used extensively across the Elasticsearch codebase. `ActionListener` is
used pretty much everywhere that needs to perform some asynchronous and nonblocking computation. The uniformity makes it easier to compose
parts of the system together without needing to build adapters to convert back and forth between different kinds of callback. It also makes
it easier to develop the skills needed to read and understand all the asynchronous code, although this definitely takes practice and is
certainly not easy in an absolute sense. Finally, it has allowed us to build a rich library for working with `ActionListener` instances
themselves, creating new instances out of existing ones and completing them in interesting ways. See for instance:

- all the static methods on [ActionListener](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/ActionListener.java) itself
- [`ThreadedActionListener`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/ThreadedActionListener.java) for forking work elsewhere
- [`RefCountingListener`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/RefCountingListener.java) for running work in parallel
- [`SubscribableListener`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java) for constructing flexible workflows

Callback-based asynchronous code can easily call regular synchronous code, but synchronous code cannot run callback-based asynchronous code
without blocking the calling thread until the callback is called back. This blocking is at best undesirable (threads are too expensive to
waste with unnecessary blocking) and at worst outright broken (the blocking can lead to deadlock). Unfortunately this means that most of our
code ends up having to be written with callbacks, simply because it's ultimately calling into some other code that takes a callback. The
entry points for all Elasticsearch APIs are callback-based (e.g. REST APIs all start at
[`org.elasticsearch.rest.BaseRestHandler#prepareRequest`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/rest/BaseRestHandler.java#L158-L171),
and transport APIs all start at
[`org.elasticsearch.action.support.TransportAction#doExecute`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/TransportAction.java#L65))
and the whole system fundamentally works in terms of an event loop (a `io.netty.channel.EventLoop`) which processes network events via
callbacks.
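
To make the cost of mixing the two styles concrete, here is a minimal sketch of synchronous code calling a callback-based method. `BlockingBridge`, `fetchAsync` and `fetchBlocking` are hypothetical names for this example only, and the standard-library `Consumer` stands in for a callback.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

class BlockingBridge {
    // Hypothetical asynchronous API: invokes the callback on another thread.
    static void fetchAsync(Consumer<String> onDone) {
        new Thread(() -> onDone.accept("value")).start();
    }

    // Synchronous wrapper: the calling thread is parked until the callback fires.
    static String fetchBlocking() {
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();
        fetchAsync(value -> {
            result.set(value);
            done.countDown();
        });
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return result.get();
    }

    public static void main(String[] args) {
        System.out.println(fetchBlocking());
    }
}
```

The wrapper works here only because the callback runs on a different thread. If the callback needed the very thread that is parked in `done.await()`, the call would never return, which is the deadlock risk described above.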

`ActionListener` is not an _ad-hoc_ invention. Formally speaking, it is our implementation of the general concept of a continuation in the
sense of [_continuation-passing style_](https://en.wikipedia.org/wiki/Continuation-passing_style) (CPS): an extra argument to a function
which defines how to continue the computation when the result is available. This is in contrast to _direct style_ which is the more usual
style of calling methods that return values directly back to the caller so they can continue executing as normal. There are essentially two
ways that computation can continue in Java (it can return a value or it can throw an exception) which is why `ActionListener` has both an
`onResponse()` and an `onFailure()` method.
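
The contrast between the two styles can be shown with the same small computation written both ways. This is a sketch under stated assumptions: `Continuation` and `ParseStyles` are hypothetical names, and `Continuation` only mirrors the two-method shape described above.

```java
// Hypothetical continuation type with an onResponse/onFailure pair.
interface Continuation<T> {
    void onResponse(T value);
    void onFailure(Exception e);
}

class ParseStyles {
    // Direct style: the outcome is a return value or a thrown exception.
    static int parseDirect(String text) {
        return Integer.parseInt(text);
    }

    // Continuation-passing style: void return, both outcomes go to the continuation.
    static void parseCps(String text, Continuation<Integer> continuation) {
        int value;
        try {
            value = Integer.parseInt(text);
        } catch (NumberFormatException e) {
            continuation.onFailure(e);
            return;
        }
        // Called outside the try block so that an exception thrown by the
        // continuation itself is not mistaken for a parse failure.
        continuation.onResponse(value);
    }

    // Demo helper: reports which path the continuation took.
    static String describe(String text) {
        StringBuilder outcome = new StringBuilder();
        parseCps(text, new Continuation<Integer>() {
            @Override public void onResponse(Integer value) {
                outcome.append("response ").append(value);
            }
            @Override public void onFailure(Exception e) {
                outcome.append("failure ").append(e.getClass().getSimpleName());
            }
        });
        return outcome.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("42"));
        System.out.println(describe("forty-two"));
    }
}
```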

CPS is strictly more expressive than direct style: direct code can be mechanically translated into continuation-passing style, but CPS also
enables all sorts of other useful control structures such as forking work onto separate threads, possibly to be executed in parallel,
perhaps even across multiple nodes, or possibly collecting a list of continuations all waiting for the same condition to be satisfied before
proceeding (e.g.
[`SubscribableListener`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java)
amongst many others). Some languages have first-class support for continuations (e.g. the `async` and `await` primitives in C#) allowing the
programmer to write code in direct style away from those exotic control structures, but Java does not. That's why we have to manipulate all
the callbacks ourselves.
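
As one illustration of a control structure that direct style cannot express, the following toy class collects continuations that all wait for the same result. It is a deliberately simplified sketch and not the real `SubscribableListener`: the names `OneShotResult` and `Waiters` are hypothetical, and only the success path is modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy example: continuations subscribe to a result that is produced exactly once.
class OneShotResult<T> {
    private final List<Consumer<T>> waiters = new ArrayList<>();
    private boolean done;
    private T result;

    // Runs the waiter now if the result exists, otherwise parks it.
    void subscribe(Consumer<T> waiter) {
        synchronized (this) {
            if (!done) {
                waiters.add(waiter);
                return;
            }
        }
        waiter.accept(result);
    }

    // Records the result and releases every parked waiter; later calls are ignored.
    void complete(T value) {
        List<Consumer<T>> toNotify;
        synchronized (this) {
            if (done) {
                return;
            }
            done = true;
            result = value;
            toNotify = new ArrayList<>(waiters);
            waiters.clear();
        }
        // Waiters run outside the lock so they cannot block other subscribers.
        for (Consumer<T> waiter : toNotify) {
            waiter.accept(value);
        }
    }
}

class Waiters {
    static List<String> demo() {
        List<String> seen = new ArrayList<>();
        OneShotResult<String> state = new OneShotResult<>();
        state.subscribe(value -> seen.add("first saw " + value));
        state.subscribe(value -> seen.add("second saw " + value));
        state.complete("started");
        state.subscribe(value -> seen.add("late saw " + value));
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```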

Strictly speaking, CPS requires that a computation _only_ continues by calling the continuation. In Elasticsearch, this means that
asynchronous methods must have `void` return type and may not throw any exceptions. This is mostly the case in our code as written today,
and is a good guiding principle, but we don't enforce void exceptionless methods and there are some deviations from this rule. In
particular, it's not uncommon to permit some methods to throw an exception, using things like
[`ActionListener#run`](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/ActionListener.java#L381-L390)
(or an equivalent `try ... catch ...` block) further up the stack to handle it. Some methods also take (and may complete) an
`ActionListener` parameter, but still return a value separately for other local synchronous work.
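
The idea of catching a thrown exception further up the stack and routing it to the listener can be sketched as follows. This shows the general technique only and does not reproduce the actual `ActionListener#run` signature; `ResultListener`, `ThrowingStep` and `SafeRun` are hypothetical names.

```java
import java.io.IOException;

// Hypothetical listener type for this sketch.
interface ResultListener<T> {
    void onResponse(T value);
    void onFailure(Exception e);
}

// A step that is allowed to throw instead of calling onFailure itself.
interface ThrowingStep<T> {
    void run(ResultListener<T> listener) throws Exception;
}

class SafeRun {
    // Runs the step and converts any thrown exception into an onFailure call.
    // Caveat: a step that completes the listener and then throws would
    // complete it twice, so steps should do one or the other.
    static <T> void run(ResultListener<T> listener, ThrowingStep<T> step) {
        try {
            step.run(listener);
        } catch (Exception e) {
            listener.onFailure(e);
        }
    }

    // Demo helper: reports how the listener was completed.
    static String describe(boolean fail) {
        StringBuilder outcome = new StringBuilder();
        ResultListener<String> listener = new ResultListener<String>() {
            @Override public void onResponse(String value) {
                outcome.append("response ").append(value);
            }
            @Override public void onFailure(Exception e) {
                outcome.append("failure ").append(e.getMessage());
            }
        };
        run(listener, l -> {
            if (fail) {
                throw new IOException("disk unavailable");
            }
            l.onResponse("ok");
        });
        return outcome.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(false));
        System.out.println(describe(true));
    }
}
```

With a wrapper like this, callers still observe a single completion through the listener even though the step itself is not strictly exceptionless.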

This pattern is often used in the transport action layer with the use of the
[ChannelActionListener]([url](https://github.com/elastic/elasticsearch/blob/8.12/server/src/main/java/org/elasticsearch/action/support/ChannelActionListener.java))
[ChannelActionListener](https://github.com/elastic/elasticsearch/blob/v8.12.2/server/src/main/java/org/elasticsearch/action/support/ChannelActionListener.java)
class, which wraps a `TransportChannel` produced by the transport layer. `TransportChannel` implementations can hold a reference to a Netty
channel with which to pass the response back to the network caller. Netty has a many-to-one association of network callers to channels, so
a call taking a long time generally won't hog resources: it's cheap. A transport action can take hours to respond and that's alright,
barring caller timeouts.
channel with which to pass the response back to the network caller. Netty has a many-to-one association of network callers to channels, so a
call taking a long time generally won't hog resources: it's cheap. A transport action can take hours to respond and that's alright, barring
caller timeouts.

(TODO: add useful starter references and explanations for a range of Listener classes. Reference the Netty section.)

@@ -55,32 +55,32 @@ Contains the set of rules that are actively used for sync jobs. The `active` obj
The value to be used in conjunction with the rule for matching the contents of the document's field.
** `order` (Required, number) +
The order in which the rules are applied. The first rule to match has its policy applied.
** `created_at` (Optional, datetime) +
** `created_at` (Required, datetime) +
The timestamp when the rule was added.
** `updated_at` (Optional, datetime) +
** `updated_at` (Required, datetime) +
The timestamp when the rule was last edited.

* `advanced_snippet` (Optional, object) +
* `advanced_snippet` (Required, object) +
Used for {enterprise-search-ref}/sync-rules.html#sync-rules-advanced[advanced filtering] at query time, with the following sub-attributes:
** `value` (Required, object) +
A JSON object passed directly to the connector for advanced filtering.
** `created_at` (Optional, datetime) +
** `created_at` (Required, datetime) +
The timestamp when this JSON object was created.
** `updated_at` (Optional, datetime) +
** `updated_at` (Required, datetime) +
The timestamp when this JSON object was last edited.

* `validation` (Optional, object) +
* `validation` (Required, object) +
Provides validation status for the rules, including:
** `state` (Required, string) +
Indicates the validation state: "edited", "valid", or "invalid".
** `errors` (Optional, object) +
** `errors` (Required, object) +
Contains details about any validation errors, with sub-attributes:
*** `ids` (Required, string) +
The ID(s) of any rules deemed invalid.
*** `messages` (Required, string) +
Messages explaining what is invalid about the rules.

- `draft` (Optional, object) +
- `draft` (Required, object) +
An object identical in structure to the `active` object, but used for drafting and editing filtering rules before they become active.


21 changes: 21 additions & 0 deletions docs/reference/esql/functions/README.md
@@ -0,0 +1,21 @@
The files in these subdirectories are generated by ESQL's test suite:
* `description` - description of each function scraped from `@FunctionInfo#description`
* `examples` - examples of each function scraped from `@FunctionInfo#examples`
* `parameters` - description of each function's parameters scraped from `@Param`
* `signature` - railroad diagram of the syntax to invoke each function
* `types` - a table of each supported combination of types for each parameter. These are generated from tests.
* `layout` - a fully generated description for each function

Most functions can use the docs generated in the `layout` directory.
If we need something more custom for the function we can make a file in this
directory that can `include::` any parts of the files above.

To regenerate the files for a function run its tests using gradle:
```
./gradlew :x-pack:plugin:esql:tests -Dtests.class='*SinTests'
```

To regenerate the files for all functions run all of ESQL's tests using gradle:
```
./gradlew :x-pack:plugin:esql:tests
```
2 changes: 1 addition & 1 deletion docs/reference/modules/discovery/discovery.asciidoc
@@ -115,7 +115,7 @@ supplied in `unicast_hosts.txt`.

The `unicast_hosts.txt` file contains one node entry per line. Each node entry
consists of the host (host name or IP address) and an optional transport port
number. If the port number is specified, is must come immediately after the
number. If the port number is specified, it must come immediately after the
host (on the same line) separated by a `:`. If the port number is not
specified, {es} will implicitly use the first port in the port range given by
`transport.profiles.default.port`, or by `transport.port` if
41 changes: 22 additions & 19 deletions docs/reference/modules/node.asciidoc
@@ -68,8 +68,8 @@ A node that has the `master` role, which makes it eligible to be

<<data-node,Data node>>::

A node that has the `data` role. Data nodes hold data and perform data
related operations such as CRUD, search, and aggregations. A node with the `data` role can fill any of the specialised data node roles.
A node that has one of several data roles. Data nodes hold data and perform data
related operations such as CRUD, search, and aggregations. A node with a generic `data` role can fill any of the specialized data node roles.

<<node-ingest-node,Ingest node>>::

@@ -220,7 +220,7 @@ therefore ensure that the storage and networking available to the nodes in your
cluster are good enough to meet your performance goals.

[[data-node]]
==== Data node
==== Data nodes

Data nodes hold the shards that contain the documents you have indexed. Data
nodes handle data related operations like CRUD, search, and aggregations.
@@ -230,20 +230,27 @@ monitor these resources and to add more data nodes if they are overloaded.
The main benefit of having dedicated data nodes is the separation of the master
and data roles.

To create a dedicated data node, set:
In a multi-tier deployment architecture, you use specialized data roles to
assign data nodes to specific tiers: `data_content`, `data_hot`, `data_warm`,
`data_cold`, or `data_frozen`. A node can belong to multiple tiers.

If you want to include a node in all tiers, or if your cluster does not use multiple tiers, then you can use the generic `data` role.

WARNING: If you assign a node to a specific tier using a specialized data role, then you shouldn't also assign it the generic `data` role. The generic `data` role takes precedence over specialized data roles.

[[generic-data-node]]
===== Generic data node

Generic data nodes are included in all content tiers.

To create a dedicated generic data node, set:
[source,yaml]
----
node.roles: [ data ]
----

In a multi-tier deployment architecture, you use specialized data roles to
assign data nodes to specific tiers: `data_content`,`data_hot`, `data_warm`,
`data_cold`, or `data_frozen`. A node can belong to multiple tiers, but a node
that has one of the specialized data roles cannot have the generic `data` role.

[role="xpack"]
[[data-content-node]]
==== Content data node
===== Content data node

Content data nodes are part of the content tier.
include::{es-repo-dir}/datatiers.asciidoc[tag=content-tier]
@@ -254,9 +261,8 @@ To create a dedicated content node, set:
node.roles: [ data_content ]
----

[role="xpack"]
[[data-hot-node]]
==== Hot data node
===== Hot data node

Hot data nodes are part of the hot tier.
include::{es-repo-dir}/datatiers.asciidoc[tag=hot-tier]
@@ -267,9 +273,8 @@ To create a dedicated hot node, set:
node.roles: [ data_hot ]
----

[role="xpack"]
[[data-warm-node]]
==== Warm data node
===== Warm data node

Warm data nodes are part of the warm tier.
include::{es-repo-dir}/datatiers.asciidoc[tag=warm-tier]
@@ -280,9 +285,8 @@ To create a dedicated warm node, set:
node.roles: [ data_warm ]
----

[role="xpack"]
[[data-cold-node]]
==== Cold data node
===== Cold data node

Cold data nodes are part of the cold tier.
include::{es-repo-dir}/datatiers.asciidoc[tag=cold-tier]
@@ -293,9 +297,8 @@ To create a dedicated cold node, set:
node.roles: [ data_cold ]
----

[role="xpack"]
[[data-frozen-node]]
==== Frozen data node
===== Frozen data node

Frozen data nodes are part of the frozen tier.
include::{es-repo-dir}/datatiers.asciidoc[tag=frozen-tier]