Add reroute processor #76511

Merged · 40 commits · Apr 18, 2023

Conversation

@felixbarny (Member) commented Aug 13, 2021

Docs preview

The reroute processor allows routing a document to another target index or data stream. It has two main modes:

When setting the destination option, the target is explicitly specified and the dataset and namespace options can’t be set.

When the destination option is not set, the processor operates in data stream mode. Note that in this mode, the reroute processor can only be used on data streams that follow the data stream naming scheme. Trying to use this processor on a data stream with a non-compliant name will raise an exception.

The name of a data stream consists of three parts: <type>-<dataset>-<namespace>. See the data stream naming scheme documentation for more details.

This processor can use either static values or field references from the document to determine the dataset and namespace components of the new target. See the Reroute options table below for more details.

It’s not possible to change the type of the data stream with the reroute processor.
After a reroute processor has been executed, all the other processors of the current pipeline are skipped, including the final pipeline. If the current pipeline is executed in the context of a pipeline processor, the calling pipeline will be skipped, too. This means that at most one reroute processor is ever executed within a pipeline, which allows you to define mutually exclusive routing conditions, similar to an if, else if, else if, … construct.
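As an illustration, here is a hypothetical pipeline (not taken from the PR docs) that uses this behavior to route nginx and apache logs to separate datasets; at most one of the two processors ever executes:

{
  "processors": [
    {
      "reroute": {
        "if": "ctx?.log?.file?.path?.contains('nginx')",
        "dataset": "nginx"
      }
    },
    {
      "reroute": {
        "if": "ctx?.log?.file?.path?.contains('apache')",
        "dataset": "apache"
      }
    }
  ]
}

If the first condition matches, the document is rerouted immediately and the second processor is skipped; otherwise the second condition is evaluated.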

The reroute processor ensures that the data_stream.<type|dataset|namespace> fields are set according to the new target. If the document contains an event.dataset value, it will be updated to reflect the same value as data_stream.dataset.

Note that the client needs permissions for the final target. Otherwise, the document is rejected with a security exception that looks like this:

{"type":"security_exception","reason":"action [indices:admin/auto_create] is unauthorized for API key id [8-dt9H8BqGblnY2uSI--] of user [elastic/fleet-server] on indices [logs-foo-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]"}

Reroute options

Name | Required | Default | Description
destination | no | - | A static value for the target. Can't be set when the dataset or namespace option is set.
dataset | no | {{data_stream.dataset}} | Field references or a static value for the dataset part of the data stream name. In addition to the criteria for <<indices-create-api-path-params, index names>>, cannot contain - and must be no longer than 100 characters. Example values are nginx.access and nginx.error. Supports field references with a mustache-like syntax (denoted as {{double}} or {{{triple}}} curly braces). When resolving field references, the processor replaces invalid characters with _. Uses the <dataset> part of the index name as a fallback if all field references resolve to a null, missing, or non-string value.
namespace | no | {{data_stream.namespace}} | Field references or a static value for the namespace part of the data stream name. See the criteria for <<indices-create-api-path-params, index names>> for allowed characters. Must be no longer than 100 characters. Supports field references with a mustache-like syntax (denoted as {{double}} or {{{triple}}} curly braces). When resolving field references, the processor replaces invalid characters with _. Uses the <namespace> part of the index name as a fallback if all field references resolve to a null, missing, or non-string value.
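For example, a minimal hypothetical configuration of the explicit-target mode described above, which sends matching documents to a fixed data stream:

{
  "reroute": {
    "destination": "logs-generic-default"
  }
}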

The if option can be used to define the condition under which the document should be rerouted to a new target.

{
  "reroute": {
    "tag": "nginx",
    "if" : "ctx?.log?.file?.path?.contains('nginx')",
    "dataset": "nginx"
  }
}

The dataset and namespace options can contain either a single value or a list of values that are used as fallbacks.
If a field reference evaluates to null or is not present in the document, the next value or field reference in the list is used.
If a field reference evaluates to a non-string value, the processor fails.

In the following example, the processor would first try to resolve the value for the service.name field to determine the value for dataset.
If that field resolves to null, is missing, or is a non-string value, it would try the next element in the list.
In this case, this is the static value "generic".
The namespace option is configured with just a single static value.

{
  "reroute": {
    "dataset": [
        "{{service.name}}",
        "generic"
    ],
    "namespace": "default"
  }
}
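A hypothetical way to try this out is the simulate pipeline API (the request shape is standard Elasticsearch; the document values are illustrative, and this assumes the simulate response reflects the rewritten target index):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "reroute": {
          "dataset": [
            "{{service.name}}",
            "generic"
          ],
          "namespace": "default"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "logs-generic-default",
      "_source": {
        "service": {
          "name": "nginx"
        }
      }
    }
  ]
}

Here service.name resolves to nginx, so the document would be routed to logs-nginx-default; without a service.name field, the static fallback generic keeps it in logs-generic-default.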

Depends on

// Default namespace used when none can be determined from the document.
private static final String DATA_STREAM_DEFAULT_NAMESPACE = "default";
// Name of the field that is kept consistent with data_stream.dataset.
private static final String EVENT_DATASET = "event.dataset";

// Characters that are invalid in the dataset part of a data stream name;
// the processor replaces them with '_' when sanitizing resolved values.
private static final char[] DISALLOWED_IN_DATASET = new char[] { '\\', '/', '*', '?', '\"', '<', '>', '|', ' ', ',', '#', ':', '-' };
Member:

Can you make sure this is aligned with what we have in https://github.com/elastic/package-spec for validation? @mtojek will likely know where to point you to. Same for namespace.

@felixbarny (Member Author):

I don't think the package spec currently defines the allowed characters for data_stream.* fields. See also the spec for dataset. I've just tried using invalid characters in the manifest of an integration, such as upper-case chars and -. Neither elastic-package lint nor elastic-package build yielded an error.

I took the validation rules from https://github.com/elastic/ecs/blob/main/rfcs/text/0009-data_stream-fields.md#restrictions-on-values and https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-api-path-params.

Member:

@mtojek @jsoriano Can you chime in here? I would have expected that we do some validation.

Member:

We have had this issue open for a long time (elastic/package-spec#57); I will make sure it ends up in our backlog.

@felixbarny (Member Author):

Discussing some open questions about the processor's options:

Only set event.dataset if it is already set on the document to ensure that it's the exact same value as data_stream.dataset?

This has the implication that ECS loggers need to set event.dataset in addition to, or instead of, data_stream.dataset so that the logs rate component in the Logs UI works properly. But to be compatible with the widest range of stack versions and UI features that rely on event.dataset, ECS loggers will need to set both anyway. See also elastic/ecs-logging#38.

Overall, a net-positive change in my book to not overly promote event.dataset. Implemented in c30141d.

Don't treat event.dataset as a fallback for routing if data_stream.dataset is empty?

This doesn't need to be part of this processor. If this makes sense for specific use cases (I'm thinking about ECS logs), we can just add a processor like this:

- set:
    field: data_stream.dataset
    copy_from: event.dataset
    # only copy when event.dataset is a string longer than one character
    if: ctx.event?.dataset instanceof String && ctx.event.dataset.length() > 1
    # never overwrite an existing data_stream.dataset value
    override: false

If we can rely on the fact that shippers always populate data_stream.* fields when sending events to a routing pipeline, we could just ignore event.dataset and simplify the logic.

There's a special case for ECS loggers where the shipper would send the data to a generic routing pipeline but

Rather than having the processor natively fall back to event.dataset, users could just use a set processor if they want the fallback.

However, this has the implication that if event.dataset: foo is set on the document, and data_stream.dataset is unset, the doc will be routed to the generic dataset and event.dataset will be overridden to generic. That's to comply with the spec that event.dataset and data_stream.dataset should be the same.

This is quite a contrived case, however, as I can't think of a valid case where event.dataset is set and data_stream.dataset is missing. That's even true for ECS loggers. While they might set only the former, the log message will end up as a JSON-encoded string in the message field. We can add a fallback in the specific ECS JSON log pipeline.

Should we sanitize data_stream.* fields that contain invalid characters?

I think we should. This makes it much easier to route events to a data stream with dynamically computed values, for example using azure.springcloudlogs.app_name as the dataset. There are no restrictions on the allowed characters for this field. Still, it makes for a good dataset. Rather than each processor having to sanitize dynamic values, it seems better to rely on the data_stream_router processor for sanitization.
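To illustrate (a hypothetical configuration, using the processor name as of this point in the discussion): if azure.springcloudlogs.app_name resolves to a value like my-app name, the disallowed - and space characters would be replaced with _, routing the document to a dataset like my_app_name.

{
  "data_stream_router": {
    "dataset": "{{azure.springcloudlogs.app_name}}"
  }
}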

Should we parse _index if data_stream.* fields are missing?

While our integrations set data_stream.* fields, there's no general guarantee that these are available. Especially when events are coming from shippers we don't control, these might be missing.

It seems sensible for this processor to guarantee that the index name is always consistent with the data_stream.* fields. It could do so by falling back to parsing _index rather than always using dataset: generic and namespace: default as a fallback.

But I'd say this is an optional enhancement of the processor that's not necessarily in scope of the first iteration.

@felixbarny (Member Author) commented Apr 14, 2022

> Should we parse _index if data_stream.* fields are missing?

Implemented in 41c23cc.

Implications:

  • Processor can only be used on data streams matching <type>-<dataset>-<namespace>
  • Ensures data_stream.* fields are in sync with _index, even if the source doc didn't contain data_stream.* fields
  • Changing the type is disallowed; it is always inferred from _index and is not configurable on the processor
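As a hypothetical illustration of this parsing fallback (index name and document are made up): given a document like the following, whose source contains no data_stream.* fields, the processor would derive type logs, dataset nginx, and namespace production from _index.

{
  "_index": "logs-nginx-production",
  "_source": {
    "message": "GET /index.html HTTP/1.1 200"
  }
}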

@felixbarny felixbarny force-pushed the data_stream_router_processor branch from d7778a9 to 2d80949 on April 15, 2022 at 15:57
@felixbarny felixbarny marked this pull request as ready for review April 15, 2022 15:59
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Apr 15, 2022
@elasticmachine (Collaborator):

Pinging @elastic/es-data-management (Team:Data Management)

@ruflin (Member) commented Apr 25, 2022

@felixbarny My general comment on #76511 (comment) is: let's keep the first iteration as simple as possible. It is always possible to add more features.

  • Let's ignore event.dataset; users can use a set processor
  • Sanitise: can we skip it at first or make it a config option that is off by default? I agree that we will need it eventually.
  • Parse _index: would be great, but could be done as an enhancement

@felixbarny (Member Author):

> Let's ignore event.dataset; users can use a set processor

Currently, the processor only makes sure that if the source event contains event.dataset it doesn't conflict with data_stream.dataset. But I agree that this could also be done with a set processor. However, the risk of that is that event.dataset and data_stream.dataset may be inconsistent.

> Sanitise: can we skip it at first or make it a config option that is off by default? I agree that we will need it eventually.

What are the cases where you'd want to have sanitization off? Could we make sanitization on by default (which is what the PR currently does) and later add a switch to disable it?

> Parse _index: would be great, but could be done as an enhancement

I've added that already. At the same time, I've removed the ability to set data_stream.type in the processor, as it will always use the type derived from the index name. Hence, I think this should go in with the first version, as it wouldn't be possible to add it later without breaking changes.

@ruflin (Member) commented Apr 25, 2022

> What are the cases where you'd want to have sanitization off? Could we make sanitization on by default (which is what the PR currently does) and later add a switch to disable it?

If it does the sanitization, will it also clean up the document itself, meaning modify the data_stream.dataset value? My suggestion to have it off by default is mostly my instinct to disable magic by default, which forces users to think about why they do not match the default. But I see your scenario that users just want to ingest logs and should not be exposed to this detail. I'm on board with turning it on by default, which could also mean not having the config at first ;-)

@felixbarny felixbarny self-assigned this Apr 3, 2023
@ruflin (Member) left a review comment:

I left some comments for the docs. But this should not hold this PR back. We can always follow up with more detailed docs.

What if a user uses type: foo and then the data stream values? It seems we don't validate the type anywhere? It should only be logs, metrics, traces, or synthetics.

On docs/reference/ingest/processors/reroute.asciidoc (outdated, resolved):
Supports field references with a mustache-like syntax (denoted as `{{double}}` or `{{{triple}}}` curly braces). When resolving field references, the processor replaces invalid characters with `_`. Uses the `<dataset>` part of the index name as a fallback if all field references resolve to a `null`, missing, or non-string value.
| `namespace` | no | `{{data_stream.namespace}}` a| Field references or a static value for the namespace part of the data stream name. See the criteria for <<indices-create-api-path-params, index names>> for allowed characters. Must be no longer than 100 characters.

Supports field references with a mustache-like syntax (denoted as `{{double}}` or `{{{triple}}}` curly braces). When resolving field references, the processor replaces invalid characters with `_`. Uses the `<namespace>` part of the index name as a fallback if all field references resolve to a `null`, missing, or non-string value.
Member:

Maybe use a * and write it below just once? There you could also add the invalid chars list.

{
  "reroute": {
    "tag": "nginx",
    "if" : "ctx?.log?.file?.path?.contains('nginx')",
Member:

We need to describe a little bit more how this if condition works: why ctx is there and all the ?. operators. If you are not used to writing ingest pipeline conditions, this might not be obvious. I would add a very quick description here and then link to the ES docs.

@felixbarny (Member Author):

The docs contain a description of the if option and a link to the corresponding docs. Isn't that enough?

Member:

This is what we do in most places of our docs, and I really don't like it. The link is important, but for a user to be successful, they should not have to jump through 3 doc pages to get a single task completed. Instead, I would rather repeat some of the docs so that the basic use case is all covered in one flow.

@joegallo (Contributor):

Adding a comment for the record: I'm 99% of the way to +1 on the code. The only item outstanding for me is the discussion about leniency at #76511 (comment).

@joegallo (Contributor) left a review comment:

🚀

Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >enhancement, external-contributor (Pull request authored by a developer outside the Elasticsearch team), Team:Data Management (Meta label for data/management team), v8.8.0
Development

Successfully merging this pull request may close these issues.

[Feature] Document based ingest routing