Add reroute processor #76511

Merged · 40 commits · Apr 18, 2023

Conversation

@felixbarny (Member) commented Aug 13, 2021

Docs preview

The reroute processor allows routing a document to another target index or data stream. It has two main modes:

When setting the destination option, the target is explicitly specified and the dataset and namespace options can’t be set.

When the destination option is not set, the processor operates in data stream mode. Note that in this mode, the reroute processor can only be used on data streams that follow the data stream naming scheme. Trying to use this processor on a data stream with a non-compliant name will raise an exception.

The name of a data stream consists of three parts: <type>-<dataset>-<namespace>. See the data stream naming scheme documentation for more details.

This processor can use either static values or field references from the document to determine the dataset and namespace components of the new target. See the Reroute options table below for more details.

It’s not possible to change the type of the data stream with the reroute processor.
After a reroute processor has been executed, all the other processors of the current pipeline are skipped, including the final pipeline. If the current pipeline is executed in the context of a pipeline processor, the calling pipeline will be skipped, too. This means that at most one reroute processor is ever executed within a pipeline, which allows you to define mutually exclusive routing conditions, similar to an if, else if, else if, … construct.
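As an illustration, here is a hypothetical pipeline (not taken from the PR docs) that uses this behavior to route nginx and apache logs to separate datasets; at most one of the two processors ever executes:

{
  "processors": [
    {
      "reroute": {
        "if": "ctx?.log?.file?.path?.contains('nginx')",
        "dataset": "nginx"
      }
    },
    {
      "reroute": {
        "if": "ctx?.log?.file?.path?.contains('apache')",
        "dataset": "apache"
      }
    }
  ]
}

If the first condition matches, the document is rerouted immediately and the second processor is skipped; otherwise the second condition is evaluated.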

The reroute processor ensures that the data_stream.<type|dataset|namespace> fields are set according to the new target. If the document contains an event.dataset value, it will be updated to reflect the same value as data_stream.dataset.

Note that the client needs permissions for the final target. Otherwise, the document is rejected with a security exception that looks like this:

{"type":"security_exception","reason":"action [indices:admin/auto_create] is unauthorized for API key id [8-dt9H8BqGblnY2uSI--] of user [elastic/fleet-server] on indices [logs-foo-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]"}

Reroute options

Name | Required | Default | Description
destination | no | - | A static value for the target. Can't be set when the dataset or namespace option is set.
dataset | no | {{data_stream.dataset}} | Field references or a static value for the dataset part of the data stream name. In addition to the criteria for <<indices-create-api-path-params, index names>>, cannot contain - and must be no longer than 100 characters. Example values are nginx.access and nginx.error. Supports field references with a mustache-like syntax (denoted as {{double}} or {{{triple}}} curly braces). When resolving field references, the processor replaces invalid characters with _. Uses the <dataset> part of the index name as a fallback if all field references resolve to a null, missing, or non-string value.
namespace | no | {{data_stream.namespace}} | Field references or a static value for the namespace part of the data stream name. See the criteria for <<indices-create-api-path-params, index names>> for allowed characters. Must be no longer than 100 characters. Supports field references with a mustache-like syntax (denoted as {{double}} or {{{triple}}} curly braces). When resolving field references, the processor replaces invalid characters with _. Uses the <namespace> part of the index name as a fallback if all field references resolve to a null, missing, or non-string value.
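For example, a minimal hypothetical configuration of the explicit-target mode described above, which sends matching documents to a fixed data stream:

{
  "reroute": {
    "destination": "logs-generic-default"
  }
}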

The if option can be used to define the condition under which the document should be rerouted to a new target.

{
  "reroute": {
    "tag": "nginx",
    "if" : "ctx?.log?.file?.path?.contains('nginx')",
    "dataset": "nginx"
  }
}

The dataset and namespace options can contain either a single value or a list of values that are used as fallbacks.
If a field reference evaluates to null or is not present in the document, the next value or field reference in the list is used.
If a field reference evaluates to a non-string value, the processor fails.

In the following example, the processor would first try to resolve the value for the service.name field to determine the value for dataset.
If that field resolves to null, is missing, or is a non-string value, it would try the next element in the list.
In this case, this is the static value "generic".
The namespace option is configured with just a single static value.

{
  "reroute": {
    "dataset": [
        "{{service.name}}",
        "generic"
    ],
    "namespace": "default"
  }
}
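A hypothetical way to try this out is the simulate pipeline API (the request shape is standard Elasticsearch; the document values are illustrative, and this assumes the simulate response reflects the rewritten target index):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "reroute": {
          "dataset": [
            "{{service.name}}",
            "generic"
          ],
          "namespace": "default"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "logs-generic-default",
      "_source": {
        "service": {
          "name": "nginx"
        }
      }
    }
  ]
}

Here service.name resolves to nginx, so the document would be routed to logs-nginx-default; without a service.name field, the static fallback generic keeps it in logs-generic-default.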

Depends on

// Default namespace used when none can be determined from the document.
private static final String DATA_STREAM_DEFAULT_NAMESPACE = "default";
// Name of the field that is kept consistent with data_stream.dataset.
private static final String EVENT_DATASET = "event.dataset";

// Characters that are invalid in the dataset part of a data stream name;
// the processor replaces them with '_' when sanitizing resolved values.
private static final char[] DISALLOWED_IN_DATASET = new char[] { '\\', '/', '*', '?', '\"', '<', '>', '|', ' ', ',', '#', ':', '-' };
Member:

Can you make sure this is aligned with what we have in https://github.com/elastic/package-spec for validation? @mtojek will likely know where to point you to. Same for namespace.

@felixbarny (Member Author):

I don't think the package spec currently defines the allowed characters for data_stream.* fields. See also the spec for dataset. I've just tried using invalid characters in the manifest of an integration, such as upper-case chars and -. Neither elastic-package lint nor elastic-package build yielded an error.

I took the validation rules from https://github.com/elastic/ecs/blob/main/rfcs/text/0009-data_stream-fields.md#restrictions-on-values and https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-api-path-params.

Member:

@mtojek @jsoriano Can you chime in here? I would have expected that we do some validation.

Member:

We have had this issue open for a long time (elastic/package-spec#57); I will make sure it ends up in our backlog.

@felixbarny (Member Author):

Discussing some open questions about the processor's options:

Only set event.dataset if it is already set on the document to ensure that it's the exact same value as data_stream.dataset?

This has the implication that ECS loggers need to set event.dataset in addition to, or instead of, data_stream.dataset so that the logs rate component in the Logs UI works properly. But to be compatible with the widest range of stack versions and UI features that rely on event.dataset, ECS loggers will need to set both anyway. See also elastic/ecs-logging#38.

Overall, a net-positive change in my book to not overly promote event.dataset. Implemented in c30141d.

Don't treat event.dataset as a fallback for routing if data_stream.dataset is empty?

This doesn't need to be part of this processor. If this makes sense for specific use cases (I'm thinking about ECS logs), we can just add a processor like this:

- set:
    field: data_stream.dataset
    copy_from: event.dataset
    # only copy when event.dataset is a string longer than one character
    if: ctx.event?.dataset instanceof String && ctx.event.dataset.length() > 1
    # never overwrite an existing data_stream.dataset value
    override: false

If we can rely on the fact that shippers always populate data_stream.* fields when sending events to a routing pipeline, we could just ignore event.dataset and simplify the logic.

There's a special case for ECS loggers where the shipper would send the data to a generic routing pipeline but

Rather than having the processor natively fall back to event.dataset, users could just use a set processor if they want the fallback.

However, this has the implication that if event.dataset: foo is set on the document, and data_stream.dataset is unset, the doc will be routed to the generic dataset and event.dataset will be overridden to generic. That's to comply with the spec that event.dataset and data_stream.dataset should be the same.

This is quite a contrived case, however, as I can't think of a valid case where event.dataset is set and data_stream.dataset is missing. That's even true for ECS loggers. While they might set only the former, the log message will end up as a JSON-encoded string in the message field. We can add a fallback in the specific ECS JSON log pipeline.

Should we sanitize data_stream.* fields that contain invalid characters?

I think we should. This makes it much easier to route events to a data stream with dynamically computed values, for example using azure.springcloudlogs.app_name as the dataset. There are no restrictions on the allowed characters for this field. Still, it makes for a good dataset. Rather than each processor having to sanitize dynamic values, it seems better to rely on the data_stream_router processor for sanitization.
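To illustrate (a hypothetical configuration, using the processor name as of this point in the discussion): if azure.springcloudlogs.app_name resolves to a value like my-app name, the disallowed - and space characters would be replaced with _, routing the document to a dataset like my_app_name.

{
  "data_stream_router": {
    "dataset": "{{azure.springcloudlogs.app_name}}"
  }
}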

Should we parse _index if data_stream.* fields are missing?

While our integrations set data_stream.* fields, there's no general guarantee that these are available. Especially when events are coming from shippers we don't control, these might be missing.

It seems sensible for this processor to guarantee that the index name is always consistent with the data_stream.* fields. It could do so by falling back to parsing _index rather than always using dataset: generic and namespace: default as a fallback.

But I'd say this is an optional enhancement of the processor that's not necessarily in scope of the first iteration.

@felixbarny (Member Author) commented Apr 14, 2022

> Should we parse _index if data_stream.* fields are missing?

Implemented in 41c23cc.

Implications:

  • Processor can only be used on data streams matching <type>-<dataset>-<namespace>
  • Ensures data_stream.* fields are in sync with _index, even if the source doc didn't contain data_stream.* fields
  • Changing the type is disallowed; it is always inferred from _index and is not configurable on the processor
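As a hypothetical illustration of this parsing fallback (index name and document are made up): given a document like the following, whose source contains no data_stream.* fields, the processor would derive type logs, dataset nginx, and namespace production from _index.

{
  "_index": "logs-nginx-production",
  "_source": {
    "message": "GET /index.html HTTP/1.1 200"
  }
}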

@felixbarny felixbarny force-pushed the data_stream_router_processor branch from d7778a9 to 2d80949 on April 15, 2022 at 15:57
@felixbarny felixbarny marked this pull request as ready for review April 15, 2022 15:59
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Apr 15, 2022
@elasticmachine (Collaborator):

Pinging @elastic/es-data-management (Team:Data Management)

@ruflin (Member) commented Apr 25, 2022

@felixbarny My general comment on #76511 (comment) is: let's keep the first iteration as simple as possible. It is always possible to add more features.

  • Let's ignore event.dataset; users can use a set processor
  • Sanitise: can we skip it at first or make it a config option that is off by default? I agree that we will need it eventually.
  • Parse _index: would be great, but could be done as an enhancement

@felixbarny (Member Author):

> Let's ignore event.dataset; users can use a set processor

Currently, the processor only makes sure that if the source event contains event.dataset it doesn't conflict with data_stream.dataset. But I agree that this could also be done with a set processor. However, the risk of that is that event.dataset and data_stream.dataset may be inconsistent.

> Sanitise: can we skip it at first or make it a config option that is off by default? I agree that we will need it eventually.

What are the cases where you'd want to have sanitization off? Could we make sanitization on by default (which is what the PR currently does) and later add a switch to disable it?

> Parse _index: would be great, but could be done as an enhancement

I've added that already. At the same time, I've removed the ability to set data_stream.type in the processor, as it will always use the type derived from the index name. Hence, I think this should go in with the first version, as it wouldn't be possible to add it later without breaking changes.

@ruflin (Member) commented Apr 25, 2022

> What are the cases where you'd want to have sanitization off? Could we make sanitization on by default (which is what the PR currently does) and later add a switch to disable it?

If it does the sanitization, will it also clean up the document itself, meaning modify the data_stream.dataset value? My suggestion to have it off by default is mostly my instinct to disable magic by default, which forces users to think about why they do not match the default. But I see your scenario that users just want to ingest logs and should not be exposed to this detail. I'm on board with turning it on by default, which could also mean not having the config at first ;-)

@felixbarny felixbarny self-assigned this Apr 3, 2023
@ruflin (Member) left a review comment:

I left some comments for the docs. But this should not hold this PR back. We can always follow up with more detailed docs.

What if a user uses type: foo and then the data stream values? It seems we don't validate the type anywhere? It should only be logs, metrics, traces, or synthetics.

On docs/reference/ingest/processors/reroute.asciidoc (outdated, resolved):
Supports field references with a mustache-like syntax (denoted as `{{double}}` or `{{{triple}}}` curly braces). When resolving field references, the processor replaces invalid characters with `_`. Uses the `<dataset>` part of the index name as a fallback if all field references resolve to a `null`, missing, or non-string value.
| `namespace` | no | `{{data_stream.namespace}}` a| Field references or a static value for the namespace part of the data stream name. See the criteria for <<indices-create-api-path-params, index names>> for allowed characters. Must be no longer than 100 characters.

Supports field references with a mustache-like syntax (denoted as `{{double}}` or `{{{triple}}}` curly braces). When resolving field references, the processor replaces invalid characters with `_`. Uses the `<namespace>` part of the index name as a fallback if all field references resolve to a `null`, missing, or non-string value.
Member:

Maybe use a * and write it below just once? There you could also add the invalid chars list.

{
  "reroute": {
    "tag": "nginx",
    "if" : "ctx?.log?.file?.path?.contains('nginx')",
Member:

We need to describe a little bit more how this if condition works: why ctx is there and all the ?. operators. If you are not used to writing ingest pipeline conditions, this might not be obvious. I would add a very quick description here and then link to the ES docs.

@felixbarny (Member Author):

The docs contain a description of the if option and a link to the corresponding docs. Isn't that enough?

Member:

This is what we do in most places of our docs, and I really don't like it. The link is important, but for a user to be successful, they should not have to jump through 3 doc pages to get a single task completed. Instead, I would rather repeat some of the docs so that the basic use case is all covered in one flow.

@joegallo (Contributor):

Adding a comment for the record: I'm 99% of the way to +1 on the code. The only item outstanding for me is the discussion about leniency at #76511 (comment).

@joegallo (Contributor) left a review comment:

🚀

Labels
:Data Management/Ingest Node (Execution or management of Ingest Pipelines including GeoIP), >enhancement, external-contributor (Pull request authored by a developer outside the Elasticsearch team), Team:Data Management (Meta label for data/management team), v8.8.0
Development

Successfully merging this pull request may close these issues.

[Feature] Document based ingest routing