Add reroute processor #76511
private static final String DATA_STREAM_DEFAULT_NAMESPACE = "default";
private static final String EVENT_DATASET = "event.dataset";

private static final char[] DISALLOWED_IN_DATASET = new char[] { '\\', '/', '*', '?', '\"', '<', '>', '|', ' ', ',', '#', ':', '-' };
Can you make sure this is aligned with what we have in https://github.com/elastic/package-spec for validation? @mtojek will likely know where to point you. Same for namespace.
I don't think the package spec currently defines the allowed characters for data_stream.* fields. See also the spec for dataset. I've just tried using invalid characters in the manifest of an integration, such as upper-case chars and -. Both elastic-package lint and elastic-package build did not yield an error.
I took the validation rules from https://github.com/elastic/ecs/blob/main/rfcs/text/0009-data_stream-fields.md#restrictions-on-values and https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-api-path-params.
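To make the rules under discussion concrete, here is a minimal sketch of the sanitization step. It is not the code from this PR. The character list is copied from DISALLOWED_IN_DATASET in the diff above, and the replacement with `_` and the 100-character limit come from the docs in this PR. The class and method names are hypothetical, and lower-casing and truncating (rather than rejecting) over-long values are assumptions.

```java
import java.util.Locale;

// Hypothetical helper, not the PR's implementation.
final class DatasetSanitizer {
    // Copied from DISALLOWED_IN_DATASET in the diff under review.
    private static final char[] DISALLOWED_IN_DATASET =
        { '\\', '/', '*', '?', '"', '<', '>', '|', ' ', ',', '#', ':', '-' };

    // Assumption: values longer than 100 characters are truncated, not rejected.
    private static final int MAX_LENGTH = 100;

    static String sanitize(String value) {
        // Assumption: upper-case input is lower-cased before the character check.
        String lower = value.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder(Math.min(lower.length(), MAX_LENGTH));
        for (int i = 0; i < lower.length() && i < MAX_LENGTH; i++) {
            char c = lower.charAt(i);
            // Each disallowed character is replaced with an underscore.
            out.append(isDisallowed(c) ? '_' : c);
        }
        return out.toString();
    }

    private static boolean isDisallowed(char c) {
        for (char d : DISALLOWED_IN_DATASET) {
            if (c == d) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, a value such as "Nginx-Access Log" would come out as "nginx_access_log", while "nginx.access" passes through unchanged.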
We have had this issue open for a long time (elastic/package-spec#57); I will ensure it ends up in our backlog.
...s/ingest-common/src/main/java/org/elasticsearch/ingest/common/DataStreamRouterProcessor.java
Discussing some option questions: Only set
Implemented in 41c23cc. Implications:
54e993a to 2cad2d4
d7778a9 to 2d80949
Pinging @elastic/es-data-management (Team:Data Management)
@felixbarny My general comment around #76511 (comment) is: let's keep the first iteration as simple as possible. It is always possible to add more features.
Currently, the processor only makes sure that if the source event contains
What are the cases where you'd want to have sanitization off? Could we make sanitization on by default (which is what the PR currently does) and later add a switch to disable it?
I've added that already. At the same time, I've removed the ability to set |
If it does the sanitization, will it also clean up the document itself, meaning it modifies the data_stream.dataset value? My suggestion to have it off by default mostly comes from my instinct to disable magic by default, which requires users to think about why they do not match the default. But I see your scenario that users just want to ingest logs and should not be exposed to this detail. I'm on board with turning it on by default, which could also mean not having the config at first ;-)
I left some comments for the docs. But this should not hold this PR back. We can always follow up with more detailed docs.
What if a user uses type: foo and then the data stream values? It seems we don't validate the type anywhere. It should only be logs, metrics, traces, or synthetics.
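The check asked for here could look roughly like the following. This is only a sketch of the reviewer's suggestion, not something the PR is known to implement: the class and method names are hypothetical, and the allowed set contains just the four types named in the comment.

```java
import java.util.Set;

// Hypothetical validation helper illustrating the review suggestion.
final class DataStreamTypes {
    // The four types named in the review comment.
    private static final Set<String> ALLOWED = Set.of("logs", "metrics", "traces", "synthetics");

    static String requireValidType(String type) {
        // Check for null first: Set.of(...).contains(null) would throw a NullPointerException.
        if (type == null || !ALLOWED.contains(type)) {
            throw new IllegalArgumentException(
                "invalid data stream type [" + type + "], expected one of " + ALLOWED);
        }
        return type;
    }
}
```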
Supports field references with a mustache-like syntax (denoted as `{{double}}` or `{{{triple}}}` curly braces). When resolving field references, the processor replaces invalid characters with `_`. Uses the `<dataset>` part of the index name as a fallback if all field references resolve to a `null`, missing, or non-string value.
| `namespace` | no | `{{data_stream.namespace}}` a| Field references or a static value for the namespace part of the data stream name. See the criteria for <<indices-create-api-path-params, index names>> for allowed characters. Must be no longer than 100 characters.

Supports field references with a mustache-like syntax (denoted as `{{double}}` or `{{{triple}}}` curly braces). When resolving field references, the processor replaces invalid characters with `_`. Uses the `<namespace>` part of the index name as a fallback if all field references resolve to a `null`, missing, or non-string value.
Maybe use a * and write it below just once? There you could also add the invalid chars list.
{
  "reroute": {
    "tag": "nginx",
    "if" : "ctx?.log?.file?.path?.contains('nginx')",
We need to describe a little more about how this if condition works: why ctx is there, and what all the ? are for. If you are not used to writing ingest pipeline conditions, this might not be obvious. I would do a very quick description here and then link to the ES docs.
The docs contain a description of the if option and a link to the corresponding docs. Isn't that enough?
This is what we do in most places of our docs and I really don't like it. The link is important, but to be successful a user should not have to jump through 3 doc pages to get a single task completed. Instead, I would rather repeat some of the docs so that the basic use case is covered in one flow.
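As a rough aid for readers of this thread: in an ingest condition, ctx is the document being processed, and each ?. is the null-safe operator, which stops navigating as soon as a segment is missing instead of failing. The sketch below is a plain-Java analogue of what the condition in the example is meant to express. It is not how Elasticsearch evaluates the script, the names are hypothetical, and treating a missing or non-string path as "no match" is an assumption about the intent.

```java
import java.util.Map;

// Hypothetical plain-Java analogue of: ctx?.log?.file?.path?.contains('nginx')
final class NullSafePath {
    // Walks nested maps one key at a time and returns null as soon as a
    // segment is missing, which is the role each `?.` plays in the condition.
    static Object get(Map<String, Object> ctx, String... path) {
        Object current = ctx;
        for (String key : path) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    // Assumption: a missing or non-string path counts as "no match".
    static boolean pathContains(Map<String, Object> ctx, String needle) {
        Object value = get(ctx, "log", "file", "path");
        return value instanceof String && ((String) value).contains(needle);
    }
}
```

Without the null-safe steps, a document that has no log field at all would make the condition fail with an error rather than simply not match.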
modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/RerouteProcessor.java
Co-authored-by: Nicolas Ruflin <[email protected]>
Adding a comment for the record: I'm 99% of the way to +1 on the code. The only item outstanding for me is the discussion about leniency at #76511 (comment).
🚀
Docs preview
The reroute processor allows you to route a document to another target index or data stream. It has two main modes:
When setting the destination option, the target is explicitly specified and the dataset and namespace options can’t be set.
When the destination option is not set, this processor is in a data stream mode. Note that in this mode, the reroute processor can only be used on data streams that follow the data stream naming scheme. Trying to use this processor on a data stream with a non-compliant name will raise an exception.
The name of a data stream consists of three parts: <type>-<dataset>-<namespace>. See the data stream naming scheme documentation for more details.
This processor can use either static values or field references from the document to determine the dataset and namespace components of the new target. See Table 38, “Reroute options” for more details.
It’s not possible to change the type of the data stream with the reroute processor.
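The data stream mode described above can be sketched as follows: split the current name into its three parts, pick the first usable dataset and namespace value, fall back to the current parts when none resolves, and carry the type over unchanged. This is an illustration under stated assumptions, not the PR's code. All names are hypothetical, candidate values are assumed to be already resolved (null when a field is missing), and treating everything after the second dash as the namespace is an assumption.

```java
import java.util.List;

// Hypothetical sketch of building the new target name in data stream mode.
final class DataStreamTarget {
    // Splits "<type>-<dataset>-<namespace>" and rejects non-compliant names.
    // Assumption: anything after the second dash belongs to the namespace.
    static String[] parse(String name) {
        String[] parts = name.split("-", 3);
        if (parts.length != 3 || parts[0].isEmpty() || parts[1].isEmpty() || parts[2].isEmpty()) {
            throw new IllegalArgumentException("invalid data stream name: [" + name + "]");
        }
        return parts;
    }

    // Returns the first non-null candidate, or the fallback when every candidate is null.
    static String firstResolved(List<String> candidates, String fallback) {
        for (String candidate : candidates) {
            if (candidate != null) {
                return candidate;
            }
        }
        return fallback;
    }

    static String reroute(String currentName, List<String> datasets, List<String> namespaces) {
        String[] current = parse(currentName);
        String dataset = firstResolved(datasets, current[1]);
        String namespace = firstResolved(namespaces, current[2]);
        // The type (current[0]) is carried over unchanged.
        return current[0] + "-" + dataset + "-" + namespace;
    }
}
```

With a current name of logs-generic-default, a resolved dataset of nginx.access, and a static namespace of production, the sketch yields logs-nginx.access-production. A name such as my-index is rejected because it does not have three parts.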
After a reroute processor has been executed, all the other processors of the current pipeline are skipped, including the final pipeline. If the current pipeline is executed in the context of a Pipeline processor, the calling pipeline will be skipped, too. This means that at most one reroute processor is ever executed within a pipeline, which lets you define mutually exclusive routing conditions, similar to an if, else-if, else-if, … condition.
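The if, else-if behavior can be pictured with a small first-match-wins loop. The sketch below only models the ordering semantics described in the paragraph above. The Rule type and the route method are hypothetical stand-ins for a list of reroute processors with conditions, not Elasticsearch classes.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of "at most one reroute processor runs per pipeline".
final class FirstMatchRouting {
    // Stand-in for one reroute processor: a condition plus a target dataset.
    static final class Rule {
        final Predicate<Map<String, Object>> condition;
        final String dataset;

        Rule(Predicate<Map<String, Object>> condition, String dataset) {
            this.condition = condition;
            this.dataset = dataset;
        }
    }

    // Evaluates rules in order and stops at the first match, so later
    // rules are never evaluated once one has fired.
    static String route(List<Rule> rules, Map<String, Object> doc, String defaultDataset) {
        for (Rule rule : rules) {
            if (rule.condition.test(doc)) {
                return rule.dataset;
            }
        }
        // No rule matched: the document keeps its default dataset.
        return defaultDataset;
    }
}
```

Because evaluation stops at the first match, a broad rule placed after a narrow one acts like a trailing else-if rather than overriding the earlier decision.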
The reroute processor ensures that the data_stream.<type|dataset|namespace> fields are set according to the new target. If the document contains an event.dataset value, it will be updated to reflect the same value as data_stream.dataset.
Note that the client needs to have permissions to the final target. Otherwise, the document will be rejected with a security exception which looks like this:
Reroute options

destination: … dataset or namespace option is set.

dataset (default {{data_stream.dataset}}): … - and must be no longer than 100 characters. Example values are nginx.access and nginx.error.
Supports field references with a mustache-like syntax (denoted as {{double}} or {{{triple}}} curly braces). When resolving field references, the processor replaces invalid characters with _. Uses the <dataset> part of the index name as a fallback if all field references resolve to a null, missing, or non-string value.

namespace (default {{data_stream.namespace}}): Supports field references with a mustache-like syntax (denoted as {{double}} or {{{triple}}} curly braces). When resolving field references, the processor replaces invalid characters with _. Uses the <namespace> part of the index name as a fallback if all field references resolve to a null, missing, or non-string value.

The if option can be used to define the condition in which the document should be rerouted to a new target.

The dataset and namespace options can contain either a single value or a list of values that are used as a fallback. If a field reference evaluates to null or is not present in the document, the next value or field reference is used. If a field reference evaluates to a non-String value, the processor fails.

In the following example, the processor would first try to resolve the value for the service.name field to determine the value for dataset. If that field resolves to null, is missing, or is a non-string value, it would try the next element in the list. In this case, this is the static value "generic". The namespace option is configured with just a single static value.

Depends on