Event routing: open questions and prototyping #3185
I think of this differently. There is one pipeline per data stream. It is quite possible that two inputs are configured to ship data to the same data stream, and the same routing would apply.
I would expect it to be based on the container name in most scenarios and then some sub-routing based on a grok pattern. An example here is nginx with error and access logs. It is possible that one goes to stderr and the other to stdout, so this metadata would have to be in the event itself. Agree, routing conditions should be as efficient as possible, but there will be cases where grok will be needed.
As long as the routing conditions for the nginx integration don't slow down the conditions for mysql, for example, that's fine I think.
How would a pipeline for that look? Something like this?

```
- data_stream_router:
    if: ctx.input?.type == 'filestream' && ctx.log?.file?.path?.contains('nginx')
    dataset: nginx
- data_stream_router:
    if: ctx.input?.type == 'container' && ctx.kubernetes?.container?.name?.contains('nginx')
    dataset: nginx
```
I don't think there should be a global "router" data stream, but instead one for each purpose. So we would have one such router data stream per purpose. The nginx example above would look a bit different assuming this is all in k8s (the router name here is one I just made up).

I now better understand what you mean with "pipeline per input". I don't think of k8s as an input, as in the end it is basically reading files from disk. It is more about the metadata that is attached to it. In some scenarios it is an input, like syslog; in others it is an environment like k8s / docker where the data comes from. There should be one routing pipeline for each environment data can come from. If we have a global generic router, it would route data based on the metadata and not on the content of the message itself.
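A rough sketch of what such a generic, metadata-only router could look like, assuming the proposed `data_stream_router` processor with the `if`/`dataset` parameters from the snippet above (the pipeline name, dataset names, and conditions are made up for illustration):

```
PUT _ingest/pipeline/logs-router
{
  "description": "Generic first-stage router: dispatch on available metadata, never on message content",
  "processors": [
    {
      "data_stream_router": {
        "if": "ctx.kubernetes?.container?.name != null",
        "dataset": "k8s.router"
      }
    },
    {
      "data_stream_router": {
        "if": "ctx.input?.type == 'syslog'",
        "dataset": "syslog.router"
      }
    },
    {
      "data_stream_router": {
        "if": "ctx.log?.file?.path != null",
        "dataset": "file.router"
      }
    }
  ]
}
```

The assumption here is that the first matching rule reroutes the document and ends the routing, and that anything without recognizable metadata simply falls through to a default data stream.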
That makes sense. It's not about the input type but about what kind of metadata is available. For example, when using a Kafka input, some events may have k8s metadata while others may need to be routed by looking at their log.file.path. That speaks for a global routing pipeline that doesn't check the input type. Still, each integration will have to provide different routing conditions based on the available metadata.
I'm not sure if you're able to tell from k8s logs whether something was logged to stdout or stderr. Do you know if that's possible? At least we don't capture a similar field currently.
Agree. And the condition can be very different for each metadata type. I can foresee that some integrations will only provide a routing rule for k8s at first, but not for syslog, for example.
If we don't have it, it means nginx logs will just be forwarded to the default data stream.
cc: @aspacca @kaiyan-sheng
Here's a diagram of how multi-stage event routing may look:
[diagram: multiple levels of routing pipelines fanning out into more specific data streams]
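To make the fan-out idea more concrete (this is my own sketch, not the content of the diagram; pipeline and dataset names are invented and the proposed `data_stream_router` processor is assumed), a second-stage router for the k8s environment could dispatch on the container name:

```
PUT _ingest/pipeline/logs-k8s.router
{
  "description": "Second-stage router: dispatch k8s container logs to integration datasets",
  "processors": [
    {
      "data_stream_router": {
        "if": "ctx.kubernetes?.container?.name != null && ctx.kubernetes.container.name.contains('nginx')",
        "dataset": "nginx"
      }
    },
    {
      "data_stream_router": {
        "if": "ctx.kubernetes?.container?.name != null && ctx.kubernetes.container.name.contains('mysql')",
        "dataset": "mysql"
      }
    }
  ]
}
```

The nginx integration's own pipeline could then do further sub-routing (access vs. error logs via grok) without its conditions ever running against mysql traffic.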
@felixbarny Thank you for putting this diagram together. This generally matches the mental model I was envisioning as well: multiple levels of routing pipelines that fan out into finer levels of granularity in order to reduce the impact of routing conditionals between integrations/data sources.
Integrations are likely going to need to provide some default routing rules, but I also think we're likely to need a way for users to customize these rules without editing ingest pipelines directly (as the pipelines should be fully managed by the system). There are a few obvious examples I can think of that would require this.
Another aspect we need to consider is the inspectability of routing rules. While the final routing decision will be obvious from the resulting destination index, we need to provide the ability to easily test these routing rules in our managed pipelines. This would be especially helpful in any UI we build for customizing these routing rules. As the simulate pipeline API works today, this is not possible, because the simulate API does not run the pipeline of the new destination index.
Results only reflect the simulated pipeline and not the subsequent index’s pipeline:
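For illustration (the pipeline name and document below are made up; a routing pipeline like the sketches above is assumed), simulating only the routing pipeline shows the rerouting decision but never executes the target data stream's own pipeline:

```
POST _ingest/pipeline/logs-router/_simulate
{
  "docs": [
    {
      "_index": "logs-router-default",
      "_source": {
        "message": "10.42.0.1 - - [01/Jan/2022:00:00:00 +0000] \"GET / HTTP/1.1\" 200 512",
        "kubernetes": { "container": { "name": "nginx" } }
      }
    }
  ]
}
```

The response would show the document with its updated routing target, but the nginx integration's pipeline (grok parsing, etc.) would not have run on it.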
We could likely emulate this behavior from the Kibana side and call the next pipeline's simulate until the final destination is reached.
Maybe we could introduce something like a simulate mode for the index/bulk API that returns the final document instead of actually indexing it.
From the point of view of the elastic-serverless-forwarder, multi-stage event routing will be required, as shown by the diagram from @felixbarny. The case is similar to the Kafka one.
I just wonder what the performance impact could be for the general router pipeline.
The idea is that the general router only applies cheap conditions on metadata, so its per-event overhead should stay small.
Could you provide an example of a raw Nginx log and how we'd find out that it's indeed an Nginx log? Is there any metadata available that we can use to quickly identify, with a relatively high confidence, that the log is indeed an Nginx log and should be parsed by the Nginx integration? If the answer is that we need to see whether a grok expression matches on every document, without being able to exclude most of the documents with checks that are cheaper than grok, it feels like that would be prohibitively expensive.
It mostly depends on how expensive the checks within the routing pipeline are.
I had a discussion with @ruflin about what the default data stream should be if no rule matches. We haven't reached a final conclusion yet, but I think it's an interesting question to surface to a wider group. The options range from coarse-grained to fine-grained.
I think we should go with option 3. It fits the ECS definition of data_stream.dataset very well. A similar quote from the data stream blog:
Data from the same container or application have a structure that is often distinct from other services. Frequently, the services are owned by different teams and have different retention requirements. Also, segmenting the data in this way makes it much easier to add structure to the data after ingestion via runtime fields. Adding structure via runtime fields is much harder if the logs of multiple sources are mixed together in a single default index.
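As a rough illustration of the runtime-fields point (the data stream name, field, and dissect pattern are made up, and access to `params._source` from a mapped runtime field is assumed), structure could be added after the fact to a per-service data stream without reindexing:

```
PUT logs-myservice-default/_mapping
{
  "runtime": {
    "http.response.status_code": {
      "type": "long",
      "script": {
        "source": "String status = dissect('%{ip} %{ident} %{auth} [%{ts}] \"%{verb} %{request} HTTP/%{version}\" %{status} %{size}').extract(params._source.message)?.status; if (status != null) emit(Integer.parseInt(status));"
      }
    }
  }
}
```

For new backing indices, the same runtime field would also need to go into the data stream's index template; the snippet only patches the existing backing indices.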
Hi @felixbarny, please let me give some context about the topic from the elastic-serverless-forwarder side. Currently we support auto-discovering AWS logs integrations: customers do not need to provide any information (such as a data stream to ingest to, or any other specific hints about the integration the logs belong to), and for 80% of the AWS integrations we are able to send to the correct data stream in the default namespace. My initial thought was about relying instead on the event routing pipeline, especially if there will be a generic/AWS default one.
What about when metadata is not available, as in the Kafka / elastic-serverless-forwarder case?
I will give an example about an AWS integration (native AWS integrations are the main support case for the elastic-serverless-forwarder):
80% of the AWS integrations will rely on a grok expression match; the other 20% are JSON-based, and we can just check a minimal set of non-colliding fields in the JSON.
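A sketch of what such a JSON check could look like as a server-side routing rule (CloudTrail is picked as an example; the exact fields checked in the raw message and the `data_stream_router` parameters are assumptions on my part):

```
PUT _ingest/pipeline/logs-aws.json.router
{
  "processors": [
    {
      "data_stream_router": {
        "if": "ctx.message != null && ctx.message.startsWith('{') && ctx.message.contains('\"eventVersion\"') && ctx.message.contains('\"eventSource\"')",
        "dataset": "aws.cloudtrail"
      }
    }
  ]
}
```

Substring checks like this stay cheap compared to grok, at the cost of some false-positive risk if another JSON source happens to use the same keys.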
In the specific case of the elastic-serverless-forwarder, we have to understand on which side of the process the performance impact matters more. The forwarder is a Lambda with limited resources and, most importantly, an execution timeout that cannot exceed 15 minutes. There is a continuing mechanism in case an ingestion "batch" cannot be fulfilled within the 15 minutes, so no events should be lost, but the longer they are "continued", the bigger the ingestion delay will be. From my stress tests, the bottleneck was on the cluster side: if the cluster is not scaled to support the ingestion rate of the Lambda, bulk requests will start to fail or time out (which is worse from the Lambda point of view, because it will time out as well while waiting for the bulk request response). For sure, once we have to identify metadata by parsing the message payload in order to provide it to the event router so that it does not need a grok expression, then we already have the knowledge we'd need from the event routing :)
Do you have examples of how the full events would look? Maybe a gist or a link to the full payload that would be sent to Elasticsearch? Is there really no metadata on the event that would signal where the log is coming from? Is it just a raw message with no context? When events are coming from Kafka, they might still have rich metadata about the source associated with the event, for example when sending k8s logs to Kafka first and then forwarding them to Elasticsearch. In this case, the logs from Kafka would have the container name. The job of the Kafka router would then be to find out which metadata is available and to route accordingly.
Not saying this is impossible, but it feels like this would be a massive overhead and should only be the very last resort. Is there any way to pre-group the logs into different data streams even if we can't associate an integration with them at ingest time? Then we could apply structure to these events after the fact by adding runtime fields to these data streams and/or adjust the routing to auto-discover the logs based on the source going forward. Is the serverless forwarder doing this type of routing client-side (meaning within the Lambda function) currently?
The main purpose is to move routing decisions from the client to the server so that they are more easily and dynamically adjustable and so that the clients can be more lightweight. It still relies on the routing decision being rather cheap. While it's not impossible to route solely on grok expressions, this should be the very last resort and it's not what this is primarily designed for. I'd argue that any system that's primarily designed around routing via grok is bound to have a very high CPU impact on ingest, and the impact gets larger the more integrations (and thus grok-based routing conditions) are enabled.
Hi @felixbarny, sorry for the late feedback.
The problem is with the source events: see the three different events we can receive in the elastic-serverless-forwarder at https://gist.github.com/aspacca/e7ed3e8b720081e6c23651af147211f0 (with dummy message content). As you can see, in the metadata from SQS, Kinesis, and CloudWatch Logs we don't have anything we can match to an integration.
We could have a few different strategies for that, for example by event source type (SQS / CloudWatch Logs / Kinesis).
Could you give me an example of how this would work?
The serverless forwarder is currently doing a different type of routing client-side: in the single case of an S3 notification to SQS as the event source, we have the path of the S3 object as metadata of the event. This path is deterministic for 80% of the native AWS integrations, and we leverage this for the automatic routing.
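For example (field names, datasets, and key prefixes here are illustrative only), the same lookup could be expressed as a server-side routing rule on the S3 object key, assuming the proposed `data_stream_router` processor and that the forwarder ships the object key as event metadata:

```
PUT _ingest/pipeline/logs-aws.s3.router
{
  "description": "Route S3-sourced logs by object key prefix (single string check, no grok)",
  "processors": [
    {
      "data_stream_router": {
        "if": "ctx.aws?.s3?.object?.key != null && ctx.aws.s3.object.key.contains('/CloudTrail/')",
        "dataset": "aws.cloudtrail"
      }
    },
    {
      "data_stream_router": {
        "if": "ctx.aws?.s3?.object?.key != null && ctx.aws.s3.object.key.contains('/elasticloadbalancing/')",
        "dataset": "aws.elb_logs"
      }
    }
  ]
}
```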
This is the kind of routing that I think would fit really well into the routing concept, as it is simply based on a single string and a quick lookup.
I agree. Considering the performance implications that emerged in the thread, from the specific point of view of the serverless forwarder the CPU-intensive parsing (grok expressions, etc.) should be kept on the edge. I just wonder whether it makes sense to have two different routing places, one in the client and one in the cluster, identifying the two different scenarios where each should be applied, or whether it is less complex and easier to maintain a single one, localized in the client (given the specific situation of compute-expensive routing logic).
I think ideally most of the routing should happen centrally, as it allows updating routing without having to deploy config changes to the edge or update the binary. If there is some routing that never changes, it could be hardcoded on the edge, but I would rather use it as a fallback option instead of making it part of the design. The key for me is that we get the routing concept into Elasticsearch, as on the edge we can already partially do it.
I was using input tags in the past; that was the way in Graylog. My example is mostly related to syslog. Now, my workaround is to use different IP addresses on the LB to differentiate the inputs (e.g. 1.1.1.1:514 for one box type and 1.1.1.2:514 for another), but that is not IP friendly as you need bigger pools. I was thinking of using … Back then (in Graylog times) I was routing based on remote IP (log.source.address in the Elastic world), but that mostly works for on-prem and only for transparent-capable LBs/proxies. We might have a collector in between. I think some form of edge tagging is still needed unless we want to regex a lot. Just my 2 cents.
We're making progress on event routing via the `data_stream_router` processor. I think now is the time to discuss the remaining open questions on how integrations will actually use event routing rules and to start prototyping.

Open questions:

- … `co.elastic.integration/mysql` …

To answer these questions and to validate the overall approach, we should start building out a prototype. We can do that even with the `data_stream_router` processor not being merged yet. Just trying to define a concrete routing pipeline with a couple of example integrations should help us think through the remaining open questions and validate the approach.

We can also start measuring the performance impact of the routing pipeline conditions and what the expected log ingestion rate is. To do that, we should define Rally tracks that simulate a typical event routing scenario.
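As a starting point for the benchmarking part (this is only a rough sketch of a minimal Rally track; the corpus file, document count, and target data stream are placeholders), something along these lines could drive bulk ingestion through a routing pipeline:

```
{
  "version": 2,
  "description": "Bulk-ingest sample logs through an event routing pipeline",
  "data-streams": [
    { "name": "logs-router-default" }
  ],
  "corpora": [
    {
      "name": "routing-logs",
      "documents": [
        {
          "source-file": "documents.json.bz2",
          "document-count": 1000000,
          "target-data-stream": "logs-router-default"
        }
      ]
    }
  ],
  "schedule": [
    {
      "operation": {
        "operation-type": "bulk",
        "bulk-size": 5000
      },
      "clients": 8,
      "warmup-time-period": 120
    }
  ]
}
```

A real track would also need the matching index templates and routing pipelines installed up front; that setup is omitted here.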
cc @ruflin