[Fleet] Support for document-based routing via ingest pipelines #151898
Pinging @elastic/fleet (Team:Fleet) |
We are thinking of it as a dependency because it is the same ingest pipeline for the first implementation. But ideally, there would be a routing API and integrations would just add their bits to it, not necessarily creating a dependency. What if we build this API in Fleet for now, have it manage the rules, and create the pipeline out of them? What does it mean for the package spec? I'm hoping we can mostly stay away from having dependencies. |
Yeah to be clear, I didn't mean that packages would declare dependencies on one another. But there will be things to consider to ensure that every time a package is installed, all the appropriate pipelines are updated. |
I synced on this today with Observability and discussed what the next steps should be. We came to the conclusion that we should work on a design document that includes the following:

**Package spec for routing rules**

We need to allow packages to define routing rules for how data should be routed from other integration data streams to the integration defining the rule. For example, the Nginx integration should be able to define a rule for the Docker integration on how to detect a document that contains an nginx log and route it to the nginx.access data stream. While the underlying implementation of the routing rule will be part of the integration's ingest pipeline, we want Fleet/EPM to have control over the order in which routing decisions happen. For these two reasons, routing rules should not be defined as part of an integration data stream's regular ingest pipeline. Instead they need to be defined in a separate file in the package. We should also not abstract away too much of the underlying implementation.

**Ingest pipeline design**

We need to specify the order in which the following things will be executed in the pipeline:
We need to strike a balance between allowing flexibility and preventing user customization from conflicting with processing and routing defined by integrations. This design needs to be tested against real use cases (to be provided by Observability). Part of this should also include the naming convention used for the different pipelines and sub-pipelines.

**Fleet API for adding user-defined routing rules**

Users need to be able to add their own routing rules. We may want to offer an API in Fleet/EPM to make managing this simpler. Later, a UI may be built on top of this. The design of this API should include how ingest pipelines are updated to apply new rules or remove ones that are deleted.

**Internal tracking of rules and ingest pipeline updating procedure**

We need to define how Fleet will internally track rules defined in packages and by end users, and how those rules will be installed into ingest pipelines with zero disruption to ingestion. This needs to include how package installation will work when new rules are added, how package upgrades will work, and how package uninstallation will work. |
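To make the "routing rules compiled into an ingest pipeline" idea concrete, here is a minimal sketch of what a source data stream's routing pipeline could contain, using the `reroute` processor that the discussion below converges on; the pipeline contents and field values are illustrative, not the final design:

```json
{
  "processors": [
    {
      "reroute": {
        "tag": "nginx",
        "if": "ctx?.container?.image?.name == 'nginx'",
        "dataset": "nginx.access"
      }
    }
  ]
}
```

Fleet would own the order of these processors, so integration-defined rules, user-defined rules, and any catch-all behavior can be sequenced deliberately.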
@joshdover I'd be interested in what we see the user experience being here. As a user with an nginx docker container in Kubernetes, I might be tempted to first look for the nginx integration, for example, but the standard flow would then prompt me for the log paths etc., which is unnecessary. Or if I go to install the kubernetes integration first, how will I know to then install the nginx integration to capture my logs?

It's almost like we need a dedicated Kubernetes onboarding which guides the user to select the kinds of containers they will be deploying and installs the matching integrations first, then the kubernetes integration; that way the user starts re-routing data straight away.

As a user I would also want to see whether my data is successfully being re-routed. We could consider adding a field when we re-route the data to track its origin, and we could then aggregate on it to give the user a summary somewhere of the data that is being re-routed. |
Good points @hop-dev. I'd hope we can come up with a solution that doesn't require the user to manually install integrations. Instead, we'd either pre-install or auto-install the integration when data flows to a corresponding data stream. IIRC, Josh did a spacetime on this. |
Ah yes, I was involved in a spacetime project in this area: https://github.com/elastic/observability-dev/issues/2100. Slightly different from this use case: we used annotations on the containers themselves to prompt the installation of the integration, then configured the integration to consume the data directly. |
Actually @hop-dev did the auto-install spacetime :) I did a related one to suggest packages to install based on signals in the common data. Both could be potential options here. I suspect this is important as well, and ideally we don't build a solution that only works for Kubernetes sources. I wonder if we could do some 80/20 solution where we pre-install just the ES assets for integrations that define routing rules for another integration. We can then watch for data showing up in those data streams and install the dashboards/other Kibana objects at that time. |
This makes a lot of sense to me |
Wouldn't we rather pre-install the ES assets for the integrations that documents are being re-routed to? How else would Kibana know which integration to install? For @hop-dev's user journey question, that would imply that the routing configuration is tied to the lifecycle of the source integration (such as k8s). |
I think we're saying the same thing? So if a user installs the k8s integration, Fleet would install the ES assets for all (or most popular) integrations that have routing rules specified for k8s container logs. I imagine that the registry APIs would need to expose a list of other integrations that have routing rules for a particular integration.
Good point, this would be one of the caveats. We could probably come up with ways to refresh this periodically, but it's a little less than ideal. I do suspect that users want more control over this - and we need a way to show them more integrations that may have routing rules that are relevant for their use case. This is where I think the design around auto-suggestion of integrations to install may be helpful, where routing only happens once the user has decided to install an integration to enhance the data that is extracted. Relevant spacetime: https://github.com/elastic/observability-dev/issues/2132 |
Synced with @kpollich about this today.

**User flow**

We discussed the general concept and UX flow from Discover:
**Example**
**Questions & some potential answers**
|
@grabowskit @ruflin can you confirm our understanding of the UX and the APIs we're discussing? |
Using KQL to create routing rules sounds like a cool feature, but it also adds complexity. Could we start with a workflow where users manually create a Painless condition?
Do we need to already create a new index template at that point? We could just rely on the default index template for `logs-*-*`.

One challenge is that a routing rule creates a variable number of datasets. For example, when routing on a field like `service.name`, the resulting datasets aren't known up front. By relying on the default `logs-*-*` index template, those datasets would still be covered without creating a template for each of them.
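For context, the fallback being referred to here is the stack's built-in `logs` index template, which matches `logs-*-*`; assuming a reasonably recent stack version, it can be inspected with:

```
GET _index_template/logs
```

Any data stream created by a reroute that has no integration-specific template would fall back to this one.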
Why are we persisting routing rules as saved objects and not just as processors in the routing pipeline? Not having a single source of truth could lead to the reroute processors and saved objects getting out of sync.
I was thinking that we could also just rely on the destination data stream's ingest pipeline as the source of truth.
I'm not sure if that's possible. The reroute processor needs to be added to the pipeline of the sink. Also, a single routing rule can result in creating multiple datasets that are unknown at the time the routing rule is created.
Hm, I see where you're coming from. But if both the sink and the destination have the same pipeline, we would do the same transformation twice and not all transformations are idempotent. |
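To ground the idempotency point: a processor like `append` is a simple example of a non-idempotent transformation. If a pipeline containing the (illustrative) processor below ran once in the sink and again in the destination, the document would end up with the value twice, since `allow_duplicates` defaults to `true`:

```json
{
  "append": {
    "field": "tags",
    "value": ["routed"]
  }
}
```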
@felixbarny One thing that is clear from the discussion so far is that the first-priority use case is not defined. I'm not sure if we first want to target the simpler use case of "route these specific logs to a new data stream" or the more complex "route all of the logs to multiple data streams based on field X". Let's call these:

1. Static routing rules with a specific, known destination data stream
2. Dynamic routing rules that fan out to multiple data streams based on a field's value
Eventually we'll need to support both, but which are we starting with?
You're right, I was focusing on use case (1). If there's not a specific target data stream for a routing rule, I agree that the default template can be relied on.
I'm thinking ahead to this "export as a package" aspect. For that to work, I think a specific package needs to own the routing rule, and for use case (1), the destination package/data stream should be the one to own the rule, not the sink. We could use the routing pipeline as the source of truth, but for use case (1) there still needs to be some link between the processor and the destination data stream. Maybe the contents of the reroute processor itself is good enough. You are right that the reroute processor itself would always be in the "sink" data stream's ingest pipeline, even if it's owned by another package/data stream. I'm not sure how to think of "export a package" for more dynamic routing rules that fan out to multiple data streams (use case (2)). Do you think these would be a better fit as
Yeah I thought about this more last night and I agree, this will be too hard to support. |
The reroute processor will support both use cases. I think we'll also want to update some integrations to make use of (2) relatively soon. For example, changing the syslog integration to route by app_name, or the k8s integration to route by container name.

It will also be possible for users to just start ingesting data into a custom dataset that's not known beforehand. I don't think we'd want to require them to manually create an integration before they can start ingesting data. I suppose we'll need to be able to "lazily" create an integration, for example at the time the user wants to add a pipeline to their custom dataset.
Ah, I see. That makes sense. It would be similar to the built-in Nginx integration adding a routing rule to the k8s integration. |
@joshdover happy to see that your results are pretty similar to what I naively had in mind 🎉
I wonder if we ever want these assets to fly around without an owning integration. Would this be the place where a new integration is created that owns them? |
Hi all. I'm starting to work on some technical definition for this work around Fleet and the Package Spec. I'd like to walk through a basic example to make sure I understand what we need to support here. Let's say we support routing rules at the package spec level, e.g. in a `routing_rules.yml` file:

```yaml
# In the nginx integration, which is our "destination"
rules:
  - description: Route Nginx Kubernetes container logs to the Nginx access data stream
    source_data_stream: logs-kubernetes.router
    destination_data_stream: logs-nginx.access
    if: >
      ctx?.container?.image?.name == 'nginx'
```

❓ I'm basing this structure on the discussion above; please flag anything that doesn't look right.

When this integration is installed, Fleet will parse out this file and add a corresponding processor to the source data stream's ingest pipeline:

```json
{
  "processors": [
    {
      "reroute": {
        "tag": "nginx",
        "if": "ctx?.container?.image?.name == 'nginx'",
        "dataset": "nginx.access"
      }
    }
  ]
}
```

❓ If the Kubernetes integration isn't installed, would Fleet need to install it at this time? I understand the inverse case, where we'll need to "preinstall" all related component templates + ingest pipelines for "destination" packages as @joshdover mentioned in #151898 (comment).

I understand there are other pieces here like permissions, but I'd just like to make sure I'm on the right path with the above example. Thanks! |
Spent some more time with this. I think I've boiled our needs here down into three main feature sets:

1. Package Spec support for routing rules
2. Optimistic installation of destination dataset assets
3. APIs for user-defined custom datasets and routing rules

I've got some notes on the first 2 points here, but I'm still thinking through number 3. Here's a napkin sketch overview of what I've been brainstorming:

**Package Spec support for routing rules**

Goal: allow integrations to ship with routing rules that will generate corresponding `reroute` processors in the source data streams' ingest pipelines.

Package spec:

```yaml
# In the nginx integration
rules:
  - description: Route Nginx Kubernetes container logs to the Nginx access data stream
    source_dataset: kubernetes.router
    destination_dataset: nginx.access
    if: >
      ctx?.container?.image?.name == 'nginx'
```

Supporting routing rules in the package spec would be a separate chunk of work from Fleet support, which I'll get into next. We'll need to make sure the spec + EPR endpoints all fully support routing rules as a first-class resource to support our other features here.

**Optimistic installation of destination dataset assets**

Goal: When an integration is installed, Fleet must also install index/component templates + ingest pipelines for any datasets to which the integration might potentially route data.

Fleet needs a means of fetching all integrations to which the "data sink" integration currently being installed might route data. EPR should provide an endpoint that aggregates the routing rules targeting a given integration, e.g.

```json
[
  {
    "integration": "nginx",
    "source_dataset": "kubernetes.router",
    "destination_dataset": "nginx.access",
    "if": "ctx?.container?.image?.name == 'nginx'"
  },
  {
    "integration": "apache",
    "source_dataset": "kubernetes.router",
    "destination_dataset": "apache.access",
    "if": "ctx?.container?.image?.name == 'apache'"
  }
]
```

This would allow Fleet to perform a lookup for any datasets in which the integration being installed appears as a source, and install the destination assets accordingly.

❓ Is limiting the assets installed for destination datasets necessary, or should we just install everything including Kibana assets? Could we just perform a "standard" installation as part of that process, or is the performance hit of doing that too substantial?

I've got another napkin sketch for this part as well. I think with these two chunks of work, we'd be able to call package-based routing support done as far as Fleet is concerned.
I'll spend some more time with the customization API requirements here, but feel free to chime in with any feedback if you have it before I come back with more. |
Spent some time thinking through the CRUD API needs here, which I'll summarize below. Fleet will provide an API endpoint for persisting "custom logging integrations" which include:
An example API request might look like this:
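A request of roughly this shape would cover the pieces described above; the endpoint path and field names are illustrative guesses for the sake of discussion, not a confirmed Fleet API:

```json
POST /api/fleet/epm/custom_integrations
{
  "integrationName": "my_app",
  "datasets": [
    { "name": "my_app.access", "type": "logs" }
  ]
}
```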
❓ One thing I'm not sure on is the mappings provided. Does the user go through and define fields or select mapping types for the fields detected in their custom logs?

Am I right in thinking these customization APIs are a fairly separate effort from the package-level routing rules work above? I think we could pretty realistically get started on supporting document-based routing rules defined by packages without much more definition than what I've done here, but these customization APIs seem like they're still a bit in flux. Persisting these custom integrations (or "exporting" as it's been referred to a few times above) seems like a follow-up to the package-level support. |
Maybe I'm missing something but I get stuck on the following statements repeated several times above:
Why do we need to install the assets of all potential destination packages when installing the source package? Isn't the explicit installation of the destination package what inserts the routing rule? Then why do we need the destination assets earlier already when there's no routing to it in place? |
I wasn't 100% clear on this. My assumption was that if we install the source integration, we'd need to preinstall assets for every integration it could potentially route to. So, it sounds like the "Optimistic installation of destination dataset assets" isn't necessary here. Only when a user explicitly installs a "destination" integration will any "source" integrations' ingest pipelines be updated. |
My main source of confusion, I think, came from this prior comment
|
(apologies for the spam) @hop-dev's comment above is the source of the above conversation:
I think this point is still valid. Should we expect users to install the Nginx integration manually because it's a routing destination for Kubernetes container logs? How will we surface that knowledge and make it clear to users in an instance like this? It makes sense technically for us to defer the creation of destination assets until the destination integration is explicitly installed, but the onboarding question remains. Here's how I understand this would work without the optimistic installation process: |
Yes, that's still the plan. But we haven't implemented that, yet.
No, I don't think so. But if the k8s integration is installed at a later time, it should include the rule from the Nginx package if it has already been installed.
Yes, I think we should start with that.
Good question. I think there's no need for the user to install the integration; we can just store the raw logs, and users might be fine with that. Maybe we could suggest that the user install the Nginx integration when we detect that the user looks at Nginx logs. Another question I have is whether we should install a destination integration when data is sent to a data stream that this integration manages. Also, would there be a way for users to set a label on their containers to give a hint about installing the integration? Overall, I think the most important routing workflow that we should support for now is not necessarily a destination integration registering a routing rule in a source integration, but rather doing default routing in the source integrations on the service or container name and enabling users to add custom routing rules in source integrations. |
I would decouple this completely from routing and labels. This is something we should have in general. If data is shipped to a dataset and we have an existing integration for it, we should recommend it to be installed. |
That's a good question. While not explicitly mentioned anywhere, I think this could be one of the workflows started from the log explorer. When the user has selected the "unrefined" k8s logs data stream we could offer a starting point to jump into the workflow that installs a specialized integration (e.g. "nginx"). From then on the nginx docs would not show up in the k8s logs anymore but new nginx-related data streams would show up in the data stream selector. I wonder if we would want to start off the workflow in a generic manner or if there would be an efficient way to query which integrations have routing rules for k8s. Then we could directly offer a selection of routing-aware integrations at the start of the workflow. |
I keep stumbling over the
I consider the |
Integrations should also be able to route to But you bring up a good point. I don't think it's currently possible to route to
The
I'm going back-and-forth on that. I think it makes sense as a convention for datasets that aren't expected to contain any data and just do routing. However, some datasets, such as
I think we should not have them both appear in the same array. I'd even split these into different files: one file that has all the routing rules that go into the routing pipeline of the current dataset, and another file that lets you add routing rules to other datasets. To me, that's the main distinction between the two different use cases, rather than dynamic vs static rules: routing rules for the current dataset, and rules that are injected into other datasets.

For rules that are injected into other datasets, we'll need to add a priority concept so that they're sorted accordingly. The ordering also necessitates having an identity for routing rules. The injected rules should also always go before any routing rules that the source dataset has defined itself. We probably also want to have a dedicated pipeline for the injected routing rules.

To keep things simple for now, I think we should focus on the routing rules that are just added to the same dataset and not spend too much time on implementing the rule injection.
👍
Perfect. Users will just need to be aware that documents that fail to route will remain in the "sink" dataset. No real concerns here on my end.
The dataset value should be guaranteed unique by package validation, e.g.
Good catch. I wasn't sure on how we'd want to guarantee order here. Should the order be based on the order of processors as they appear in the YAML, with conditionless processors pushed to the end of the list? Part of me just wants to honor the order as they appear in the integration, but again it's more burden on the maintainers to understand the implementation details of reroute processors.
I'm not 100% sure about aligning package spec fields exactly with Elasticsearch APIs, fields, etc. It's not something we've been consistent about, but maybe that should change here. I do like the example of splitting these rules into different arrays rather than trying to reason about a mixture of use cases in a single list. Then, like you mentioned, we don't have to introduce new names for the existing concept of routing rules.
Integration assets generated by EPM are prefixed in most cases with the integration name. Would this mean Fleet needs to create an index template with a different naming scheme? |
I don't think I completely follow this. Could you provide an example of what this routing rule setup would look like or a use case?
The plain
Hmm, it actually might make sense that we need to support a dataset that's only the integration name if we have routing rules with dynamic dataset values. |
Yeah I'm +1 on splitting these rules into two distinct lists.
Fair enough. Using Ruflin's example above, we'd focus first on supporting routing rules local to the source dataset. |
This is possible today, you have to set the `dataset` field in the data stream's manifest. |
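On the Elasticsearch side, a reroute processor can likewise target a dataset with no dot in it; a minimal sketch (condition and values illustrative):

```json
{
  "reroute": {
    "if": "ctx?.tags?.contains('nginx')",
    "dataset": "nginx"
  }
}
```

With the default namespace, a document matching this rule would be written to `logs-nginx-default`.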
I took a pass at updating elastic/package-spec#514 based on the conversation above and a quick offline chat I had with @ruflin. I think the key part of this is the example manifest:

```yaml
# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs
# This is a catch-all "sink" data stream that routes documents to
# other datasets based on conditions or variables
dataset: nginx
# Ensures agents have permissions to write data to `logs-nginx.*-*`
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:
  # Route error logs to `nginx.error` when they're sourced from an error logfile
  - dataset: nginx.error
    if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
  # Route access logs to `nginx.access` when they're sourced from an access logfile
  - dataset: nginx.access
    if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
    namespace:
      - "{{labels.data_stream.namespace}}"
      - default
injected_routing_rules:
  # Route K8s container logs to this catch-all dataset for further routing
  k8s.router:
    - dataset: nginx # Note: this _always_ has to be the current dataset - maybe we can infer this?
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
  # Route syslog entries tagged with nginx to this catch-all dataset
  syslog:
    - dataset: nginx
      if: "ctx?.tags?.contains('nginx')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
```
|
I hope that's not what it means. I was thinking that we'd just rely on the default index template for `logs-*-*`.

Maybe that's ok. If it's not, we'll need to think about how we could prefix the dataset with the integration name, or how to add features to ES that would allow us to rely on the built-in index template.

Let's take the following reroute processor as an example:

```yaml
- reroute:
    dataset: "{{service.name}}"
```

The resulting data stream would look like `logs-<service.name>-<namespace>`. The reroute processor doesn't support something like this:

```yaml
- reroute:
    dataset: "foo.{{service.name}}"
```

The example of prefixing a dynamic dataset with the integration name therefore wouldn't currently work. |
I think for the |
I like how the two routing rule concepts have been narrowed down. But I wonder if there even is a need for the local routing rules. |
What I haven't been able to find in the description so far is whether the installation of the routing rules always happens or if the user gets a choice of which of the available routing rules they want to inject into the "source integration". If we didn't make installing the rules opt-in, the user couldn't easily install the k8s integration in parallel to the nginx integration without them influencing each other. Wouldn't that be a valid use-case too? |
For the scenario in that particular example I think you're right. But it's needed for use cases like these:

```yaml
type: logs
dataset: k8s
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:
  - dataset: "{{kubernetes.container.name}}"
```

Hmm, good point about making routing rule injection opt-in. I guess that's another reason why we'd want to have both ways, injected and local routing rules, as we can rely on local rules to always be installed. So while the injected rules may be opt-in, the local rules would always apply. |
The reason I like it in the manifest is because routing rules in the ingest pipeline are more of an implementation detail, and I would prefer that package devs do not have to think through where to put the rules in ingest pipelines. Having it separate will also allow us to "manage" these rules and show them to our users without having to read ingest pipelines.
This is a more generic feature I would like to see in the package manager: Users have an option to remove some of the assets / not install them. Like for example dashboards that are not needed or routing rules. And if needed later, it can be added. |
Isn't it only needed because there is no
I agree, and I'm not making an argument for adding it to the ingest pipeline directly. I was suggesting that we might get by with just the "injected" rules if we define them in the manifest of the "leaf" data streams instead of the package. The downside would be that we'd need to add some datasets only for the purpose of routing, but on the upside we'd only have a single way to write rules. |
The idea is that all the definitions are in the dataset / data stream. My understanding is we need both as otherwise we have no way for community packages to extend the routing rules. |
I've taken the example manifest provided above and converted it to use inject-only routing rules, e.g.:

```yaml
# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs
# This is a catch-all "sink" data stream that routes documents to
# other datasets based on conditions or variables
dataset: nginx
# Ensures agents have permissions to write data to `logs-nginx.*-*`
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
injected_routing_rules:
  # Route K8s container logs to this catch-all dataset for further routing
  k8s.router:
    - dataset: nginx # Note: this _always_ has to be the current dataset - maybe we can infer this?
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
  # Route syslog entries tagged with nginx to this catch-all dataset
  syslog:
    - dataset: nginx
      if: "ctx?.tags?.contains('nginx')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
---
# nginx/data_stream/access.yml
title: Nginx access logs
type: logs
injected_routing_rules:
  # Inject a routing rule into the nginx dataset's pipeline that routes access
  # logs to this data stream instead based on a document's file path
  nginx:
    - dataset: nginx.access
      if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
---
# nginx/data_stream/error.yml
title: Nginx error logs
type: logs
injected_routing_rules:
  # Inject a routing rule into the nginx dataset's pipeline that routes error
  # logs to this data stream instead based on a document's file path
  nginx:
    - dataset: nginx.error
      if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
```

I think the idea is intriguing, but to me, this seems harder to reason about. Having only a single means of defining routing rules is arguably less complex, but rules now have to be spread across multiple files because they're defined only at the "leaf" level as @weltenwort described above.

Also, if we coalesced on this "inject only" idea, we'd probably just call this field "routing rules" and keep the key/value structure here. This would allow us to "inject" routing rules into the current dataset as well as external ones, e.g.

```yaml
# nginx/data_stream/nginx/manifest.yml
dataset: nginx
routing_rules:
  # Add routing rules to _this_ dataset
  nginx:
    ...
  # Add routing rules to _other_ datasets
  k8s.router:
    ...
  syslog:
    ...
```

Curious if others think the same or if this seems reasonable - you folks contribute more directly to integrations than I do, as I'm more on the spec/tooling side of things generally. |
I've moved the "User defined custom datasets" part of the description here entirely to #155911 where custom integrations at large will be tracked by the ingest team. |
I like the idea of having only 1 block for routing rules as you describe above. The part I keep stumbling over is having to repeat the current dataset as a key for the "local" rules. Is this the most common case or the less common case? We don't know yet. The main thing I'm worried about is errors: the user renames the dataset but not the key in the routing rules, and the local rules silently stop applying. Could we use a magic keyword that always refers to the current dataset instead? It would require a bit more handling on the Fleet side but would make it explicit that these are the "local" rules. I would assume Fleet needs to handle these rules a bit differently anyway, as these have lower priority than the injected rules. |
I think this could be an elegant solution, but I don't know about optimizing for this specific change. Is implementing this magic keyword worth the effort? There's also a possibility that changing this dataset value breaks injected routing rules from other integrations the maintainer doesn't even know about, for example rules that other packages inject under the old dataset name.
I'd rather have no magic keyword than a magic keyword that only fixes some human errors. Changing the dataset for a given data stream should require thought and care. I've taken a pass at rewriting the example manifest above to use a single `routing_rules` block:

```yaml
# nginx/data_stream/nginx/manifest.yml
title: Nginx logs
type: logs
# This is a catch-all "sink" data stream that routes documents to
# other datasets based on conditions or variables
dataset: nginx
# Ensures agents have permissions to write data to `logs-nginx.*-*`
elasticsearch.dynamic_dataset: true
elasticsearch.dynamic_namespace: true
routing_rules:
  # "Local" routing rules are included under this current dataset, not a special case
  nginx:
    # Route error logs to `nginx.error` when they're sourced from an error logfile
    - dataset: nginx.error
      if: "ctx?.file?.path?.contains('/var/log/nginx/error')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
    # Route access logs to `nginx.access` when they're sourced from an access logfile
    - dataset: nginx.access
      if: "ctx?.file?.path?.contains('/var/log/nginx/access')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
  # Route K8s container logs to this catch-all dataset for further routing
  k8s.router:
    - dataset: nginx
      if: "ctx?.container?.image?.name == 'nginx'"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
  # Route syslog entries tagged with nginx to this catch-all dataset
  syslog:
    - dataset: nginx
      if: "ctx?.tags?.contains('nginx')"
      namespace:
        - "{{labels.data_stream.namespace}}"
        - default
```
|
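To connect this back to what would land in Elasticsearch: under a manifest like the one above, the two "local" nginx rules would presumably be compiled into the nginx routing pipeline as reroute processors, roughly like this (my sketch, not a confirmed Fleet output):

```json
{
  "processors": [
    {
      "reroute": {
        "tag": "nginx.error",
        "if": "ctx?.file?.path?.contains('/var/log/nginx/error')",
        "dataset": "nginx.error"
      }
    },
    {
      "reroute": {
        "tag": "nginx.access",
        "if": "ctx?.file?.path?.contains('/var/log/nginx/access')",
        "dataset": "nginx.access"
      }
    }
  ]
}
```

The `k8s.router` and `syslog` entries would instead be injected into those datasets' pipelines, which is the part deferred below.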
I like what you have above. With this, we can also improve it in iterations and add these kinds of features if needed. Same for the permissions: if you do routing, you very likely need more permissions, but for now we can leave this to the dev. |
LGTM. I like the simplicity. But we shouldn't forget about the rule priority:
We could either introduce a parameter that determines the order or implicitly assign the lowest priority to local rules. |
I think we can probably avoid these errors by validating that the dataset exists in any integration. Maybe we can't if we want to support external integrations that aren't inside the elastic/integrations repo. |
Are you referring to "exists" in an installed integration or in general? For example: the nginx package contains a routing rule for k8s, but the k8s integration is not installed, so the k8s routing dataset does not exist. In this scenario, the user should still be able to install the nginx integration, but the k8s routing rule should not be added. Only when the k8s integration is installed should it be added. This goes back to #151898 (comment) |
I meant we could do that validation at development time (not at package installation time) with a tool like elastic-package. |
I think we're talking about 2 different things here: I want to validate that the dataset is set up as part of an integration. I think you're talking about whether it exists in general in any integration. This is not possible, as elastic-package can't be aware of all integrations. |
We had some more discussions on the topic around routing rules. I'm proposing for now to not focus on the injected rules but only the "local" rules. To be specific, a dataset always loads its own rules, and users can eventually manually add rules to it (not through packages). For now, nginx can't add a rule to k8s. The goal is that this simplifies the implementation; we can always add the feature later. On the YAML side, I would keep the proposed structure to make sure the door stays open, but it means we should do some validation that the names are equal.

There is a second topic that got raised: let's assume a user has multiple k8s clusters, each shipping data to its own namespace. Different routing rules might be specified per cluster / namespace. The same can apply to the syslog dataset: there might be multiple syslog endpoints receiving very different data. Being able to define rules per namespace could be very useful in this context. The assumption is that these would all be rules specified by the user, not loaded by a package. Long term, it would be nice if the user could also export these manual rules in a package. |
Thanks for the summary of the discussion. I 100% agree with the focus on local rules first, followed by user-defined rules, then we'll figure out how we want the cross-package rules to work. The first two are much simpler cases and unlock a lot of functionality. For the namespace-specific rules, it seems we'll finally need to go implement #121118 |
Thanks for the summary 🙏 - I think we've landed on the right path forward here.
+1 on maintaining the structure defined above for future-proofing here. I'll make this clear in the implementation issues. |
I've updated the following issues based on our conversation above |
Even though injected rules are not supported yet, I think we can close this issue as completed. I think we'll want to hold off on the implementation of injected rules until we get more feedback and collect more real-world use cases, or have a concrete plan for how exactly we want to use injected rules in our integrations. Thanks everyone for your work on this 🙏 |
Thanks @felixbarny, I will postpone the work on injected rules for now then. |
We are moving forward with a solution for "document-based routing" based on a new ingest pipeline processor called the `reroute` processor. Fleet will be responsible for managing routing rules defined by packages and the end user, and for updating ingest pipelines to include the new `reroute` processor to apply the rules.

**Links**

**Overview diagram**

**Integration-defined routing rules**

Integrations will be able to define routing rules for how data from other integrations or data streams should be routed to their own data stream. For example, the `nginx` package may define a routing rule for the `logs-kubernetes.container_logs` data stream to route logs to `logs-nginx.access` whenever `container.image.name == "nginx"`. Similarly, when the `kubernetes` package is installed and the `nginx` package was also previously installed, we'll need to ensure the `logs-kubernetes.router-{version}` ingest pipeline includes a `reroute` processor for each routing rule defined on the `nginx` integration.

To support this, we'll need to add a concept of routing rules to the package spec and add support for them in Fleet.
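As a sketch of the end state described above (assuming both the nginx and apache examples from this thread are installed), the `logs-kubernetes.router-{version}` pipeline could contain one reroute processor per registered rule; the exact shape is illustrative:

```json
{
  "processors": [
    {
      "reroute": {
        "tag": "nginx",
        "if": "ctx?.container?.image?.name == 'nginx'",
        "dataset": "nginx.access"
      }
    },
    {
      "reroute": {
        "tag": "apache",
        "if": "ctx?.container?.image?.name == 'apache'",
        "dataset": "apache.access"
      }
    }
  ]
}
```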
**Supporting tasks**
We'll need to do a few things in support of these changes as well, namely around API key permissions.
cc @ruflin @felixbarny