Adding a scrape-time recording rules component to Alloy #414
Comments
grafana/agent#393 prototyped the basis of something very similar to this, where you could apply recording rules against individual scrapes, but we never found the time to finish the prototype and add it officially. What kind of aggregation are we talking about here: within the scope of a single scrape, or across all samples sent to the proposed component?
I believe that our use case would be satisfied with aggregations across a single scrape. The output should be independent of what went through the component previously, as our goal is to basically construct a reduced set of metrics from the currently reported samples. Besides, I don't think that persistent storage for aggregating across all samples that passed through the component is something we would want, as it would increase the complexity of our supposedly thin storage.
I don't speak for all the project maintainers, but I'd personally be in favor of reviving grafana/agent#393 and having a new Flow component which performs recording rules / aggregations against a single scrape 👍 It wouldn't permit range queries (i.e., no rates or _over_times), but I think it's a great starting point for doing client-side aggregations and cutting costs.
Did this get any traction past the proposal state? My use case could really use this. Some of our tenants have very large metric sets, high in both cardinality and sample counts. They use recording rules in my team's Mimir instance to reduce that cardinality and sample size, but since the raw data set is so large, the recording rules will sometimes fail to finish processing before the next evaluation, resulting in gaps in the metric series that the recording rules produce. They already run in their own rule groups, so the Mimir rulers can't spread them out any further. It would be really nice if we could offload some of that aggregation upstream inside the Grafana Agent sender.
Hey, thanks for opening this proposal, I think it makes a lot of sense :) We have a new proposal review process. I'd be happy to move your proposal to the active stage if you could add a bit more detail (could you rework it to use our template?). What do you think @mikemykhaylov?
A very similar idea is being implemented in Prometheus. If it's accepted, we could integrate it into the scrape component of Alloy.
Sure, I'd be glad to rework it to the template! The Prometheus PR does look like something that could be used here; if that work wraps up I'll definitely take a crack at adding it to Alloy, in part as an excuse to get a refresher on Go 😅
I'm moving this proposal to "Active" state to discuss it in more depth with the team and reach a consensus.
If Prometheus adds it and we can use it, then that seems like a big win. Let's wait and see, then.
Just to follow up, do we have any updates on this? Adding a component that supports this would be super helpful for reducing metrics cardinality.
Hey, one concern about it was raised in the Prometheus PR: how should counter resets be handled? Unless there are blocking concerns with the proposal, I will move it to "Likely accepted" by the end of the week. Then we will wait one more week in case a major concern arises. If none does, we will move the proposal to "Accepted" and anyone will be free to implement it (please mention it in this issue before starting the work, to avoid having several people doing it in parallel).
The proposal is accepted :) As stated above, anyone is free to pick it up; just mention it here before starting the implementation to avoid duplicated work.
How should this work with clustering? For example: the metrics required to aggregate would be on different Alloy instances, since we shard the targets. How would we want to handle this?
Same as Prometheus handles it when sharding is activated.
Thinking on this more, this would not be a direct port/conversion of Prometheus but something a bit different, conceptually more akin to adaptive metrics. I think the point for @sjentzsch is that it only applies to a single target scrape; the actual proposal might be closer to something like adaptive metrics.
@sjentzsch What are you referring to when you say Prometheus sharding? I'm not sure which Prometheus feature you mean.
If we did the above on my example, let's say, to calculate the
We would get something like this:
Without any further handling, one of these writes would fail with a duplicate-sample / out-of-order error.
Proposal: Scrape-time recording rules component
Abstract
We add a new component to Grafana Alloy, along the lines of prometheus.recordrules, that allows us to specify recording rules to run at scrape time, in a similar way to how Prometheus evaluates recording rules over already-stored data.

Problem
In our K8s environments, one of the types of metrics we collect with Grafana Alloy is network flows between pods and outside IPs. Due to the clusters' size, as well as the explicit "source" and "destination" labels on the metrics, the cardinality is exploding, easily reaching above 10k metrics reported per node. We would like to conditionally reduce the cardinality of a couple of metrics, preserving specific labels and dropping others, before remote writing to Mimir.
Using prometheus.relabel to drop labels is not an option, because in our use case it would generate duplicate metrics (i.e. there are metrics that are identical in all labels except those that we would like to drop), and so Prometheus would drop them (link to post explaining this). What we need instead is, paraphrasing the article, to aggregate those duplicate series rather than simply dropping their labels. Given the multitude of labels, which would be tedious to enumerate, an obvious solution is to use recording rules. Namely, a rule like the following would work (followed by a relabel step dropping the original metrics).
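As an illustrative sketch (the metric name below is a hypothetical placeholder; "source" and "destination" are the labels mentioned above), the aggregation could be written as a standard Prometheus recording rule:

```yaml
groups:
  - name: network_flow_aggregation
    rules:
      # Hypothetical metric name, shown for illustration only: sum away the
      # high-cardinality "source" and "destination" labels.
      - record: namespace:network_flows_bytes:sum
        expr: sum without (source, destination) (network_flows_bytes_total)
```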
However, if we set up this rule in Mimir, it doesn't really help, as the metrics are already stored. If, instead, the rules ran in Alloy, only Alloy would have to deal with the full-cardinality metrics, while Mimir would only see the reduced-cardinality ones.
Proposal
A new component called prometheus.recordrules is introduced. It allows the user to configure a set of recording rules that will run at scrape time. The rules' syntax will be a subset of PromQL, disallowing queries over a range of time. All queries are evaluated against only the metrics collected in the current scrape iteration, and previously collected metrics are not stored.
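To make the idea more concrete, here is a rough sketch of how such a component might be wired into a pipeline. The prometheus.recordrules component, its rule block, and its arguments are hypothetical (nothing like it exists in Alloy today); the surrounding discovery.kubernetes, prometheus.scrape, and prometheus.remote_write components are existing ones, and the Mimir URL is a placeholder.

```alloy
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.recordrules.reduce.receiver]
}

// Hypothetical component: evaluates the configured rules against the
// samples of each individual scrape and forwards the resulting series.
prometheus.recordrules "reduce" {
  rule {
    record = "namespace:network_flows_bytes:sum"
    expr   = "sum without (source, destination) (network_flows_bytes_total)"
  }

  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://mimir.example.com/api/v1/push"
  }
}
```

Whether the component forwards only the rule outputs or also passes the original series through (leaving the dropping to a later prometheus.relabel) would be a design decision for the implementation.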
Pros

Cons
Compatibility
Adding a new component is backwards compatible.
Implementation
We could have a thin storage.Appendable and storage.Queryable implementation which is exported as a receiver and appended to; we would then convert our rule configuration into a Prometheus-native RuleGroup, evaluate it, and register the resulting metrics with the next component's receiver. A related PR implementing scrape-time recording rules has been opened in the Prometheus repo (prometheus/prometheus#10529); if that is merged, we could leverage that functionality in Alloy.
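To illustrate the per-scrape evaluation idea in isolation (not the actual Alloy wiring), below is a small, dependency-free Go sketch. It stands in for what a real component would do through storage.Appendable/storage.Queryable and the Prometheus rules packages: buffer the samples of a single scrape, evaluate a sum-without-style aggregation over just that buffer, and hand the results on. All type and function names here are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is one scraped sample: a label set plus a value.
type sample struct {
	labels map[string]string
	value  float64
}

// sumWithout aggregates the samples of a single scrape, summing values across
// series that become identical once the given labels are dropped. This loosely
// mirrors a `sum without (...)` rule evaluated over one scrape only.
func sumWithout(samples []sample, drop ...string) []sample {
	dropped := make(map[string]bool, len(drop))
	for _, l := range drop {
		dropped[l] = true
	}

	sums := map[string]float64{}
	kept := map[string]map[string]string{}
	for _, s := range samples {
		// Copy the label set without the dropped labels.
		reduced := map[string]string{}
		for k, v := range s.labels {
			if !dropped[k] {
				reduced[k] = v
			}
		}
		key := labelKey(reduced)
		sums[key] += s.value
		kept[key] = reduced
	}

	out := make([]sample, 0, len(sums))
	for key, v := range sums {
		out = append(out, sample{labels: kept[key], value: v})
	}
	return out
}

// labelKey builds a stable string key from a label set.
func labelKey(ls map[string]string) string {
	keys := make([]string, 0, len(ls))
	for k := range ls {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, k+"="+ls[k])
	}
	return strings.Join(parts, ",")
}

func main() {
	// Two series from one scrape that differ only in the label we drop.
	scrape := []sample{
		{labels: map[string]string{"__name__": "network_flows_bytes_total", "namespace": "payments", "source": "10.0.0.1"}, value: 5},
		{labels: map[string]string{"__name__": "network_flows_bytes_total", "namespace": "payments", "source": "10.0.0.2"}, value: 7},
	}
	// Prints a single aggregated series with value 12.
	for _, s := range sumWithout(scrape, "source") {
		fmt.Println(s.labels, s.value)
	}
}
```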