-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ruler: Added support for strict rule groups that does not allow partial_response #970
Conversation
159d0a8
to
51f7206
Compare
cmd/thanos/rule.go
Outdated
}) | ||
} | ||
|
||
var refOpts *rules.ManagerOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this refOpts part will be improved. Essentially I will use prometheus.WrapRegistererWithPrefix
or label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nice idea! Everything LGTM except for minor things
cmd/thanos/rule.go
Outdated
|
||
strictMgr = rules.NewManager(&opts) | ||
g.Add(func() error { | ||
strictMgr.Run() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
@povilasv Before merging, I plan to change partial_response from bool flag to |
51f7206
to
bdd80ff
Compare
bdd80ff
to
fa76c55
Compare
fa76c55
to
3107832
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, this is really needed feature. Thanks for doing this! 🎉
Generally ok from me just some minor comments/questions
/// | ||
/// This is especially useful for any rule/alert evaluations on top of StoreAPI which usually does not tolerate partial | ||
/// errors. | ||
ABORT = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any other strategies planned? Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet but it leaves spaces for other options at some point - extensibility. And is more verbose then just DISABLED (:
pkg/rule/rule.go
Outdated
return nil | ||
} | ||
|
||
// Update updates files into two different rule managers strict and non strict based on special field in RuleGroup file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: actually they don't have to be just two since it's done generally for the strategy type enum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
obsolete comment, good point, thx
cmd/thanos/rule.go
Outdated
ctx = tracing.ContextWithTracer(ctx, tracer) | ||
|
||
opts := opts | ||
opts.Registerer = extprom.WrapRegistererWith(prometheus.Labels{"type": "non-strict"}, reg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe would be more consistent to have the labels consistent with the strategies so {"type": "warn"}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point, forgot to update this, awesome
docs/components/rule.md
Outdated
|
||
The data of each rule node can be labeled to satisfy the clusters labeling scheme. High-availability pairs can be run in parallel and should be distinguished by the designated replica label, just like regular Prometheus servers. | ||
You can think of Rule as a simplified Prometheus that does not require a sidecar and does not collect and do PromQL evaluation (no QueryAPI). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be more explicit
You can think of Rule as a simplified Prometheus that does not require a sidecar and does not collect and do PromQL evaluation (no QueryAPI). | |
You can think of Rule as a simplified Prometheus that does not require a sidecar and does not scrape and do PromQL evaluation (no QueryAPI). |
docs/components/rule.md
Outdated
unavailability is the key. | ||
|
||
Note that Ruler will never trigger alert on it's own, so *no alert will be triggered if the query fails* without extra (mandatory) | ||
meta monitoring. This is exactly the same Alertmanager project. Even though there might be internal error Alertmanager will only produce metric, will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you are trying to say with this sentence, but maybe it's just me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well maybe it's too far, you are right. removed
docs/components/rule.md
Outdated
// TODO(bwplotka) TO BE EXTENDED. | ||
|
||
* `thanos_alert_sender_alerts_dropped_total` to check if our connection to alertmanager works. | ||
* `prometheus_rule_group_last_duration_seconds < prometheus_rule_group_interval_seconds` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add prometheus_rule_evaluation_failures_total
right away that's pretty essential alert also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to work on this and ensure that ruler has all of this
docs/components/rule.md
Outdated
|
||
In case of Ruler in HA you need to make sure you have following labelling setup: | ||
|
||
* Labels that identifies the HA group ruler is in and replica label with different value for each ruler instance, e.g: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Labels that identifies the HA group ruler is in and replica label with different value for each ruler instance, e.g: | |
* Labels that identifies the HA group ruler and replica label with different value for each ruler instance, e.g: |
And from me it's +1 for disabling the partial response by default since it is feature of thanos and it could cause issues after migration from proetheus for users that doesn't know about this. |
3107832
to
ec12751
Compare
I would leave this decision for later AND with wider discussion - it's not that straightforward |
7c31aba
to
affe7ed
Compare
d88a0d0
to
dad1aa9
Compare
Nice moment for ruler docs: https://twitter.com/t_colgate/status/1110874738866438149?s=19 |
@tcolgate if you want to review / give us feedback if that is clear (the docs part) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor nits, but I really like this. Good work on the docs 🥇
docs/components/rule.md
Outdated
|
||
_**NOTE:** The rule component is experimental since it has conceptual tradeoffs that might not be favorable for most use cases. It is recommended to keep deploying rules to the relevant Prometheus servers._ | ||
_**NOTE:** It is recommended to keep deploying rules inside the relevant Prometheus servers locally. Use ruler only on specific cases. Read details[below](rule.md#Risk) why._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
to evaluate the query, that is not fast enough to fill the rule. This might indicate other problems like slow StoreAPis or | ||
too complex query expression in rule. | ||
|
||
* `thanos_rule_evaluation_with_warnings_total`. If you choose to use Rules and Alerts with [partial response strategy](rule.md#PartialResponse) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice I will add these metrics to our grafana dashboard :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
see WARN log level. This might suggest that those evaluations returns partial response and might be or not accurate. | ||
|
||
Those metrics are important for vanilla Prometheus as well, but even more important when we rely on (sometimes WAN) network. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe here we should also add about the new timeout in store.response-timeout
, i.e. if you have a slow Store API it will time them out and you will see partial response or log errors in the abort case?
pkg/promclient/promclient.go
Outdated
// QueryInstant performs instant query and returns results in model.Vector type. | ||
func QueryInstant(ctx context.Context, logger log.Logger, base *url.URL, query string, t time.Time, dedup bool) (model.Vector, error) { | ||
func QueryInstant(ctx context.Context, logger log.Logger, base *url.URL, query string, t time.Time, extra ThanosQueryParams) (model.Vector, []string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe t time.Time
and query
should go into ThanosQueryParams
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
disagree as this function does not make any sense without that -> I think it makes sense to separate custom "Thanos" params, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe ThanosQueryParams
should be QueryOptions
and then it's all good?
pkg/rule/api/v1.go
Outdated
@@ -5,32 +5,36 @@ import ( | |||
"net/http" | |||
"time" | |||
|
|||
thanosrule "github.com/improbable-eng/thanos/pkg/rule" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
goimports please :)
Hi All,
It also feels like the advice is overly cautious.
It feels like it might be better to suggest that most rules are safe to deploy to the ruler. If you have rules and alerts that are absolutely critical, and you know they fit with the historical retention of that instance, then those rules can/should be deployed directly to prometheus. |
Thanks for you feedback it is super useful @tcolgate! BTW can you jump on our slack? (link in README). Would be easier to discuss!
I disagree with this and the key here is just a good deployment model for your rules into various pieces that can be easily solved outside of Thanos (config managament).
So how many layers of nested rules you do that it's hard to tell? I think doing more than 2 is anti pattern. This is because everything is tightly coupled and dependent on each other. And even if you have 100 nested rules on top of each other (100 layers) you HAVE to put them in the same rule group otherwise you will get inconisistent results (group can run concurrently, but all within single group is sequential). BTW I think this is more question to Anyway if they are on the same group it easy to tell what retention is needed. AND rules aggregating more than 5m is not that often spotted and more than 24h are super rare from my expierience. However I cannot see your user's rules so it might be that you have differen use case. Putting most rules locally is the RECOMMENDATION, not a blocker. However from what I can see you really want to use Thanos Ruler MOSTLY to solve lack of proper configuration management. If you miss that part, speak to us /
TBH It's not that bad as long as you use Golang and existing libraries for parsing PromQL, but sure depends how much time you want to spend on it (:
I don't get that part, what do you mean
Agree 👍
It depends but I tend to disagree. Evaluation in Prom has been there for 5+ years vs 1 year Thanos project WITH distributed system problems. I trust my code, but still I think the former is less risky (e.g I don't trust network). ^^
Agree, but what should be the flow for Ruler? Should it retry? should it wait for another eval interval (it does the latter)
Can agree with this, but I would rather say it's the distributed model vs local Prometheus. Think about this as advise for typical users, sometimes not really familiar with high scale problems yet. Again I would treat is as a good advice, but totally ok to ignore if you know what you are doing. @tcolgate Can you help us rephrase this so it will sound better?
Sure, I think the key part is to understand the risk which I described. Do you think the risk part is clear enough? |
The difficulty here is that my users are the ones that define the groups. So they have to understand the reasoning for the grouping. Rules which would need to be separated here are rare enough that such knowledge is likely to be lost within the lore of the wider user base.
We have some pretty deep nestings, these are now properly split into groups. Even on prom1 we never had evalls over run so whilst it took a while for deep rules to converge to the right value, concurrency tended not to be a problem.
We have quite a few that are looking for 1 day (usually users looking to track increases over a 24h period), lots of >1h. We have a few SLO related metrics that look for 30d! The difficulty is that we intentionally avoid "baby sitting" users writing queries and alerts, so, as far as is reasonable, stuff should Just Work.
We already embed large chunks of prom for verify rules, so this has occurred to me, but it takes more effort than I wanted to get into right away (e.g. it only just occurred to me that offsets in queries will also need to be taken account of).
Unless I'm misunderstanding, ruler will spread the query load out over thanos queriers? That seems preferable to leaving it all up to the scraping prom instance.
Maybe I trust your code more that you do :) Failing that, I'm getting enough pain from prom instability to give it a shot (not that prom is unstable, but it is when running large memory hungry instances in kube, as I said, I'm confident I can shard thanos).
I think my preference would be to remove the NOTE, and have a more verbose paragraph elaborating the tradeoffs. I'll have a think. I think overall my objection to the current phrasing is that for many use cases ruler (and even thanos), exist precisely to resolve a config management situation, the config management issue of running loads of standalone proms with separate config. (I'm already running 9, I'm using thanos to simplify things and make them more stable).
Possibly |
Sure but most of data (initial hours) are on Prom so you would use Prom to fetch those anyway.
Let's think about this more, this PR is extremely huge - sorry for that - that is antipattern but it's rather the code I want to improve then docs, so I would ship this and iterate on live docs to improve them. I am not sure about the config managment aspects. If we will move border and agree on that we would endup getting (valid) request to implement config distribution to all componetns, rotating, hot realoading, compressing, signing XD everything. We need to put a border somewhere, maybe there is a place for another tool to fix confg management. It might just a side effect of Thanos that it helps to organize your config better. cc @GiedriusS @povilasv @domgreen on that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't use ruler myself so it's a bit harder for me but at a first glance the code looks clean and the comments by @povilasv seem on point! 👍 I was most worried backward compatibility and the code currently maintains that. Will need to take a deeper look again before merging it
I personally agree with @bwplotka on this. But as mentioned above I wouldn't block this PR with this discussion. Not saying we shouldn't discuss it further but in separate PR/thread/channel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 I really like the feature and the work on the docs! Good job.
ed6c039
to
42b6d6b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Team for awesome feedback!
CC @FUSAKLA @GiedriusS @povilasv addressed all. Added more info about store.response-timeouts
and PartialResponseStrategy being a tradeoff between accuracy(consistency) and availability.
PTAL (:
} | ||
|
||
// NOTE: For Rule default is abort as this is recommended for alerting. | ||
p = storepb.PartialResponseStrategy_value[storepb.PartialResponseStrategy_ABORT.String()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GiedriusS in terms of backward compatibility we break it here. By default you will have abort
strategy. (it was "enabled"/"warning" previously only). Added item to CHANGELOG as well.
42b6d6b
to
9adcef1
Compare
609d78d
to
2d67fa8
Compare
b := &bytes.Buffer{} | ||
if resp.Header.Get("Content-Encoding") != "gzip" { | ||
_, err = io.Copy(b, resp.Body) | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need to close resp.Body
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
funny fact, I copied it (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from Prometheus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
433 line is already defering close
return err | ||
} | ||
_, err = io.Copy(b, gzipr) | ||
_ = gzipr.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also don't we need to close resp.Body
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, copied, but maybe badly, relooking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
433 line is already defering close
…al_response Signed-off-by: Bartek Plotka <[email protected]>
2d67fa8
to
05f8e93
Compare
Addressed all @povilasv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -227,13 +228,23 @@ func runRule( | |||
Name: "thanos_rule_loaded_rules", | |||
Help: "Loaded rules partitioned by file and group", | |||
}, | |||
[]string{"file", "group"}, | |||
[]string{"part_resp_strategy", "file", "group"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not consistent ):
Changes:
CC @povilasv
CC @mjd95 @improbable-ludwik
Signed-off-by: Bartek Plotka [email protected]