fix(discovery): configure sharding every time MetricsHandler.Run runs #2478
Welcome @wallee94! |
Signed-off-by: Walther Lee <[email protected]>
102f391
to
dec7a1b
Compare
/assign @CatherineF-dev |
Hello, this code has been around for five years. Why is it only now experiencing this issue? Could there be another underlying problem? If we can figure out how to Run only once inside https://github.com/kubernetes/kube-state-metrics/pull/1851/files, that would be the fix.
|
Sorry, I mentioned it in a thread in the kube-state-metrics Slack channel and forgot to put it here. The bug isn't exactly in metrics_handler if Removing the validation in |
Thanks for spotting this! I am wondering whether we should run it only once. |
That's a good point, I can look into that. I think On the other hand, if |
I've made some changes to use |
Signed-off-by: Walther Lee <[email protected]>
@CatherineF-dev I added new changes to run I deployed the change to a few clusters and it seems to be working. This is the watch rate after adding a new CRD to the cluster: I see the event, then a brief drop, which is when ksm is populating the cache after the reconfigure, and then it comes back to normal. All the metrics look good as well. |
Signed-off-by: Walther Lee <[email protected]>
0616572
to
4aced25
Compare
Signed-off-by: Walther Lee <[email protected]>
Signed-off-by: Walther Lee <[email protected]>
Signed-off-by: Walther Lee <[email protected]>
A summary of changes per file to help with the review:
|
Small typo in the comments, otherwise looks good to me. Thanks for the contribution!
/lgtm and /hold for others to review.
Is there an ETA for when this will get merged? |
/lgtm I'll still ping the other maintainers to review, currently everyone seems to be busy. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mrueg, wallee94 The full list of commands accepted by this bot can be found here. The pull request process is described here
/hold cancel continuing here as no other reviews came in. Thanks for the debugging and your contribution! |
/hold cancel |
What this PR does / why we need it:
I see the following issue in a ksm deployment with custom CRD config enabled. Whenever a new CRD event is added to the cache, ksm stops watching and metrics stop reflecting new changes.
In the graph, ksm recovers after updating the Statefulset with any change. The metrics are `rate(kube_state_metrics_watch_total)` and the counter `kube_state_metrics_custom_resource_state_add_events_total` without rate.
Looking into the code, the problem seems to be the validation `shardingUnchanged` in `AddFunc` (here).
Without a CRD config, `MetricsHandler.Run` runs only once, and the vars `m.curShard` and `m.curTotalShards` are initially `nil`, which makes `shardingUnchanged = false` (here).
However, if a CRD config is present, discovery runs `MetricsHandler.Run` every time a CRD event is detected (here). If the Statefulset number of replicas/shards didn't change, the new CRD event will cancel the old metrics handler, but won't initiate a new one because `shardingUnchanged = true` in `AddFunc`.
This change removes the check `shardingUnchanged` in the `AddFunc` event handler. I don't think it's necessary because, in most cases, it's only called when the informer is synced at the end of `MetricsHandler.Run`.
This change updates `CRDiscoverer.PollForCacheUpdates` to rebuild the metrics writers in the already running metrics handler, instead of running a new one every time a CRD event occurs.
How does this change affect the cardinality of KSM:
No change in cardinality.
Which issue(s) this PR fixes:
Fixes #2372
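For reviewers, the failure mode described under "What this PR does" can be reproduced in miniature. The sketch below is not the kube-state-metrics code: the `handler` type, its `watching` flag, and the `run`, `runFixed`, `onAdd`, and `configure` methods are hypothetical stand-ins. Only the names `curShard`, `curTotalShards`, and `shardingUnchanged` come from the description, and `runFixed` assumes the fix amounts to configuring sharding on every run, as the PR title says.

```go
package main

import "fmt"

// handler is a hypothetical stand-in for the metrics handler. Only the
// fields named in the PR description are modelled; "watching" is an
// assumption used to represent "watchers are currently active".
type handler struct {
	curShard       *int32
	curTotalShards *int32
	watching       bool
}

// configure records the sharding and (re)starts the watchers.
func (h *handler) configure(shard, total int32) {
	h.curShard, h.curTotalShards = &shard, &total
	h.watching = true
}

// onAdd mimics the check described for AddFunc: it reconfigures only when
// the sharding differs from what was recorded last time.
func (h *handler) onAdd(shard, total int32) {
	shardingUnchanged := h.curShard != nil && h.curTotalShards != nil &&
		*h.curShard == shard && *h.curTotalShards == total
	if shardingUnchanged {
		return // nothing restarts the watchers on this path
	}
	h.configure(shard, total)
}

// run mimics a fresh Run call triggered by a CRD event: the old watchers
// are cancelled, then the informer delivers an add event.
func (h *handler) run(shard, total int32) {
	h.watching = false
	h.onAdd(shard, total)
}

// runFixed is the assumed shape of the fix: sharding is configured on
// every run, regardless of whether it changed.
func (h *handler) runFixed(shard, total int32) {
	h.watching = false
	h.configure(shard, total)
}

func main() {
	h := &handler{}

	h.run(0, 1) // first run: curShard is nil, so the handler is configured
	fmt.Println("after first run:", h.watching)

	h.run(0, 1) // CRD event with the same sharding: watchers stay cancelled
	fmt.Println("after CRD event:", h.watching)

	h.runFixed(0, 1) // unconditional configure brings the watchers back
	fmt.Println("with fix:", h.watching)
}
```

The second `run` call is the case from the report: the sharding is identical, so the check short-circuits after the old watchers were already cancelled, and nothing starts new ones.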