Cleanup prometheus metrics after a reload #2726
Conversation
@discordianfish ping. This PR uses the approach you suggested.
Force-pushed 873140b to 29b6374
@@ -200,6 +200,11 @@ func (n *NGINXController) syncIngress(interface{}) error {
		}(isFirstSync)
	}

	re := getRemovedEndpoints(n.runningConfig, &pcfg)
	if len(re) > 0 {
		go n.metricCollector.RemoveMetrics(re)
It’s important to have a test that ensures this operation is atomic. Because the goroutine is left unmonitored it could potentially run concurrently (eg. the sync fails and gets retried while metrics are still being removed).
sync fails and gets retried while metrics are still being removed
At this point, the sync has already happened.
Also, syncIngress is not executed concurrently (and is also rate limited).
if count != 1 {
	t.Errorf("expected only one message from the UDP listern but %v returned", count)
if atomic.LoadUint64(&count) != 1 {
	t.Errorf("expected only one message from the UDP listern but %v returned", atomic.LoadUint64(&count))
(nit) if you don’t store the value you may print something which is different from what you compared.
(typo) "listener"
done
func (sc *SocketCollector) RemoveMetrics(endpoints []string) {
	mfs, err := prometheus.DefaultGatherer.Gather()
	if err != nil {
		glog.Errorf("error gathering metrics:", err)
(nit) It's clearer to start log messages with a capital letter to differentiate them from returned errors.
done
	continue
}

glog.Infof("Removing prometheus metric from histogram %v for endpoint %v", metricName, endpoint)
Could be verbose on the default log level.
done
Force-pushed ec77003 to 9043fc5
This approach looks much better. It would be good to have tests, though; otherwise I wouldn't be confident this really keeps the metrics clean.
This approach looks much better. I still have the same synchronization concerns as in the other comments, but I don't know the code base well enough to say whether the entry and synchronization points are chosen appropriately. I also left a comment on the test.
Overall much more confident in this though.
@@ -30,6 +30,7 @@ func TestNewUDPLogListener(t *testing.T) {
	fn := func(message []byte) {
		t.Logf("message: %v", string(message))
		atomic.AddUint64(&count, 1)
		time.Sleep(time.Millisecond)
Why is this sleep necessary? I would suggest working with channels here instead if you are looking to test that this function is called exactly once. This seems like it will be flaky when done based on sleeping.
Today I will add tests for this. What's the recommended way to test the metrics?
For kube-state-metrics we extracted some test utilities at some point (from the Prometheus node_exporter repo, if I remember correctly). I think it's time to make them part of the client_golang project of Prometheus, but if you want to just use them you can copy them from here: https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/collectors/testutils/testutils.go

It's not super well documented, but I think looking at the tests it's relatively obvious how it works: https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/collectors/configmap_test.go

Should you have any questions, I'm happy to help with anything.
Force-pushed b865351 to 3d67aaa
Already there; you will see "" as the value of the label.
I'm also seeing wildly differing values for
I am checking this now
internal/ingress/types.go
Outdated
@@ -64,6 +64,9 @@ type Configuration struct {
	// +optional
	PassthroughBackends []*SSLPassthroughBackend `json:"passthroughBackends,omitempty"`

	// ConfigmapChecksum contains the particular checksum of a Configuration object
As discussed on Slack, I find this name confusing and would have used BackendConfigChecksum instead.
// expose health check endpoint (/healthz)
healthz.InstallHandler(mux,
	healthz.PingHealthz,
	ic,
)

mux.Handle("/metrics", promhttp.Handler())
mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
Handler() will instrument the handler too. If you want to keep this, use this instead:
mux.Handle(
"/metrics",
promhttp.InstrumentMetricHandler(
reg,
promhttp.HandlerFor(reg, promhttp.HandlerOpts{}),
),
)
@discordianfish InstrumentMetricHandler is not available in 0.8. Is it safe to use prometheus/client_golang master?
I just realized this too :)
I think you can just use InstrumentHandler() from prometheus (instead of promhttp). Using master should be fine too though, with the dependencies vendored anyway (it's what I did).
@@ -118,25 +118,20 @@ func main() {

	conf.Client = kubeClient

	ngx := controller.NewNGINXController(conf, fs)
	reg := prometheus.NewRegistry()
The default registry enables the GoCollector and ProcessCollector. I think we should keep these metrics by adding:

reg.MustRegister(prometheus.NewGoCollector())
reg.MustRegister(prometheus.NewProcessCollector(os.Getpid(), ""))
done
@jpds @Stono found the issue with the metric.

$ diff -u cfg-from-1 cfg-from-2
--- 1 2018-07-11 14:06:23.447305757 -0400
+++ 2 2018-07-11 14:06:37.091365261 -0400
@@ -1,5 +1,5 @@
-# Configuration checksum: 18353149741631095525
+# Configuration checksum: 4469123391115356843
# setup custom paths that do not require root access
pid /tmp/nginx.pid;
@@ -238,7 +238,7 @@
listen [::]:443 default_server backlog=511 ssl http2;
- # PEM sha: 7d275866a4fde585ad6d237cb8f1d5daa92ced96
+ # PEM sha: d4d39c26a870a4da30a3f060756fc61108b33d68
ssl_certificate /etc/ingress-controller/ssl/default-fake-certificate.pem;
ssl_certificate_key /etc/ingress-controller/ssl/default-fake-certificate.pem;

I solved this by ignoring two fields in the checksum.
@discordianfish thank you for the review
Great stuff finding the other issue. If you can do me another build with the fix, that would be great.
@Stono
Merging. Let's improve the metrics in the next iteration.
Force-pushed 726497e to d2175b8
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: aledbf, antoineco. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I'm still seeing different config hashes with
Yes, please run
with the last step, check if there's a real difference
Looks like the ordering of backend IPs is different:
@jpds are you sure you are getting the nginx.conf from different pods? You should get different PEM SHAs in the diff (unless you are using a custom default ssl certificate) |
@aledbf Yes, and there are different PEM SHAs (I'd only copied the top part of the diff).
What this PR does / why we need it:
#2701 (comment)
Context:
Process:
Replaces #2716
TODO:
nginx_ingress_controller_config_hash{class="",namespace="",pod=""}