Cleanup prometheus metrics after a reload #2726

aledbf · 2018-06-29T21:58:28Z

What this PR does / why we need it:

#2701 (comment)

Context:

we only have a list of endpoints to remove from the metrics
we need all the labels to delete a metric

Process:

after a reload and before replacing the running model, create a list of endpoints not available anymore
use prometheus registry to gather the current set of metrics
iterate the metrics
for each metric obtained in the previous step delete prometheus metric

Replaces #2716

TODO:

aledbf · 2018-06-29T21:59:02Z

@discordianfish ping. This PR uses the approach you suggested.

antoineco · 2018-06-30T00:24:20Z

internal/ingress/controller/controller.go

@@ -200,6 +200,11 @@ func (n *NGINXController) syncIngress(interface{}) error {
 		}(isFirstSync)
 	}

+	re := getRemovedEndpoints(n.runningConfig, &pcfg)
+	if len(re) > 0 {
+		go n.metricCollector.RemoveMetrics(re)


It’s important to have a test that ensures this operation is atomic. Because the goroutine is left unmonitored it could potentially run concurrently (eg. the sync fails and gets retried while metrics are still being removed).

sync fails and gets retried while metrics are still being removed

At this point, the sync already happened.
Also, syncIngress is not executed concurrently (and also is rate limited)

antoineco · 2018-06-30T00:26:07Z

internal/ingress/metric/collector/collector_test.go

-	if count != 1 {
-		t.Errorf("expected only one message from the UDP listern but %v returned", count)
+	if atomic.LoadUint64(&count) != 1 {
+		t.Errorf("expected only one message from the UDP listern but %v returned", atomic.LoadUint64(&count))


(nit) if you don’t store the value you may print something which is different from what you compared.

(typo) "listener"

antoineco · 2018-06-30T00:29:58Z

internal/ingress/metric/collector/collector.go

+func (sc *SocketCollector) RemoveMetrics(endpoints []string) {
+	mfs, err := prometheus.DefaultGatherer.Gather()
+	if err != nil {
+		glog.Errorf("error gathering metrics:", err)


(nit) It's clearer to start log messages with a capital letter to differentiate them from returned errors.

antoineco · 2018-06-30T00:31:33Z

internal/ingress/metric/collector/collector.go

+				continue
+			}
+
+			glog.Infof("Removing prometheus metric from histogram %v for endpoint %v", metricName, endpoint)


Could be verbose on the default log level.

discordianfish · 2018-07-02T09:36:17Z

This approach looks much better. Would be good to have tests though otherwise I wouldn't be confident this really keeps the metrics clean.

brancz

This approach looks much better. I still have the same synchronization concerns as the other comments, but I don't know the code base well enough whether the entry and synchronization points are chosen appropriately. I also left a comment on the test.

Overall much more confident in this though.

brancz · 2018-07-02T12:12:28Z

internal/ingress/metric/collector/collector_test.go

@@ -30,6 +30,7 @@ func TestNewUDPLogListener(t *testing.T) {
 	fn := func(message []byte) {
 		t.Logf("message: %v", string(message))
 		atomic.AddUint64(&count, 1)
+		time.Sleep(time.Millisecond)


Why is this sleep necessary? I would suggest to rather work with channels here if you are looking to test that this function is called exactly once. This seems like it will be flaky when done based on sleeping.

aledbf · 2018-07-02T13:34:17Z

This approach looks much better. Would be good to have tests though otherwise I wouldn't be confident this really keeps the metrics clean.

Today I will add tests for this. What's the recommended way to tests the metrics?

brancz · 2018-07-02T14:05:50Z

For kube-state-metrics we extracted some test utilities at some point (from the Prometheus node_exporter repo if I remember correctly), I think it's time to make them part of the client_golang project of Prometheus, but if you want to just use them you can copy them from here: https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/collectors/testutils/testutils.go

It's not super well documented, but I think looking at the tests it's relatively obvious how it works: https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/collectors/configmap_test.go

But should you have any questions I'm happy to help with anything.

aledbf · 2018-07-11T14:20:16Z

Would it be possible to add metrics for requests sent to the default backend?

already there, you will see "" as value of the label ingress

nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="0.005"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="0.01"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="0.025"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="0.05"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="0.1"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="0.25"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="0.5"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="1"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="2.5"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="5"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="10"} 0
nginx_ingress_controller_response_size_bucket{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404",le="+Inf"} 6
nginx_ingress_controller_response_size_sum{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404"} 1152
nginx_ingress_controller_response_size_count{controller_class="nginx",controller_namespace="ingress-nginx",controller_pod="nginx-ingress-controller-677b75cbf-mxp8s",host="echohea1ders",ingress="",method="GET",namespace="",path="/",service="",status="404"} 6

jpds · 2018-07-11T14:20:32Z

I'm also see wildly differing values for nginx_ingress_controller_config_hash.

aledbf · 2018-07-11T14:31:18Z

I'm also see wildly differing values for nginx_ingress_controller_config_hash.

I am checking this now

antoineco · 2018-07-11T14:46:42Z

internal/ingress/types.go

@@ -64,6 +64,9 @@ type Configuration struct {
 	// +optional
 	PassthroughBackends []*SSLPassthroughBackend `json:"passthroughBackends,omitempty"`

+	// ConfigmapChecksum contains the particular checksum of a Configuration object


As discussed on Slack, I find this name confusing and would have used BackendConfigChecksum instead.

discordianfish · 2018-07-11T16:19:19Z

cmd/nginx/main.go

 	// expose health check endpoint (/healthz)
 	healthz.InstallHandler(mux,
 		healthz.PingHealthz,
 		ic,
 	)

-	mux.Handle("/metrics", promhttp.Handler())
+	mux.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))


Handler() will instrument the handler too. If you want to keep this, use this instead:

mux.Handle( "/metrics", promhttp.InstrumentMetricHandler( reg, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), ), )

@discordianfish InstrumentMetricHandler is not available in 0.8. If safe to use prometheus/client_golang master?

I just realized this too :)
I think you can just use InstrumentHandler() from prometheus (instead of promhttp). Using master should be fine too though with the dependencies vendored anyway (it's what I did).

discordianfish · 2018-07-11T16:20:23Z

cmd/nginx/main.go

@@ -118,25 +118,20 @@ func main() {

 	conf.Client = kubeClient

-	ngx := controller.NewNGINXController(conf, fs)
+	reg := prometheus.NewRegistry()


The default registry enables the GoCollector and ProcessCollector. I think we should keep these metrics by adding:
``
reg.MustRegister(prometheus.NewGoCollector())
reg.MustRegister(prometheus.NewProcessCollector(os.Getpid(), ""))

aledbf · 2018-07-11T18:29:19Z

I'm also see wildly differing values for nginx_ingress_controller_config_hash.

@jpds @Stono found the issue with the metric.
When we start the ingress controller we create an SSL certificate for the default backend (this is done in code)
This means we have different certificate

$ diff -u cfg-from-1 cfg-from-2
--- 1	2018-07-11 14:06:23.447305757 -0400
+++ 2	2018-07-11 14:06:37.091365261 -0400
@@ -1,5 +1,5 @@
 
-# Configuration checksum: 18353149741631095525
+# Configuration checksum: 4469123391115356843
 
 # setup custom paths that do not require root access
 pid /tmp/nginx.pid;
@@ -238,7 +238,7 @@
 		
 		listen [::]:443  default_server  backlog=511 ssl http2;
 		
-		# PEM sha: 7d275866a4fde585ad6d237cb8f1d5daa92ced96
+		# PEM sha: d4d39c26a870a4da30a3f060756fc61108b33d68
 		ssl_certificate                         /etc/ingress-controller/ssl/default-fake-certificate.pem;
 		ssl_certificate_key                     /etc/ingress-controller/ssl/default-fake-certificate.pem;

I solve this ignoring two fields from the checksum

aledbf · 2018-07-11T18:29:58Z

@discordianfish thank you for the review

Stono · 2018-07-11T18:46:17Z

Great stuff finding the other issue. If you can do me another build with the fix that would be great

…

On Wed, 11 Jul 2018, 7:30 pm Manuel Alejandro de Brito Fontes, < ***@***.***> wrote: @discordianfish <https://github.com/discordianfish> thank you for the review — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2726 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABaviTwrzun-IARn2MpThCPMh9Px09W9ks5uFkQsgaJpZM4U9rXR> .

aledbf · 2018-07-11T19:05:37Z

@Stono quay.io/aledbf/nginx-ingress-controller:0.396

aledbf · 2018-07-12T15:43:38Z

Merging. Let's improve the metrics in the next iteration

antoineco · 2018-07-12T16:50:18Z

/lgtm

k8s-ci-robot · 2018-07-12T16:50:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, antoineco

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [aledbf,antoineco]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jpds · 2018-07-12T16:50:46Z

I'm still seeing different config hashes with 0.396. Anything I can do to debug this?

aledbf · 2018-07-12T16:54:23Z

Anything I can do to debug this?

Yes, please run

kubectl exec <ingress pod 1> cat /etc/nginx/nginx.conf > 1.cfg
kubectl exec <ingress pod 2> cat /etc/nginx/nginx.conf > 2.cfg
diff -u 1.cfg 2.cfg

with the last step, check if there's a real difference

jpds · 2018-07-12T16:57:32Z

Looks like the ordering of backend IPs is different:

--- 1.cfg	2018-07-12 17:56:21.715228056 +0100
+++ 2.cfg	2018-07-12 17:56:32.414810178 +0100
@@ -1,5 +1,5 @@
 
-# Configuration checksum: 7488782747850894691
+# Configuration checksum: 5234294058761034194
 
 # setup custom paths that do not require root access
 pid /tmp/nginx.pid;
@@ -261,8 +261,8 @@ http {
 		
 		keepalive 32;
 		
-		server 100.96.142.12:7000 max_fails=0 fail_timeout=0;
 		server 100.96.140.25:7000 max_fails=0 fail_timeout=0;
+		server 100.96.142.12:7000 max_fails=0 fail_timeout=0;
 		
 	}
 	
@@ -271,8 +271,8 @@ http {
 		
 		keepalive 32;
 		
-		server 100.96.142.26:7000 max_fails=0 fail_timeout=0;
 		server 100.96.140.20:7000 max_fails=0 fail_timeout=0;
+		server 100.96.142.26:7000 max_fails=0 fail_timeout=0;
 		
 	}
...

aledbf · 2018-07-12T17:09:36Z

@jpds are you sure you are getting the nginx.conf from different pods? You should get different PEM SHAs in the diff (unless you are using a custom default ssl certificate)

jpds · 2018-07-13T07:51:11Z

@aledbf Yes, and there are different PAM SHAs (I'd only copied the top part of the diff).

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 29, 2018

aledbf force-pushed the gather branch 3 times, most recently from 873140b to 29b6374 Compare June 29, 2018 22:22

aledbf requested a review from brancz June 29, 2018 23:30

antoineco reviewed Jun 30, 2018

View reviewed changes

aledbf force-pushed the gather branch 2 times, most recently from ec77003 to 9043fc5 Compare June 30, 2018 02:21

brancz reviewed Jul 2, 2018

View reviewed changes

aledbf mentioned this pull request Jul 2, 2018

Reset Prometheus request tags on conf reload #2728

Closed

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 2, 2018

aledbf added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 2, 2018

aledbf force-pushed the gather branch 5 times, most recently from b865351 to 3d67aaa Compare July 2, 2018 18:49

antoineco suggested changes Jul 11, 2018

View reviewed changes

discordianfish reviewed Jul 11, 2018

View reviewed changes

aledbf force-pushed the gather branch from 426051a to ae26664 Compare July 11, 2018 18:26

aledbf force-pushed the gather branch from ae26664 to 11554e3 Compare July 11, 2018 19:02

aledbf mentioned this pull request Jul 12, 2018

High-cardinality Prometheus metric ingress_controller_ssl_expire_time_seconds #2773

Closed

aledbf force-pushed the gather branch 2 times, most recently from 726497e to d2175b8 Compare July 12, 2018 16:05

Refactor controller metrics interface

1542a12

aledbf force-pushed the gather branch from d2175b8 to 1542a12 Compare July 12, 2018 16:47

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 12, 2018

k8s-ci-robot merged commit 1cdd643 into kubernetes:master Jul 12, 2018

aledbf deleted the gather branch July 12, 2018 18:33

aledbf mentioned this pull request Jul 12, 2018

WIP: Roadmap to 1.0 #2480

Closed

Evesy mentioned this pull request Feb 7, 2019

Metrics for deleted Certificates should be cleaned up cert-manager/cert-manager#1332

Closed

Cleanup prometheus metrics after a reload #2726

Cleanup prometheus metrics after a reload #2726

Conversation

aledbf commented Jun 29, 2018 • edited Loading

aledbf commented Jun 29, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

discordianfish commented Jul 2, 2018

brancz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aledbf commented Jul 2, 2018 • edited Loading

brancz commented Jul 2, 2018

aledbf commented Jul 11, 2018 • edited Loading

jpds commented Jul 11, 2018

aledbf commented Jul 11, 2018

Choose a reason for hiding this comment

discordianfish Jul 11, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aledbf commented Jul 11, 2018

aledbf commented Jul 11, 2018

Stono commented Jul 11, 2018 via email

aledbf commented Jul 11, 2018

aledbf commented Jul 12, 2018

antoineco commented Jul 12, 2018

k8s-ci-robot commented Jul 12, 2018

jpds commented Jul 12, 2018

aledbf commented Jul 12, 2018 • edited Loading

jpds commented Jul 12, 2018

aledbf commented Jul 12, 2018 • edited Loading

jpds commented Jul 13, 2018

aledbf commented Jun 29, 2018 •

edited

Loading

aledbf commented Jul 2, 2018 •

edited

Loading

aledbf commented Jul 11, 2018 •

edited

Loading

discordianfish Jul 11, 2018 •

edited

Loading

aledbf commented Jul 12, 2018 •

edited

Loading

aledbf commented Jul 12, 2018 •

edited

Loading