This repository has been archived by the owner on Feb 9, 2022. It is now read-only.

Add prometheus and friends #27

Merged: @anguslees merged 4 commits into vmware-archive:master on Apr 20, 2018

Conversation

@anguslees (Contributor) commented Apr 17, 2018

  • prometheus
  • alertmanager
  • node-exporter
  • kube-state-metrics
  • heapster

Also includes various improvements to the oauth2-proxy setup (in particular, switches to same-domain oauth2 redirects)

Fixes #24

@anguslees added the enhancement (New feature or request) label on Apr 17, 2018
@anguslees self-assigned this on Apr 17, 2018
@arapulido requested a review from jjo on April 17, 2018 11:01
@arapulido (Contributor) commented

@jjo Could you review this, please?

@jjo (Contributor) left a comment

SGTM in general, some nits.

local kube = import "kube.libsonnet";

local arch = "amd64";
local version = "v1.4.3";
@jjo commented Apr 18, 2018

Why not the latest, v1.5.2?
If there's an explicit reason (AKS compatibility or similar), I'd suggest adding a comment explaining why.

@anguslees (Author) commented Apr 18, 2018

No known reason (I copied a lot of this from some existing jsonnet files). Will update.
Done.
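
For illustration, a minimal sketch of the pattern being discussed (the registry path is an assumption on my part, guessed from the heapster-like version numbers; it is not copied from this PR). Keeping arch and version as locals makes the suggested bump a one-line change:

local arch = "amd64";
local version = "v1.5.2";  // bumped from v1.4.3, per the review comment above

{
  // Hypothetical image string; the real manifest may format this differently.
  image: "k8s.gcr.io/heapster-%s:%s" % [arch, version],
}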

@@ -191,7 +177,7 @@ local kube = import "kube.libsonnet";
terminationGracePeriodSeconds: 60,
containers_+: {
default: kube.Container("nginx") {
image: "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0",
image: "quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.12.0",
@jjo commented

Latest is 0.13.0, released 2 days ago :), which has a relevant addition re: kcm.

@anguslees (Author) commented

Done.

spec+: {
containers_+: {
proxy: kube.Container("oauth2-proxy") {
image: "a5huynh/oauth2_proxy:2.2.1",
@jjo commented Apr 18, 2018

Given how security-critical this image is, we're internally using one multi-stage built from sources. It's fine to keep it as-is while testing, but let's work on adding it to our kube-prod-containers for release; I'll submit that PR.

FWIW, that 2.2.1 version is also misleading: the sources only have v2.2, with the extra .1 added by whoever built the image, AFAICT.

@anguslees (Author) commented

Agreed on all points. The latest oauth2-proxy release was quite a while ago too, so most images I could find also haven't been updated in ages (i.e. the base OS layers of the image haven't been updated since the oauth2-proxy release either). I went with this image over the alternatives solely because it is the same image used by the helm chart.

"cookie-refresh": "3h",
"set-xauthrequest": true,
"tls-cert": "",
upstream: "file:///dev/null",
@jjo commented

Expose upstream at the object root to make it easier to override; see e.g. the sample run output below, which shows a somewhat nonsensical /null URL path mapping:

$ docker run -e OAUTH2_PROXY_CLIENT_ID=x -e OAUTH2_PROXY_CLIENT_SECRET=x -e OAUTH2_PROXY_COOKIE_SECRET=x -it a5huynh/oauth2_proxy:2.1 --email-domain="" --upstream="file://dev/null"
2018/04/18 15:38:39 oauthproxy.go:143: mapping path "/null" => file system "/null"
2018/04/18 15:38:39 oauthproxy.go:157: OAuthProxy configured for Google Client ID: x
2018/04/18 15:38:39 oauthproxy.go:167: Cookie settings: name:_oauth2_proxy secure(https):true httponly:true expiry:168h0m0s domain:<default> refresh:disabled
2018/04/18 15:38:39 http.go:49: HTTP: listening on 127.0.0.1:4180
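
A minimal sketch of that suggestion (the field names below are assumptions, not the PR's actual structure): hoist upstream into a root-level field that the args consume, so a consumer can override it without reaching into the container definition.

local oauth2_proxy = {
  local this = self,

  // Dummy value; in this setup oauth2-proxy is only an auth handler,
  // so nothing is actually proxied to an upstream.
  upstream:: "file:///dev/null",

  args: {
    "cookie-refresh": "3h",
    "set-xauthrequest": true,
    "tls-cert": "",
    upstream: this.upstream,
  },
};

// A consumer can then override just the root-level field:
oauth2_proxy { upstream: "http://some-service.monitoring.svc:8080" }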

@anguslees (Author) commented Apr 19, 2018

Yes, I found this odd too - there doesn't seem to be a way to "properly" turn off upstream, AFAICS. The chart (for example) just uses --upstream=file:///dev/null as I have, without worrying about the fact that it actually creates a "/null" path on the oauth2 server.

Note that as I have it set up here, I'm using it as an auth handler for nginx: oauth2-proxy is exposed to the internet, but upstream HTTP requests/replies do not actually get proxied through oauth2-proxy. As a practical result, there's only one instance of oauth2-proxy running, regardless of how many actual k8s services use that oauth2 configuration.
Consequently, I don't think of "upstream" as an exposed/configurable parameter here, any more than the other options like --tls-cert.
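
For reference, a rough sketch of that auth-handler wiring (the annotation names follow the usual nginx-ingress external-auth setup and are my assumption, not copied from this PR): each protected Ingress points nginx's auth subrequest and sign-in redirect at the single oauth2-proxy instance, on the same domain as the protected service.

// Hypothetical fragment merged into a protected Ingress's metadata.
{
  metadata+: {
    annotations+: {
      // nginx issues an auth subrequest here before serving the real backend.
      "nginx.ingress.kubernetes.io/auth-url": "https://$host/oauth2/auth",
      // Unauthenticated users are redirected to the same-domain sign-in path.
      "nginx.ingress.kubernetes.io/auth-signin": "https://$host/oauth2/start",
    },
  },
}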

@jjo commented

Ok, let's revisit when we start actually plumbing services behind it.

},
{
target_label: "__address__",
replacement: "blackbox-exporter.example.com:9115",
@jjo commented

Leave it as blackbox:9115, with a TODO() note to later add a blackbox-exporter deployment in the same namespace.

The other, likely more correct approach in general, would be to have a well-known field, e.g. kubeprod_enable: true, that we could set to false and then massage to nullify from the containing object (scrape_configs in this case), plus std.prune(), so that we can more easily stage new entries (see the sketch below).
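
A minimal sketch of that second idea (the kubeprod_enable field and the job entries are placeholders, not from this PR): keep staged entries in the object, and let std.prune() drop the ones that are flagged off.

local scrape_configs_ = {
  "kubernetes-apiservers": {
    kubeprod_enable: true,
    job_name: "kubernetes-apiservers",
    // ... the usual kubernetes_sd_configs, relabel_configs, etc.
  },
  // Staged, but not active until a blackbox-exporter deployment exists:
  "blackbox-http": {
    kubeprod_enable: false,
    job_name: "blackbox-http",
  },
};

{
  scrape_configs: std.prune([
    // Hide the flag from the output and null out disabled entries;
    // std.prune() then removes the nulls.
    local cfg = scrape_configs_[name];
    if cfg.kubeprod_enable then cfg { kubeprod_enable:: true } else null
    for name in std.objectFields(scrape_configs_)
  ]),
}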

@anguslees (Author) commented

Done. I changed it to "blackbox-exporter:9115", and left a TODO in the top-level prometheus.jsonnet.

},
{
target_label: "__address__",
replacement: "blackbox-exporter.example.com:9115",
@jjo commented

Ditto above.

@anguslees (Author) commented

Done.

// they're no-ops).
// In particular annotations={} is apparently a "change",
// since the comparison is ignorant of defaults.
std.prune($.PersistentVolumeClaim($.hyphenate(kv[0])) + {apiVersion:: null, kind:: null} + kv[1])
@jjo commented

👍
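
As a small aside on why the std.prune() helps (my own toy example, not from this PR): an empty annotations map survives a plain merge, but std.prune strips it, so the generated template no longer spuriously differs from what the API server reports.

std.prune({ metadata: { name: "data", annotations: {} } })
// evaluates to: { metadata: { name: "data" } }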

@jjo (Contributor) left a comment

LGTM, thanks!
fwiw jenkins is tossing on this PR, didn't dig into why tho.

@jjo (Contributor) commented Apr 19, 2018

BTW, a general note about (ab)using the kube-system namespace for resources that obviously don't belong there:

  • IMO it's wrong; maybe use kube-prod-system or similar? We could also go per-type (ingress, logging, monitoring), which would be my personal preference in general, but given the kinda "all-in-one(-shot)" kubeprod approach, a single namespace would be better (a minimal sketch follows below).
  • It does have an operational impact: for example, kops upgrade checks for kube-system to be ready/settled (essentially all pods Running) before replacing each node. At Bitnami it's already a PITA for us having elasticsearch and kibana there, as these heavyweight components can slow upgrades down and/or make them fail on timeout (forcing manual retries).
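
A minimal sketch of the dedicated-namespace idea (the name "kubeprod" is a placeholder; nothing in this PR decides it):

{
  namespace: {
    apiVersion: "v1",
    kind: "Namespace",
    metadata: { name: "kubeprod" },
  },

  // Each component would then set metadata.namespace to
  // $.namespace.metadata.name instead of "kube-system".
}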

@anguslees (Author) commented

fwiw jenkins is tossing on this PR, didn't dig into why tho.

I'm dealing with some OOM issues on the Jenkins slaves, made worse (for me) by the default LimitRange. I think I've worked out why my earlier attempts to set explicit resource limits weren't working.

BTW, a general note about ab-using kube-system namespace for resources that are obviously nope

Yes. My thinking was that a few of the things we need to modify have to be in kube-system (for example, fixing up kube-dashboard for installs that don't do that correctly), so it would be confusing to pretend we were all contained somewhere else. My intention was to always use custom serviceAccounts, so essentially every "app" is isolated uniquely, but (as we know) there are a number of places where isolation becomes effectively namespace-level - and I completely agree that it's messy to include big components like elasticsearch in kube-system. Very open to changing to a dedicated namespace (or group of namespaces) in the near future.

I didn't know kops checked for kube-system. It's a pity it doesn't just do a drain and rely on PodDisruptionBudget, since there's nothing special about kube-system that couldn't also apply equally strongly to other business-critical namespaces...
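
For reference, a rough sketch of the PodDisruptionBudget approach mentioned above, in plain jsonnet with placeholder names and counts:

{
  apiVersion: "policy/v1beta1",
  kind: "PodDisruptionBudget",
  metadata: { name: "elasticsearch-logging", namespace: "kube-system" },
  spec: {
    // Keep at least this many replicas up while nodes are being drained.
    minAvailable: 2,
    selector: { matchLabels: { name: "elasticsearch-logging" } },
  },
}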

Once upon a time, jsonnet `{foo+: {a: 'b'}}` required `super.foo` to
exist.  That hasn't been the case for several versions now.
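
A tiny illustration of what that means in practice (my own example, not from the commits):

// With older jsonnet this errored because `super.foo` did not exist;
// current versions simply treat `foo+:` as a plain definition.
{} + { foo+: { a: "b" } }   // evaluates to { foo: { a: "b" } }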

These empty `annotations: {}` upset StatefulSet's buggy change
detection, which we can mostly avoid by just removing them.
- prometheus
- alertmanager
- node-exporter
- kube-state-metrics
- heapster
@anguslees merged commit 4d209e0 into vmware-archive:master on Apr 20, 2018