---
title: ingress-dashboard
authors:
- "@jotak"
- "@OlivierCazade"
reviewers:
- "@Miciah"
- "@candita"
approvers:
- "@Miciah"
- "@candita"
api-approvers:
- "@deads2k"
creation-date: 2023-06-28
last-updated: 2023-06-28
tracking-link:
- "https://issues.redhat.com/browse/OCPSTRAT-139"
- "https://issues.redhat.com/browse/NETOBSERV-1052"
see-also:
replaces:
superseded-by:
---

# Ingress dashboard creation for the OpenShift Console

## Release Signoff Checklist

- [x] Enhancement is `implementable`.
- [x] Design details are appropriately documented from clear requirements.
- [x] Test plan is defined.
- [ ] Graduation criteria for dev preview, tech preview, GA: N/A
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/).

## Summary

The goal is to add a new dashboard to the OpenShift Console (under “Observe” > “Dashboards”), dedicated to Ingress-related metrics.
Such dashboards are deployed through configmaps, so a new controller will be added to the Ingress Operator to manage this configmap.

Ingress components, such as HAProxy, already provide metrics that are exposed and collected by Prometheus / Cluster Monitoring. Administrators should be able to use a consolidated view, built from a subset of these metrics, to get a quick overview of the cluster state. This enhancement proposal is part of a wider initiative to improve the observability of networking components (see https://issues.redhat.com/browse/OCPSTRAT-139).

## Motivation

While ingress-related metrics already exist and are accessible in the OpenShift Console (via the menu “Observe” > “Metrics”), there is no consolidated view presenting a summary of them. Dashboards already exist in other areas (such as etcd and compute resources), but networking is under-represented there, despite its importance for monitoring and troubleshooting.
Metrics such as HAProxy error rates or latency can be made more visible by promoting them in a dashboard.

In addition, product management has shown interest in surfacing cluster-wide statistics, such as the number of routes and shards in use.

More details on the new dashboard content are provided below.

### Goals

1. Design and implement a new dashboard using a selection of metrics listed below.
2. Update the Ingress Operator to deploy this dashboard configmap, making it accessible to the Cluster Monitoring stack.

### Non-Goals


- This work is not intended to provide a comprehensive set of metrics all at once. Instead, the intent is to "start small", putting the mechanisms in place in code, then iterating later based on feedback to add or amend dashboard elements without any changes outside of the dashboard JSON definition file.
- This enhancement does not include any new metric creation or exposure: only already-available metrics are considered. If discussions lead to creating new metrics, that could be the subject of a follow-up enhancement.

> **Review note (reviewer):** I know it's not directly related to this enhancement proposal, but we had a discussion [1] a few weeks ago with @candita and @Miciah about the namespace/pod/service labels. TL;DR: HAProxy metrics have labels identifying the router pod (`namespace`, `pod`, `service`) and labels identifying the service pods (the same, but with the `exported_` prefix). The current situation prevents project users from seeing the metrics for their routes (the `exported_namespace` label should be stored as the `namespace` label), and it would also prevent having a router dashboard in the developer console. If developer dashboards aren't in the scope of this proposal, it should at least be noted here.
>
> [1] https://redhat-internal.slack.com/archives/CCH60A77E/p1664366340954199
>
> **Author:** Thanks, it's mentioned in several places that this is for the admin view, but I'll add a non-goal about devs for more clarity.

### User Stories

> As a cluster administrator, I want to get a quick overview of general cluster statistics, such as the number of routes or shards in use.

> As a cluster administrator, I want to get quick insight into incoming traffic statistics, such as latency and HTTP errors.

## Proposal

To make a new dashboard discoverable by Cluster Monitoring, a `ConfigMap` needs to be created in the namespace `openshift-config-managed`, containing a static dashboard definition in Grafana format (JSON). The dashboard datasource has to be Cluster Monitoring's Prometheus.

The Ingress Operator is responsible for creating and reconciling this `ConfigMap`. We assume all metrics used in the dashboard are present unconditionally, which allows us to create a static dashboard unconditionally as well.
The Ingress Operator embeds a static JSON manifest for the full dashboard.
If the operator detects any difference between the deployed dashboard and the embedded one, it replaces the deployed dashboard with the embedded one.

### Dashboard content

At the top, a summary row presenting global cluster statistics as text panels:

- Total current byte rate in (aggregated across all routes/shards): _sum(rate(haproxy_server_bytes_in_total[1m]))_

- Total current byte rate out (aggregated across all routes/shards): _sum(rate(haproxy_server_bytes_out_total[1m]))_
- Total current number of routes: _count(count(haproxy_server_up == 1) by (route))_

- Total current number of ingress controllers: _count(count(haproxy_server_up == 1) by (pod))_

> **Review note (reviewer, on the rate queries):** You want at least 4× the scrape interval, which I believe is 30s for the routers; i.e. prefer `sum(rate(haproxy_server_bytes_in_total[2m]))`. The same comment applies to the other rate queries.
>
> **Review note (reviewer, on the route count):** No need for `count(count(...) by (...))` if you're counting everything eventually; `count(haproxy_server_up == 1)` would do.
>
> **Author:** Hmm, this isn't the same thing; the suggested query doesn't count just routes. Anyway, this query was removed from the dashboard after iterating; I updated the EP.
>
> **Reviewer:** Sorry, the suggested query was not up to date. We want to count by service, but only once per route, so we used this query: `count(count(haproxy_server_up == 1) by (route, service)) by (service)`.

Below this top summary are more detailed time-series panels. Each of these panels comes in two flavours: aggregated per route, and aggregated per controller instance.
> **Review note (reviewer):** Do we need both per-route and per-router-instance panels?


- Byte rate in, per route or per controller instance: _sum(rate(haproxy_server_bytes_in_total[1m])) by (route)_ or _sum(rate(haproxy_server_bytes_in_total[1m])) by (pod)_

- Byte rate out, per route or per controller instance: _sum(rate(haproxy_server_bytes_out_total[1m])) by (route)_ or _sum(rate(haproxy_server_bytes_out_total[1m])) by (pod)_

- Response error rate, per route or per controller instance: _sum(irate(haproxy_server_response_errors_total[180s])) by (route)_ or _sum(irate(haproxy_server_response_errors_total[180s])) by (pod)_

- Average response latency, per route or per controller instance: _avg(haproxy_server_http_average_response_latency_milliseconds != 0) by (route)_ or _avg(haproxy_server_http_average_response_latency_milliseconds != 0) by (pod)_
> **Review note (reviewer):** (nit) I would normalize the query result to seconds.
>
> **Author (@jotak, Nov 28, 2023):** I had this kind of discussion for another task... Ideally we'd like the dashboard to auto-scale time units, but it doesn't. So, without auto-scaling, we need to use the scale that works best in most situations. I think latency is often better read in milliseconds (e.g. 10ms, 100ms, 1000ms) than in seconds (e.g. 0.01s, 0.1s, 1s). Latencies above 1s should be less frequent than those below 1s.
>
> **Reviewer:** Hmm, this looks like a feature request for the console dashboards?


### Workflow Description

A cluster administrator will be able to view this dashboard from the OpenShift Console, in the _Administrator_ view, under _Observe_ > _Dashboards_ menu.
Several dashboards are already listed there, e.g.:

- `Kubernetes / Compute Resources / Pod` (tag: `kubernetes-mixin`)
- `Kubernetes / Compute Resources / Workload` (tag: `kubernetes-mixin`)
- `Kubernetes / Networking / Cluster` (tag: `kubernetes-mixin`)
- `Node Exporter / USE Method / Cluster` (tag: `node-exporter-mixin`)
- `Node Exporter / USE Method / Node` (tag: `node-exporter-mixin`)

The new ingress dashboard will be listed there, as:
- `Networking / Ingress` (tag: `networking-mixin`)
> **Review note (reviewer):** The `-mixin` suffix exists because these dashboards come from upstream "mixins" (https://github.com/kubernetes-monitoring/kubernetes-mixin/ for instance). There's no reason to add it for the new dashboard.


This "Networking" category can potentially be used for other Network-related dashboards, such as for OVN, NetObserv, etc.

Clicking on this dashboard will open it, showing time-series charts, such as cluster ingress stats and HAProxy metrics.
On each chart, an "Inspect" link allows the user to view that metric from the _Metrics_ page, which allows the user to customize the query (for example, to modify the label filters, the grouping, etc.) and view the result directly.
Note that editing a query from the _Metrics_ page does not affect dashboards displayed in the _Dashboards_ page.
These behaviours are already implemented in the Console and do not require any change.

### API Extensions

No change to the API is planned.

### Implementation Details / Notes / Constraints

The new `ConfigMap` installed in `openshift-config-managed` should be named `grafana-dashboard-ingress` (the `grafana-dashboard-` prefix is common for all such dashboards).
> **Review note (reviewer):** The `grafana-` prefix is legacy. You're free to name your configmap as you wish, and I'd recommend `dashboard-ingress`.


It needs to be labelled with `console.openshift.io/dashboard: "true"`.

Dashboards have a `title` field as part of their Grafana JSON model, which is displayed in the OpenShift Console where dashboards are listed.
Here `title` should be `Networking / Ingress`.

Dashboards should also have a tag that identifies their supplier in the OpenShift Console; existing tags are: `kubernetes-mixin`, `node-exporter-mixin`, `prometheus-mixin`, `etcd-mixin`.
A new tag named `networking-mixin` should be used for this new dashboard. This tag aims to group all dashboards related to networking, such as OVN dashboards and NetObserv dashboards that may be added in the future.
This tag is directly set in the static JSON definition of the dashboard. No further action is required for tag creation.
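
For illustration, a minimal skeleton of the dashboard JSON (the `title`, `tags`, and `panels` fields follow the Grafana dashboard model; the empty `panels` array is a placeholder, not the final content):

```json
{
  "title": "Networking / Ingress",
  "tags": ["networking-mixin"],
  "panels": []
}
```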

A typical procedure to design and create the dashboard is to use Grafana for the design phase, then export the dashboard as JSON and save it as an asset in the target repository. It can then be embedded in the built artifact using `go:embed` and injected into a `ConfigMap`. Example:

```golang
package dashboards // illustrative package name

import (
	_ "embed"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

//go:embed dashboard.json
var dashboardEmbed string

// buildDashboard wraps the embedded dashboard JSON in the ConfigMap that
// Cluster Monitoring discovers through the console.openshift.io/dashboard label.
func buildDashboard() *corev1.ConfigMap {
	configMap := corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "grafana-dashboard-ingress",
			Namespace: "openshift-config-managed",
			Labels: map[string]string{
				"console.openshift.io/dashboard": "true",
			},
		},
		Data: map[string]string{
			"dashboard.json": dashboardEmbed,
		},
	}
	return &configMap
}
```

This is achieved by a new controller added to the operator, in charge of reconciling the dashboard. This controller watches the Infrastructure object to which it is bound, as well as the generated configmap.
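
The reconciliation can be sketched as follows (a minimal sketch assuming a controller-runtime client; the `dashboardReconciler` type and its wiring are illustrative, not the actual operator code):

```golang
package dashboards // illustrative package name

import (
	"context"
	"reflect"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// dashboardReconciler is illustrative; the real controller is registered
// with the Ingress Operator's manager and watches the configmap it owns.
type dashboardReconciler struct {
	client client.Client
}

func (r *dashboardReconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	desired := buildDashboard() // the embedded dashboard from the example above
	current := &corev1.ConfigMap{}
	err := r.client.Get(ctx, types.NamespacedName{Name: desired.Name, Namespace: desired.Namespace}, current)
	if errors.IsNotFound(err) {
		// First deployment, or the configmap was deleted by hand: recreate it.
		return reconcile.Result{}, r.client.Create(ctx, desired)
	}
	if err != nil {
		return reconcile.Result{}, err
	}
	if !reflect.DeepEqual(current.Data, desired.Data) {
		// The deployed dashboard differs from the embedded one: overwrite it.
		current.Data = desired.Data
		return reconcile.Result{}, r.client.Update(ctx, current)
	}
	return reconcile.Result{}, nil
}
```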

The controller should deploy this configmap in the `openshift-config-managed` namespace. Any configmap deployed in this namespace with the `console.openshift.io/dashboard` label will be automatically picked by the monitoring operator and deployed in the OpenShift Console. The monitoring stack is responsible for querying the metrics as defined in the dashboard.
> **Review note (reviewer):** (nit-picky) The console will query the monitoring stack API.
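
As a quick check (assuming standard `oc` access), the dashboard configmaps picked up this way can be listed with a label selector:

```
oc get configmap -n openshift-config-managed -l console.openshift.io/dashboard=true
```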


When the Ingress operator is upgraded to a new version, if this upgrade brings changes to the dashboard, the existing ConfigMap will be overwritten through reconciliation.

Note that, despite this work taking place in the Ingress Operator codebase, the implementation and initial maintenance are done by the NetObserv team.

### Risks and Mitigations

### Drawbacks

## Design Details

### Test Plan

Unit tests

- Verify that the embedded JSON is parseable (to avoid unintentional errors while manually editing the JSON).
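
A minimal sketch of such a test, reusing the `dashboardEmbed` variable from the embedding example above (the test name and the `title` check are illustrative):

```golang
package dashboards

import (
	"encoding/json"
	"testing"
)

// TestDashboardJSONIsParseable guards against accidental syntax errors
// introduced while hand-editing the embedded dashboard.json asset.
func TestDashboardJSONIsParseable(t *testing.T) {
	var parsed map[string]interface{}
	if err := json.Unmarshal([]byte(dashboardEmbed), &parsed); err != nil {
		t.Fatalf("embedded dashboard.json is not valid JSON: %v", err)
	}
	if _, ok := parsed["title"]; !ok {
		t.Error("dashboard JSON is missing the title field")
	}
}
```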

E2E Tests

There are two scenarios, depending on the cluster network topology:

1. Verify that the cluster network topology is not external
2. Verify that the new dashboard ConfigMap was created.
3. Delete the ConfigMap.
4. Verify that the dashboard ConfigMap is recreated.
5. Modify the ConfigMap JSON.
6. Verify that the ConfigMap is reset to the expected content.
> **Review note (reviewer, on step 1):** What's the point of this step? Is this referring to the external topology mode that can be used on HyperShift to suppress creation of a default IngressController? Is this something that other E2E tests wouldn't need?
>
> **Review note (reviewer, on steps 2–6):** Recently we have been doing this sort of test using unit tests with a fake client and fake reconciler. You can search for `Test_Reconcile` to see some examples.
>
> **Author:** This test procedure was implemented as an e2e test in the PR: https://github.com/openshift/cluster-ingress-operator/pull/962/files#diff-0c5cdc75d6910152df4d57b17e2257ae3a7cbf5bc718d868087f3e6b9e903022 Should we transform the test to a unit test?

1. Verify that the cluster network topology is external
2. Verify that the ConfigMap was not deployed

### Graduation Criteria

This enhancement does not require graduation milestones.

#### Dev Preview -> Tech Preview

N/A; This feature will go directly to GA.

#### Tech Preview -> GA

N/A; This feature will go directly to GA.

#### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

Upgrading from a previous release must install the new dashboard without requiring any intervention.

On subsequent upgrades, if the dashboard already exists and the new version brings changes to it, the existing dashboard will be overwritten with the new one.

In case of a downgrade to a version that does not have this dashboard, the related `ConfigMap` would remain in the cluster, so the dashboard would still be visible. This should not cause any trouble, since the related HAProxy metrics already existed in prior versions. If this is nonetheless perceived as an issue, the `ConfigMap` can be deleted manually to remove the dashboard:

```
oc delete configmap grafana-dashboard-ingress -n openshift-config-managed
```

### Version Skew Strategy

N/A

### Operational Aspects of API Extensions

#### Failure Modes

N/A

#### Support Procedures

- If the dashboard doesn't show up in the "Observe" > "Dashboards" menu:

Make sure a `ConfigMap` named `grafana-dashboard-ingress` was created in namespace `openshift-config-managed`.
If not, check the ingress operator pod logs for any error that could be related to dashboard creation.

- If the dashboard shows unexpected empty charts:

Click the "Inspect" link to see the Prometheus query from the "Observe" > "Metrics" view. You can verify whether the metric exists by querying just the metric name: for instance, if the query is `avg(haproxy_server_http_average_response_latency_milliseconds != 0) by (route)`, look at `haproxy_server_http_average_response_latency_milliseconds` alone and see whether it returns any data points.

If there is not, check whether other HAProxy metrics exist, such as `haproxy_server_up`. If none do, something might be wrong in the HAProxy pods, as they don't seem to be generating any metrics.

Another possibility is that some HAProxy metric definitions changed (metric names or labels). In that case, the dashboard needs to be updated: please open an issue.


## Implementation History

## Alternatives

A first alternative would be to expose the metrics so that customers can create their own dashboard outside of the Console.

Another alternative would be to let customers create their own Grafana dashboard and then create the configmap with the exported JSON. This requires some knowledge of the available metrics and of the Prometheus query language, and it is already possible today.