---
title: ingress-dashboard
authors:
  - "@jotak"
  - "@OlivierCazade"
reviewers:
  - "@Miciah"
  - "@candita"
approvers:
  - "@Miciah"
  - "@candita"
api-approvers:
  - "@deads2k"
creation-date: 2023-06-28
last-updated: 2023-06-28
tracking-link:
  - "https://issues.redhat.com/browse/OCPSTRAT-139"
  - "https://issues.redhat.com/browse/NETOBSERV-1052"
see-also:
replaces:
superseded-by:
---

# Ingress dashboard creation for the OpenShift Console

## Release Signoff Checklist

- [x] Enhancement is `implementable`.
- [x] Design details are appropriately documented from clear requirements.
- [x] Test plan is defined.
- [ ] Graduation criteria for dev preview, tech preview, GA: N/A
- [ ] User-facing documentation is created in [openshift-docs](https://github.com/openshift/openshift-docs/).

## Summary

The goal is to add a new dashboard in the OpenShift Console (under the “Observe” > “Dashboards” menu), dedicated to metrics related to Ingress.
Such dashboards are deployed through configmaps, so a new controller will be added to the Ingress Operator to manage this configmap.

Ingress components, such as HAProxy, already provide some metrics that are exposed and collected by Prometheus / Cluster Monitoring. Administrators should be able to get a consolidated view, using a subset of these metrics, to get a quick overview of the cluster state. This enhancement proposal is part of a wider initiative to improve the observability of networking components (see https://issues.redhat.com/browse/OCPSTRAT-139).

## Motivation

While ingress-related metrics already exist and are accessible in the OpenShift Console (via the “Observe” > “Metrics” menu), there is no consolidated view presenting a summary of them. There are existing dashboards in other areas (such as etcd, compute resources, etc.), but networking is less represented there today, despite its importance for monitoring and troubleshooting.
Metrics such as HAProxy error rates or latency can be made more visible by promoting them in a dashboard.

In addition, product management has shown interest in providing and making more visible some cluster-wide statistics, such as the number of routes and shards in use.

More details on the new dashboard content are provided below.

### Goals

1. Design and implement a new dashboard using a selection of the metrics listed below.
2. Update the Ingress Operator to deploy this dashboard configmap, making it accessible to the Cluster Monitoring stack.

### Non-Goals

- This work is not intended to provide a comprehensive set of metrics all at once. Instead, the intent is to "start small", putting in place all the mechanisms in code, and iterate later based on feedback to add or amend elements in the dashboard without any changes outside of the dashboard JSON definition file.
- This enhancement does not include any new metric creation or exposition: only already available metrics are considered. If discussions lead to considering the creation of new metrics, this could be the purpose of a follow-up enhancement.

### User Stories

> As a cluster administrator, I want to get a quick overview of general cluster statistics, such as the number of routes or shards in use.

> As a cluster administrator, I want to get a quick insight into incoming traffic statistics, such as latency and HTTP errors.

## Proposal

To make a new dashboard discoverable by Cluster Monitoring, a `ConfigMap` needs to be created in the namespace `openshift-config-managed`, containing a static dashboard definition in Grafana format (JSON). The dashboard datasource has to be Cluster Monitoring's Prometheus.

The Ingress Operator is responsible for creating and reconciling this `ConfigMap`. We assume all metrics used in the dashboard are present unconditionally, which allows us to create a static dashboard unconditionally as well.

The Ingress Operator embeds a static JSON manifest for the full dashboard.
If the operator detects any change between the deployed dashboard and the embedded one, the operator replaces the deployed dashboard with the embedded one.
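
A minimal sketch of that reconciliation logic is shown below. The names (`dashboardReconciler`, `ensureDashboard`) are illustrative only and are not the actual Ingress Operator code; `buildDashboard` refers to the embedding example later in this document.

```golang
import (
    "context"
    "reflect"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// dashboardReconciler is a hypothetical reconciler holding a controller-runtime client.
type dashboardReconciler struct {
    client client.Client
}

// ensureDashboard creates the dashboard ConfigMap if it is missing, and overwrites
// its data whenever it differs from the embedded definition.
func (r *dashboardReconciler) ensureDashboard(ctx context.Context) error {
    desired := buildDashboard() // embeds the static JSON, see the example below

    current := &corev1.ConfigMap{}
    name := types.NamespacedName{Namespace: desired.Namespace, Name: desired.Name}
    if err := r.client.Get(ctx, name, current); err != nil {
        if apierrors.IsNotFound(err) {
            // Dashboard not deployed yet: create it from the embedded definition.
            return r.client.Create(ctx, desired)
        }
        return err
    }

    // Replace the deployed dashboard whenever it has drifted from the embedded one.
    if !reflect.DeepEqual(current.Data, desired.Data) {
        current.Data = desired.Data
        return r.client.Update(ctx, current)
    }
    return nil
}
```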

### Dashboard content

At the top, a summary row presents global cluster statistics as text panels:

- Total current byte rate in (aggregated across all routes/shards): _sum(rate(haproxy_server_bytes_in_total[1m]))_
- Total current byte rate out (aggregated across all routes/shards): _sum(rate(haproxy_server_bytes_out_total[1m]))_
- Total current number of routes: _count(count(haproxy_server_up == 1) by (route))_
- Total current number of ingress controllers: _count(count(haproxy_server_up == 1) by (pod))_

Below this top summary come more detailed time-series panels. Each of these panels comes in two flavours: aggregated per route, and aggregated per controller instance.

- Byte rate in, per route or per controller instance: _sum(rate(haproxy_server_bytes_in_total[1m])) by (route)_ or _sum(rate(haproxy_server_bytes_in_total[1m])) by (pod)_

- Byte rate out, per route or per controller instance: _sum(rate(haproxy_server_bytes_out_total[1m])) by (route)_ or _sum(rate(haproxy_server_bytes_out_total[1m])) by (pod)_

- Response error rate, per route or per controller instance: _sum(irate(haproxy_server_response_errors_total[180s])) by (route)_ or _sum(irate(haproxy_server_response_errors_total[180s])) by (pod)_

- Average response latency, per route or per controller instance: _avg(haproxy_server_http_average_response_latency_milliseconds != 0) by (route)_ or _avg(haproxy_server_http_average_response_latency_milliseconds != 0) by (pod)_

### Workflow Description

A cluster administrator will be able to view this dashboard from the OpenShift Console, in the _Administrator_ view, under the _Observe_ > _Dashboards_ menu.
Several dashboards are already listed there, e.g.:

- `Kubernetes / Compute Resources / Pod` (tag: `kubernetes-mixin`)
- `Kubernetes / Compute Resources / Workload` (tag: `kubernetes-mixin`)
- `Kubernetes / Networking / Cluster` (tag: `kubernetes-mixin`)
- `Node Exporter / USE Method / Cluster` (tag: `node-exporter-mixin`)
- `Node Exporter / USE Method / Node` (tag: `node-exporter-mixin`)

The new ingress dashboard will be listed there as:

- `Networking / Ingress` (tag: `networking-mixin`)

This "Networking" category can potentially be used for other network-related dashboards, such as for OVN, NetObserv, etc.

Clicking on this dashboard will open it, showing time-series charts, such as cluster ingress stats and HAProxy metrics.
On each chart, an "Inspect" link allows the user to view that metric from the _Metrics_ page, where the query can be customized (for example, to modify the label filters, the grouping, etc.) and the result viewed directly.
Note that editing a query from the _Metrics_ page does not affect dashboards displayed in the _Dashboards_ page.
These behaviours are already implemented in the Console and do not necessitate any change.

### API Extensions

No planned change to the API.

### Implementation Details / Notes / Constraints

The new `ConfigMap` installed in `openshift-config-managed` should be named `grafana-dashboard-ingress` (the `grafana-dashboard-` prefix is common to all such dashboards).

It needs to be labelled with `console.openshift.io/dashboard: "true"`.

Dashboards have a `title` field as part of their Grafana JSON model, which is displayed in the OpenShift Console where dashboards are listed.
Here, `title` should be `Networking / Ingress`.

Dashboards should also have a tag that identifies their supplier in the OpenShift Console; existing tags are: `kubernetes-mixin`, `node-exporter-mixin`, `prometheus-mixin`, `etcd-mixin`.
A new tag named `networking-mixin` should be used for this new dashboard. This tag aims to group all dashboards related to networking, such as OVN dashboards and NetObserv dashboards that may be added in the future.
This tag is set directly in the static JSON definition of the dashboard; no further action is required for tag creation.
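
For illustration only, a trimmed-down dashboard definition could look like the sketch below. The exact Grafana JSON model (schema version, panel layout fields) is not prescribed by this proposal; only the `title`, the `networking-mixin` tag, and the PromQL expression come from the content described above.

```json
{
  "title": "Networking / Ingress",
  "tags": ["networking-mixin"],
  "panels": [
    {
      "title": "Byte rate in, per route",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(haproxy_server_bytes_in_total[1m])) by (route)",
          "legendFormat": "{{route}}"
        }
      ]
    }
  ]
}
```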

A typical procedure to design and create the dashboard is to use Grafana for the design, then export the dashboard as JSON and save it as an asset in the target repository. It can then be embedded in the built artifact using `go:embed` and injected into a `ConfigMap`. Example:

```golang
import (
    _ "embed" // required for the go:embed directive

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

//go:embed dashboard.json
var dashboardEmbed string

func buildDashboard() *corev1.ConfigMap {
    configMap := corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "grafana-dashboard-ingress",
            Namespace: "openshift-config-managed",
            Labels: map[string]string{
                "console.openshift.io/dashboard": "true",
            },
        },
        Data: map[string]string{
            "dashboard.json": dashboardEmbed,
        },
    }
    return &configMap
}
```

This is achieved by a new controller added to the operator, in charge of reconciling the dashboard. This controller watches the Infrastructure object to which it is bound, and the generated configmap.
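
A minimal controller wiring could look like the sketch below (assuming a recent controller-runtime version; the actual Ingress Operator uses its own controller scaffolding, so these names are purely illustrative and reuse the hypothetical `dashboardReconciler` from the earlier sketch):

```golang
import (
    "context"

    configv1 "github.com/openshift/api/config/v1"
    corev1 "k8s.io/api/core/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/handler"
)

// SetupWithManager registers the dashboard reconciler so that it is triggered by
// changes to the Infrastructure object and to the generated ConfigMap.
func (r *dashboardReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&configv1.Infrastructure{}).
        Watches(&corev1.ConfigMap{}, &handler.EnqueueRequestForObject{}).
        Complete(r)
}

// Reconcile re-applies the embedded dashboard whenever a watched object changes.
// Topology checks (e.g. skipping deployment on external topology) are omitted here.
func (r *dashboardReconciler) Reconcile(ctx context.Context, _ ctrl.Request) (ctrl.Result, error) {
    return ctrl.Result{}, r.ensureDashboard(ctx)
}
```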

The controller should deploy this configmap in the `openshift-config-managed` namespace. Any configmap deployed in this namespace with the `console.openshift.io/dashboard` label is automatically picked up by the monitoring operator and made available in the OpenShift Console, which queries the metrics defined in the dashboard through the monitoring stack API.

When the Ingress Operator is upgraded to a new version, if this upgrade brings changes to the dashboard, the existing `ConfigMap` will be overwritten through reconciliation.

Note that, despite this work taking place in the Ingress Operator codebase, the implementation and initial maintenance are done by the NetObserv team.

### Risks and Mitigations

### Drawbacks

## Design Details

### Test Plan

Unit tests:

- Verify that the embedded JSON is parseable (to avoid unintentional errors while manually editing the JSON); a sketch of such a test is shown below.
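
The following is an illustrative sketch of that unit test; it only asserts that the embedded asset (the `dashboardEmbed` variable from the example above) is valid JSON.

```golang
import (
    "encoding/json"
    "testing"
)

// TestDashboardJSONIsValid fails if the embedded dashboard asset is not valid JSON,
// guarding against accidental syntax errors when the file is edited by hand.
func TestDashboardJSONIsValid(t *testing.T) {
    var parsed map[string]interface{}
    if err := json.Unmarshal([]byte(dashboardEmbed), &parsed); err != nil {
        t.Fatalf("embedded dashboard.json is not valid JSON: %v", err)
    }
    if _, ok := parsed["title"]; !ok {
        t.Error("embedded dashboard.json has no top-level \"title\" field")
    }
}
```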

E2E tests:

There are two scenarios, depending on the cluster network topology.

1. Verify that the cluster network topology is not external.
2. Verify that the new dashboard `ConfigMap` was created.
3. Delete the `ConfigMap`.
4. Verify that the dashboard `ConfigMap` is recreated.
5. Modify the `ConfigMap` JSON.
6. Verify that the `ConfigMap` is reinitialized to the right value.

1. Verify that the cluster network topology is external.
2. Verify that the `ConfigMap` was not deployed.

### Graduation Criteria

This enhancement does not require graduation milestones.

#### Dev Preview -> Tech Preview

N/A; this feature will go directly to GA.

#### Tech Preview -> GA

N/A; this feature will go directly to GA.

#### Removing a deprecated feature

N/A

### Upgrade / Downgrade Strategy

Upgrading from a previous release must install the new dashboard without requiring any intervention.

On subsequent upgrades, if the dashboard already exists and the new version brings changes to it, the existing dashboard will be overwritten with the new one.

In case of a downgrade to a version that does not have this dashboard, the related `ConfigMap` would remain in the cluster, hence the dashboard would still be visible. This should not cause any trouble, since the related HAProxy metrics already existed in prior versions. If for some reason this is still perceived as an issue, the `ConfigMap` can be deleted manually to remove the dashboard:

```
oc delete configmap grafana-dashboard-ingress -n openshift-config-managed
```

### Version Skew Strategy

N/A

### Operational Aspects of API Extensions

#### Failure Modes

N/A

#### Support Procedures

- If the dashboard doesn't show up in the "Observe" > "Dashboards" menu:

Make sure a `ConfigMap` named `grafana-dashboard-ingress` was created in the namespace `openshift-config-managed`.
If not, check the Ingress Operator pod logs for any error that could be related to dashboard creation.
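
For instance, the `ConfigMap` can be checked with a standard `oc` command:

```
oc get configmap grafana-dashboard-ingress -n openshift-config-managed
```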

- If the dashboard shows unexpected empty charts:

Click the "Inspect" link to see the Prometheus query from the "Observe" > "Metrics" view. You can verify whether the metric exists by typing just the metric name: for instance, if the query is `avg(haproxy_server_http_average_response_latency_milliseconds != 0) by (route)`, look at just `haproxy_server_http_average_response_latency_milliseconds` and see if there is any data point.

If there is not, check whether other HAProxy metrics exist, such as `haproxy_server_up`. If not, something might be wrong in the HAProxy pods, as they don't seem to be generating any metrics.

Another possibility is that some of the HAProxy metrics definitions changed (metric names, or labels). In that case the dashboard needs to be updated: please open an issue.

## Implementation History

## Alternatives

A first alternative would be to expose the metrics so that a customer can create another dashboard outside of the console.

Another alternative would be to let customers create their own Grafana dashboard and then create the configmap with the exported JSON. This requires some knowledge about the different metrics and about the Prometheus query language. This alternative is already available to customers today.