Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus metrics capture query was incorrect on OCPv4 #381

Merged
merged 2 commits into from
Jun 7, 2020

Conversation

agrare
Copy link
Member

@agrare agrare commented Jun 4, 2020

The prometheus metrics capture option was building a query that didn't return any data:

Metrics missing: [ContainerNode(1)] [#<ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode:0x00005622f0cbab88> crc-dv9sm-master-0] No data in response

The query string that was being built was incorrect: pod_name instead of pod, container_name instead of container, and the wrong job.

@agrare agrare requested review from cben and Fryguy as code owners June 4, 2020 16:27
@agrare agrare changed the title [WIP] Prometheus metrics capture query was incorrect [WIP] Prometheus metrics capture query was incorrect on OCPv4 Jun 4, 2020
@chessbyte chessbyte self-assigned this Jun 4, 2020
The prometheus metrics capture option was building a query that didn't
return any data:

```
Metrics missing: [ContainerNode(1)] [#<ManageIQ::Providers::Kubernetes::ContainerManager::ContainerNode:0x00005622f0cbab88> crc-dv9sm-master-0] No data in response
```
@agrare agrare force-pushed the fix_prometheus_metrics_capture branch from 6279318 to ebae628 Compare June 4, 2020 20:12
@agrare agrare changed the title [WIP] Prometheus metrics capture query was incorrect on OCPv4 Prometheus metrics capture query was incorrect on OCPv4 Jun 4, 2020
@agrare agrare removed the wip label Jun 4, 2020
@chessbyte
Copy link
Member

@agrare will this work on OCPv3 as well?

@agrare
Copy link
Member Author

agrare commented Jun 4, 2020

I'm not sure, it depends on when this changed on the openshift side. Was it while Prometheus was tech preview on v3.x? Then yes. If it was ok v4 then probably not.

If we need to split this into v3 and v4 Prometheus modules that'll take some more time.

@@ -44,7 +44,7 @@ def prometheus_options
}
end

def labels_to_s(labels, job = "kubernetes-cadvisor")
def labels_to_s(labels, job = "kubelet")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the source of all the differences - you're getting data from a different source.
There is definitely a cadvisor running in v4, but not sure about data. I'd like to look more into this.

Copy link
Member Author

@agrare agrare Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm I don't see any service other than kubelet which provides node metrics e.g. id="/" and for e.g. pod / container metrics if I add namespace= kubelet is the only job I see.

@chessbyte
Copy link
Member

If we need to split this into v3 and v4 Prometheus modules that'll take some more time.

Prefer to have v4 working first. Then we can add back v3 support, if needed.

@cben
Copy link
Contributor

cben commented Jun 4, 2020

If different queries are necessary, it's likely possible to craft a PromQL query that will work with either data without Ruby changes.

P.S. It's a shame there is no easy way to capture raw Prometheus data for tests and just evaluate varying PromQL on it. Then we could answer similar questions without a 3.11 cluster, and generally could test metrics better...
Prometheus has infrastructure for this in its own PromQL tests, but it's not conveniently exposed AFAIK.

@agrare
Copy link
Member Author

agrare commented Jun 4, 2020

I got access to a v3.11 cluster and it has the same bug as OCP v4 so it seems this fix would also fix v3.11

Actually it looks like the job="kubelet" but there is still pod_name instead of pod

If you know of a way to support both (all three?) of these that would be greatly appreciated

@chessbyte
Copy link
Member

@cben does this look good to merge to support v4 and we can look at getting v3 supported later, if needed?

@cben
Copy link
Contributor

cben commented Jun 7, 2020

Sure, if you're good with that 👍

Prometheus on 3.x was tech preview, I don't know how much effort it's worth to keep it working.

@oourfali
Copy link

oourfali commented Jun 7, 2020 via email

@chessbyte chessbyte merged commit 0c9dc84 into ManageIQ:master Jun 7, 2020
@agrare agrare deleted the fix_prometheus_metrics_capture branch June 7, 2020 18:01
simaishi pushed a commit that referenced this pull request Jun 11, 2020
Prometheus metrics capture query was incorrect on OCPv4

(cherry picked from commit 0c9dc84)
@simaishi
Copy link

Jansa backport details:

$ git log -1
commit 93e0dd6e4d25b050d3edf4d128bfd7e27eec9496
Author: Oleg Barenboim <[email protected]>
Date:   Sun Jun 7 09:45:03 2020 -0400

    Merge pull request #381 from agrare/fix_prometheus_metrics_capture

    Prometheus metrics capture query was incorrect on OCPv4

    (cherry picked from commit 0c9dc8452e674a01f9ec7b060a62fb9c02089b3a)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants