Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus metrics cardinality for the /decisions endpoint #446

Closed
claudio-benfatto opened this issue May 22, 2020 · 4 comments
Closed

Prometheus metrics cardinality for the /decisions endpoint #446

claudio-benfatto opened this issue May 22, 2020 · 4 comments

Comments

@claudio-benfatto
Copy link
Contributor

claudio-benfatto commented May 22, 2020

Describe the bug

More than a bug I would consider this a potentially unwanted behaviour.
My use case it to expose the request metrics by endpoint and status code for a high level understanding of how the service is performing/accessed. ie. number of 200s, 403s, 500s and so on...

Unfortunately the ory_oathkeeper_request_* metrics contain a very high cardinality label (request=) for the /decisions endpoint.
This is due to the fact that, by design, this endpoint is extremely flexible and dynamic and that the uri includes all the unique parts of the original request.

When oathkeeper is used in a multitenant environment with multiple services the cardinality of the resulting metrics is almost unmanageable.

Reproducing the bug

Run a recent version of oathkeeper, ie. > v0.38.1

Make sure the service is listening on the prometheus port:

INFO[0000] No tracer configured - skipping tracing setup
INFO[0000] Listening on http://:9000

use curl to send some requests to the /decisions endpoint, ie.

for service in service1 service2 service3
do
  for uri in users topics
  do
    curl http://localhost:4456/decisions/${service}/${uri}
  done
done

curl the /metrics endpoint:

curl -s http://localhost:9000/metrics | grep ory_oathkeeper_requests_total | grep -v "#"

You'll get an unique metric for each combination of ${service}, ${uri}

ory_oathkeeper_requests_total{method="GET",request="/decisions/service1/topics",service="oathkeeper-api",status_code="404"} 2
ory_oathkeeper_requests_total{method="GET",request="/decisions/service1/users",service="oathkeeper-api",status_code="404"} 2
ory_oathkeeper_requests_total{method="GET",request="/decisions/service2/topics",service="oathkeeper-api",status_code="404"} 2
ory_oathkeeper_requests_total{method="GET",request="/decisions/service2/users",service="oathkeeper-api",status_code="404"} 2
ory_oathkeeper_requests_total{method="GET",request="/decisions/service3/topics",service="oathkeeper-api",status_code="404"} 2
ory_oathkeeper_requests_total{method="GET",request="/decisions/service3/users",service="oathkeeper-api",status_code="404"} 2

Expected behavior

It would be nice to have the possibility not to include the uri part in the request label to better control the metrics' cardinality.
Up to a certain amount of metrics, this may even become an availability problem for the /metrics endpoint itself as the labels are potentially unbounded

I badly hacked the code to demonstrate via a code example what would be the expected behaviour, and here it is:

// UpdateRequest tracks total requests done
func (r *PrometheusRepository) UpdateRequest(service, request, method string, statusCode int) {

	re := regexp.MustCompile(`^/decisions/`)
	includeDecisionsURI := true

        // if intercepting a request to the /decisions endpoint generate a higher level metric
	if re.Match([]byte(request)) {
		RequestTotal.With(prometheus.Labels{
			"service":     service,
			"method":      method,
			"request":     "/decisions",
			"status_code": strconv.Itoa(statusCode),
		}).Inc()
	}
        // optionally generate more granular level metrics if required
	if includeDecisionsURI {
		RequestTotal.With(prometheus.Labels{
			"service":     service,
			"method":      method,
			"request":     request,
			"status_code": strconv.Itoa(statusCode),
		}).Inc()
	}
}

Environment

  • Version: v0.38.1
  • Environment: Build and run locally
@aeneasr
Copy link
Member

aeneasr commented May 22, 2020

Yeah, that makes total sense - PRs welcomed!

@claudio-benfatto
Copy link
Contributor Author

ok :) I'm fine giving an attempt to the implementation, but I would like, maybe, to discuss a bit beforehand the expected behaviour.

I see at least a few ways to approach this, with more or less flexible solutions.
Starting from the least flexible :), would a boolean flag set to true just to disable the subpaths for the /decisions uri, something like:

serve:
  prometheus:
    port: 9000
    host: localhost
    metrics_path: /metrics
    collapse_decisions_path: true

be enough?

@aeneasr
Copy link
Member

aeneasr commented May 23, 2020

Yeah, absolutely! I also think we can default to true here.

@mehdimas
Copy link
Contributor

mehdimas commented Apr 11, 2024

@aeneasr What about truncating query strings? I'm seeing high cardinality of metrics due to query strings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants