Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Low cardinality metrics issues #7719

Closed
antontroshin opened this issue May 7, 2024 · 13 comments
Closed

Low cardinality metrics issues #7719

antontroshin opened this issue May 7, 2024 · 13 comments
Labels
area/observability stale Issues and PRs without response
Milestone

Comments

@antontroshin
Copy link
Contributor

Intro

With the low cardinality metrics enabled, a lot of metrics are losing their value and in some cases even become completely unusable as they do not have any useful information provided except the application ID.
Monitoring systems and dashboards previously using high cardinality settings, are broken and there's seems no suitable replacement can be found with low cardinality enabled.
The value of having a low cardinality setting is understandable and the use-cases differ between the users, however having the option to have a middle ground would be great.
Please consider some scenarios where low cardinality might render metrics unusable (see below).

Test method

I've done an investigation by building a few simple applications using the code samples mentioned below.
Using simple applications to send/receive HTTP requests using 2 different methods, based on the examples provided in:
https://docs.dapr.io/developing-applications/building-blocks/service-invocation/howto-invoke-discover-services/#invoke-the-service (plain HTTP calls)
https://github.com/dapr/go-sdk/tree/main/examples/service/serving/http (SDK server)
https://github.com/dapr/go-sdk/tree/main/examples/service/client (SDK client)

Metrics checked

Application: service-invocation-http-client

Metrics: dapr_http_server_latency_ (_sum, _count, _bucket)

High cardinality:

dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="POST",path="/post-service-invocation-plain",status="200",le="100"} 1160
dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="GET",path="/get-service-invocation-plain",status="500",le="100"} 321
dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="DELETE",path="/delete-service-invocation-plain",status="500",le="100"} 299
dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="PUT",path="/put-service-invocation-plain",status="200",le="100"} 1149

Low cardinality:

dapr_http_server_latency_bucket{app_id="service-invocation-http-client",method="InvokeService/service-invocation-http-server",status="200",le="100"} 3584

Metrics: dapr_http_server_request_count

High cardinality:

dapr_http_server_request_count{app_id="service-invocation-http-client",method="GET",path="/get-service-invocation-plain",status="200"} 1742
dapr_http_server_request_count{app_id="service-invocation-http-client",method="PUT",path="/put-service-invocation-plain",status="200"} 1693

Low cardinality:

dapr_http_server_request_count{app_id="service-invocation-http-client",method="InvokeService/service-invocation-http-server",status="200"} 6057

Issue: With low cardinality calling all paths became merged into one single method name, while hiding the real path that is being called and each of those paths is likely to have a different latency.
The same goes for the detection of issues of certain paths, for example, POST can produce errors, which might be hidden, and no way to know which path produces error responses.
With latency metrics missing paths, monitoring will be unable to track which method call is slow or fast or having an issue, all percentiles become useless as well.
Example:

  • calls to /put-service-invocation-plain take 100ms
  • calls to /get-service-invocation-plain take 5ms

Applications: service-invocation-http-server, service-invocation-http-client

Metrics: dapr_http_server_request_bytes_ (_sum, _count, _bucket)

High cardinality:

dapr_http_server_request_bytes_bucket{app_id="service-invocation-http-client",le="4096"} 9702

Low cardinality:

dapr_http_server_request_bytes_bucket{app_id="service-invocation-http-client",le="4096"} 4950

Applications: service-invocation-http-server, service-invocation-http-client

Metrics: dapr_http_server_response_ (_sum, _count, _bucket)

High cardinality:

dapr_http_server_response_bytes_bucket{app_id="service-invocation-http-client",le="4096"} 9702
dapr_http_server_response_bytes_count{app_id="service-invocation-http-client"} 9702
dapr_http_server_response_bytes_count{app_id="service-invocation-http-server"} 1197

Low cardinality:

dapr_http_server_response_bytes_bucket{app_id="service-invocation-http-client",le="4096"} 8340
dapr_http_server_response_bytes_count{app_id="service-invocation-http-client"} 8340
dapr_http_server_response_bytes_count{app_id="service-invocation-http-server"} 998

Issue: The metrics don't record any labels other than the current app_id, for both plain HTTP client and server. Unable to differentiate the calls being made at all, only having the current application ID, missing source/destination app ID and the path/method being called.


Applications: service-invocation-http-server

Metrics: dapr_http_client_roundtrip_latency_ (_sum, _count, _bucket)

High cardinality:

dapr_http_client_roundtrip_latency_bucket{app_id="service-invocation-http-server",method="GET",path="dapr/config",status="404",le="3"} 1
dapr_http_client_roundtrip_latency_bucket{app_id="service-invocation-http-server",method="DELETE",path="delete-service-invocation-plain",status="500",le="13"} 691
dapr_http_client_roundtrip_latency_bucket{app_id="service-invocation-http-server",method="GET",path="get-service-invocation-plain",status="200",le="6"} 2857

Low cardinality:

dapr_http_client_roundtrip_latency_bucket{app_id="service-invocation-http-server",status="404",le="6"} 1

Issue: With the low cardinality, again all calls are merged into one, in this case even worse, the only labels left are app_id (name of the current application) and status. Making metrics unusable and unable to distinguish the specific calls and troubleshoot properly.
In the example samples above, the 404 status calls are coming from internal calls to the path dapr/config, which are not part of the application. In low cardinality mode, these 404 errors become part of the application and might falsely cause issues in monitoring systems making them think something is wrong with the actual application.


Applications: service-invocation-http-server

Metrics: dapr_http_client_sent_bytes_ (_sum, _count, _bucket)

High cardinality:

dapr_http_client_sent_bytes_bucket{app_id="service-invocation-http-server",method="GET",path="dapr/config",status="",le="65536"} 1
dapr_http_client_sent_bytes_bucket{app_id="service-invocation-http-server",method="DELETE",path="delete-service-invocation-plain",status="",le="16384"} 3576
dapr_http_client_sent_bytes_bucket{app_id="service-invocation-http-server",method="PUT",path="put-service-invocation-plain",status="",le="65536"} 3576

Low cardinality:

dapr_http_client_sent_bytes_bucket{app_id="service-invocation-http-server",status="",le="16384"} 12241

Issue: As in the above example, the same issue can be observed here, no way of knowing what path is being called, can't measure bytes sent. In addition, internal dapr/config is mixed into the application metric and the metric value is now incorrect.

Note: Using go-sdk for both client and server service invocations produces the same results except the path value as a full Dapr path call with version etc.

Additional Files

Attaching dumps from the 9090 port from all 8 applications used in the testing.

plain-client-high.txt
plain-client-low.txt
plain-server-high.txt
plain-server-low.txt
sdk-client-high.txt
sdk-client-low.txt
sdk-server-high.txt
sdk-server-low.txt

@artursouza
Copy link
Member

First, I propose that we don't make low cardinality the default behavior in 1.14.

Instead, we use 1.14 to implement the following in low cardinality:

  • Allow regex of URL patterns to be used in low cardinality too, if it matches, then add the path's pattern to the metric. If there is no match, then don't add the path (like it is today).
  • Always add the HTTP verb in low cardinality.
  • (If possible) always include the app's gRPC method being invoked

@artursouza
Copy link
Member

Addressing another point, low cardinality should be fine with adding path=/dapr/config or path=/healthz and any other Dapr-specific paths without an identifier.

@artursouza
Copy link
Member

artursouza commented May 9, 2024

I think there is a better solution than the custom regex. Dapr (in both low and high cardinality) would apply a series of builtin regexes to remove identifiers from the URL paths. If the path remains unmodified, low cardinality defaults to hiding the path while high cardinality defaults to showing the path in metrics. If custom regex and ID removal are both enabled, the built in ID removal is only applied if no custom regex matches.

Examples:

/customers/092e69cc-9950-49c1-ac53-ba2c5b47ab12 -> /customers/UUID
/customers/[email protected] -> /customers/EMAIL
/orders/123 -> /orders/NUMBER
/orders/092e69cc-9950-49c1-ac53-ba2c5b47ab12/items/100 -> /orders/UUID/items/NUMBER

In summary:

  • Don't make low cardinality default in 1.14
  • Add HTTP verb for all HTTP calls
  • Add all known Dapr paths to the app that don't contain an ID: /dapr/config, /healthz, /subscriptions
  • Offer regex in low cardinality mode as well as opt-in
  • Offer a configuration to opt-in for removal of IDs in low & high cardinality modes
  • (If possible) always include the app's gRPC method being invoked

Ideally, we converge into low cardinality with automatic ID removal being the default behavior.

@yaron2
Copy link
Member

yaron2 commented May 9, 2024

Ideally, we converge into low cardinality with automatic ID removal being the default behavior

Do you mean converge into high cardinality?

@artursouza
Copy link
Member

Ideally, we converge into low cardinality with automatic ID removal being the default behavior

Do you mean converge into high cardinality?

I mean that eventually (not in 1.14) we will have low cardinality with automatic ID removal for HTTP paths as the default behavior. I think high cardinality will be opt-in if not removed.

@alicejgibbons
Copy link
Contributor

I think Artur is saying that the low cardinality will be the default since that with the built-in regex identifiers only the remaining paths that the identifiers are not matched will be shown as the "low cardinality" metric we know today.

@ItalyPaleAle
Copy link
Contributor

The usage of regexs is going to be a problem for both performance and usability reasons.
There's also no way to be able to create regexs that match every kind of ID perfectly. What if the ID is a slug ("/blog/my-post-title"), a phone number (every country has its own different format), a national ID card number (again different per country/state)?

@ItalyPaleAle
Copy link
Contributor

I think there's some additional clarification that was provided, based on the context from the issue and PR: #6919 #6723 #6581

  1. Every new "pivot" that gets added to the metrics causes significant memory usage, since it creates a new bucket. We have seen Dapr crashing in production with OOMs after using 500MB or more of memory just for the metrics subsystem, as Dapr was used for service invocation against RESTful endpoints (where an ID was part of the URL).
    The point of low cardinality was indeed to have as little cardinality as possible. Anything that adds a bucket for each HTTP verb, or partial URL, is not "low" cardinality.
  2. There's also a matter of symmetry with gRPC APIs. Even before the changes in Dapr 1.13, gRPC APIs were always low-cardinality, due to the fact that the "URL" was just the method name. With the low cardinality change implement for HTTP, the behavior is now consistent.
  3. Users have always been able to use regular expressions to "trim" cardinality for metrics emitted by the HTTP endpoints, but:
    • There's the old mantra that if you try to solve a problem with a regular expression, now you have two problems (see Jeff Atwood)
    • Regexs have a non-insignificant performance impact

I understand that there are users who do want to see higher cardinality for HTTP metrics. This is why we maintained a configuration option metrics.increasedCardinality which preserves the same behavior as Dapr 1.12. The default value for that changes to false in 1.14, but that is just the default value, and users can switch a configuration option.

@nelson-parente
Copy link
Contributor

nelson-parente commented May 16, 2024

Following up on the feedback above we would like to propose a few adjustments to the initial proposal and we will work on the items:

  • Open a Dapr proposal for path matching API. HTTP Metrics Path Matching proposals#58
  • Keep Low cardinality will be the default on 1.14
  • Include HTTP Verbs again
  • Include well-known Dapr paths (/dapr/config , /healthz , /subscriptions , etc)
  • Include status codes grouped: 2XX , 3XX , 4XX ,5XX
  • Add a new Top-level increasedCardinality config, which will be used if no app-level config is set.

@artursouza
Copy link
Member

artursouza commented May 25, 2024

I think there's some additional clarification that was provided, based on the context from the issue and PR: #6919 #6723 #6581

  1. Every new "pivot" that gets added to the metrics causes significant memory usage, since it creates a new bucket. We have seen Dapr crashing in production with OOMs after using 500MB or more of memory just for the metrics subsystem, as Dapr was used for service invocation against RESTful endpoints (where an ID was part of the URL).
    The point of low cardinality was indeed to have as little cardinality as possible. Anything that adds a bucket for each HTTP verb, or partial URL, is not "low" cardinality.

  2. There's also a matter of symmetry with gRPC APIs. Even before the changes in Dapr 1.13, gRPC APIs were always low-cardinality, due to the fact that the "URL" was just the method name. With the low cardinality change implement for HTTP, the behavior is now consistent.

  3. Users have always been able to use regular expressions to "trim" cardinality for metrics emitted by the HTTP endpoints, but:

    • There's the old mantra that if you try to solve a problem with a regular expression, now you have two problems (see Jeff Atwood)
    • Regexs have a non-insignificant performance impact

I understand that there are users who do want to see higher cardinality for HTTP metrics. This is why we maintained a configuration option metrics.increasedCardinality which preserves the same behavior as Dapr 1.12. The default value for that changes to false in 1.14, but that is just the default value, and users can switch a configuration option.

Regarding point (1), I made some comments in the proposal by @nelson-parente that will give a way for users and platforms to keep the existing behavior (if so they want).

Now on point (2), when discussing gRPC<->HTTP for service invocation, Dapr maintainers have moved away from trying to match HTTP and gRPC protocols a while ago, making it a non-issue. One of the consequences of that decision is that Dapr does not convert HTTP to gRPC or vice versa when invoking services, meaning the requestor's app must invoke with the same protocol that the receiving app listens to (sidecar to sidecar is always over gRPC but that is a "tunnel" for the original call and does not need to adhere to the app's protocol).

@dapr-bot
Copy link
Collaborator

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@dapr-bot dapr-bot added the stale Issues and PRs without response label Jul 24, 2024
@cicoyle cicoyle removed the stale Issues and PRs without response label Jul 25, 2024
@dapr-bot
Copy link
Collaborator

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

@dapr-bot dapr-bot added the stale Issues and PRs without response label Sep 23, 2024
@dapr-bot
Copy link
Collaborator

This issue has been automatically closed because it has not had activity in the last 67 days. If this issue is still valid, please ping a maintainer and ask them to label it as pinned, good first issue, help wanted or triaged/resolved. Thank you for your contributions.

@dapr-bot dapr-bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 30, 2024
@github-project-automation github-project-automation bot moved this from Needs Owner to Done in v1.14 Release Tracking Board Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/observability stale Issues and PRs without response
Projects
Development

Successfully merging a pull request may close this issue.

8 participants