
Static mode tracing: Generate spanmetrics after load balancing. #5889

Merged: 2 commits into main, Dec 8, 2023

Conversation

@ptodev ptodev (Contributor) commented on Nov 30, 2023

PR Description

As the OTel documentation for the loadbalancing exporter states, a routing key of service must be used when combining load balancing with spanmetrics:

The routing_key property is used to route spans to exporters based on different parameters. This functionality is currently enabled only for trace pipeline types. It supports one of the following values:

  • service: exports spans based on their service name. This is useful when using processors like the span metrics, so all spans for each service are sent to consistent collector instances for metric collection. Otherwise, metrics for the same services are sent to different collectors, making aggregations inaccurate.
  • traceID (default): exports spans based on their traceID.

Currently, static mode generates spanmetrics before it has even done load balancing. This PR changes it so that spanmetrics are generated after load balancing.
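For reference, the routing key mentioned in the quoted documentation is set on the loadbalancing exporter itself. A rough sketch of an upstream collector configuration, with placeholder backend hostnames:

exporters:
  loadbalancing:
    routing_key: service
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - "backend-1:4317"
          - "backend-2:4317"

With routing_key: service, all spans for a given service land on the same backend, so the spanmetrics generated there aggregate correctly.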

In the code we don't check whether the user actually set the routing key to service when using spanmetrics together with load balancing. Similarly, we don't check whether the routing key is set to traceID when tail sampling is used.

Unfortunately this seems to be a very long-standing bug. It was apparently first introduced in #616 on May 26, 2021, which was merged shortly after spanmetrics was added to the Agent via #499 on April 5, 2021. At the time, the load balancing implications of using spanmetrics were probably not well understood.

I think users affected by this bug would probably see errors when remote writing metrics to Mimir, because different Agents might try to remote write the same series. So that's a "good" thing about this bug: if it causes a real problem, hopefully affected users have already seen such errors and acted on them.

I also tested this locally using a configuration like this:

Agent config
server:
  log_level: debug

traces:
  configs:
  - name: default
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4320"
    load_balancing:
      exporter:
        insecure: true
      resolver:
        static:
          hostnames:
            - "localhost:4321"
      receiver_port: 4321
    spanmetrics:
      handler_endpoint: "localhost:8898"
      namespace: "paulin_test_"
    remote_write:
      - endpoint: tempo-prod-06-prod-gb-south-0.grafana.net:443
        basic_auth:
          username:
          password:

When running my local Agent I could see the spanmetrics on http://localhost:8898/metrics, and, via the Agent's own metrics on http://localhost:12345/metrics, I could see that trace data was being received and sent to Tempo.
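For completeness: assuming the static mode load_balancing block also exposes the exporter's routing_key field (an assumption on my part; check the traces_config reference for the Agent version in use), the test config above could state the spanmetrics-friendly routing explicitly:

load_balancing:
  # routing_key is assumed here to mirror the OTel loadbalancing exporter's setting;
  # "service" keeps all spans of a service on one backing Agent so spanmetrics aggregate correctly.
  routing_key: service
  exporter:
    insecure: true
  resolver:
    static:
      hostnames:
        - "localhost:4321"
  receiver_port: 4321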

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@ptodev ptodev force-pushed the ptodev/spanmetrics-should-go-in-backing-agent branch from c19e190 to 65d45d7 on December 7, 2023, 12:47
@@ -978,6 +980,7 @@ func orderProcessors(processors []string, splitPipelines bool) [][]string {
 if processor == "batch" ||
 	processor == "tail_sampling" ||
 	processor == "automatic_logging" ||
+	processor == "spanmetrics" ||
Member commented on this change:

IIUC, this is the main bug fix, so that if we encounter the spanmetrics processor, we order it accordingly, right?

Contributor Author (ptodev) replied:

Yep, that's the main bugfix.

if we encounter the spanmetrics processor, we order it accordingly, right?

I suppose your question is about Flow, since static mode orders them automatically? In Flow, if there are two Agents ingesting traces in a load-balanced way, they need to be behind a load balancing exporter for spanmetrics to work correctly.

@tpaschalis tpaschalis (Member) left a comment:

As far as I can tell this looks good, nice catch! Feel free to merge this!

It'd be kinda harder in Flow mode, but do you think similar warning(s) would make sense there? If so, let's just open a tracking issue so we don't forget about it.

@ptodev (Contributor Author) commented on Dec 8, 2023:

It'd be kinda harder in Flow mode, but do you think similar warning(s) would make sense there? If so, let's just open a tracking issue so we don't forget about it.

It'd be hard to put such warnings in Flow mode, because we can't assume much about what users want to do. Static mode is more restrictive, so it makes a bit more sense there. But even in static mode these warnings don't always make sense, because the Agent doesn't know whether it's running in a cluster or not.

@ptodev ptodev merged commit d7fbffa into main Dec 8, 2023
8 checks passed
@ptodev ptodev deleted the ptodev/spanmetrics-should-go-in-backing-agent branch December 8, 2023 15:27