
Static mode tracing: Generate spanmetrics after load balancing. #5889

Merged: 2 commits into main, Dec 8, 2023

Conversation

@ptodev ptodev (Contributor) commented on Nov 30, 2023

PR Description

As the OTel documentation for the loadbalancing exporter states, a routing key of service must be used when combining load balancing with spanmetrics:

The routing_key property is used to route spans to exporters based on different parameters. This functionality is currently enabled only for trace pipeline types. It supports one of the following values:

  • service: exports spans based on their service name. This is useful when using processors like the span metrics, so all spans for each service are sent to consistent collector instances for metric collection. Otherwise, metrics for the same services are sent to different collectors, making aggregations inaccurate.
  • traceID (default): exports spans based on their traceID.

Currently, static mode generates spanmetrics before it has even done load balancing. This PR changes it so that spanmetrics are generated after load balancing.
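For reference, the routing key mentioned in the quoted documentation is set on the loadbalancing exporter itself. A rough sketch of an upstream collector configuration, with placeholder backend hostnames:

exporters:
  loadbalancing:
    routing_key: service
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      static:
        hostnames:
          - "backend-1:4317"
          - "backend-2:4317"

With routing_key: service, all spans for a given service land on the same backend, so the spanmetrics generated there aggregate correctly.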

In the code we don't check whether the user actually set the routing key to service when using spanmetrics together with load balancing. Similarly, we don't check whether the routing key is set to traceID when tail sampling is used.

Unfortunately this seems to be a very long-standing bug. It was apparently first introduced in #616 on May 26, 2021, which was merged shortly after spanmetrics was added to the Agent via #499 on April 5, 2021. At the time, the load balancing implications of using spanmetrics were probably not well understood.

I think users affected by this bug would probably see errors when remote writing metrics to Mimir, because different Agents might try to remote write the same series. So that's a "good" thing about this bug: if it causes a real problem, hopefully affected users have already seen such errors and acted on them.

I also tested this locally using a configuration like this:

Agent config
server:
  log_level: debug

traces:
  configs:
  - name: default
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: "0.0.0.0:4320"
    load_balancing:
      exporter:
        insecure: true
      resolver:
        static:
          hostnames:
            - "localhost:4321"
      receiver_port: 4321
    spanmetrics:
      handler_endpoint: "localhost:8898"
      namespace: "paulin_test_"
    remote_write:
      - endpoint: tempo-prod-06-prod-gb-south-0.grafana.net:443
        basic_auth:
          username:
          password:

When running my local Agent I could see the spanmetrics on http://localhost:8898/metrics, and, via the Agent's own metrics on http://localhost:12345/metrics, I could see that trace data was being received and sent to Tempo.
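For completeness: assuming the static mode load_balancing block also exposes the exporter's routing_key field (an assumption on my part; check the traces_config reference for the Agent version in use), the test config above could state the spanmetrics-friendly routing explicitly:

load_balancing:
  # routing_key is assumed here to mirror the OTel loadbalancing exporter's setting;
  # "service" keeps all spans of a service on one backing Agent so spanmetrics aggregate correctly.
  routing_key: service
  exporter:
    insecure: true
  resolver:
    static:
      hostnames:
        - "localhost:4321"
  receiver_port: 4321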

PR Checklist

  • CHANGELOG.md updated
  • Documentation added
  • Tests updated
  • Config converters updated

@ptodev ptodev force-pushed the ptodev/spanmetrics-should-go-in-backing-agent branch from c19e190 to 65d45d7 on December 7, 2023, 12:47
@@ -978,6 +980,7 @@ func orderProcessors(processors []string, splitPipelines bool) [][]string {
 if processor == "batch" ||
 	processor == "tail_sampling" ||
 	processor == "automatic_logging" ||
+	processor == "spanmetrics" ||
Member commented on this change:

IIUC, this is the main bug fix, so that if we encounter the spanmetrics processor, we order it accordingly, right?

Contributor Author (ptodev) replied:

Yep, that's the main bugfix.

if we encounter the spanmetrics processor, we order it accordingly, right?

I suppose your question is about Flow, since static mode orders them automatically? In Flow, if there are two Agents ingesting traces in a load-balanced way, they need to be behind a load balancing exporter for spanmetrics to work correctly.

@tpaschalis tpaschalis (Member) left a comment:

As far as I can tell this looks good, nice catch! Feel free to merge this!

It'd be kinda harder in Flow mode, but do you think similar warning(s) would make sense there? If so, let's just open a tracking issue so we don't forget about it.

@ptodev (Contributor Author) commented on Dec 8, 2023:

It'd be kinda harder in Flow mode, but do you think similar warning(s) would make sense there? If so, let's just open a tracking issue so we don't forget about it.

It'd be hard to put such warnings in Flow mode, because we can't assume much about what users want to do. Static mode is more restrictive, so it makes a bit more sense there. But even in static mode these warnings don't always make sense, because the Agent doesn't know whether it's running in a cluster or not.

@ptodev ptodev merged commit d7fbffa into main Dec 8, 2023
8 checks passed
@ptodev ptodev deleted the ptodev/spanmetrics-should-go-in-backing-agent branch December 8, 2023 15:27