Otel Agent Collector is not showing correct value for dropped spans metrics #34279

Open
edtshuma opened this issue Jul 28, 2024 · 3 comments
@edtshuma

Component(s)

No response

Describe the issue you're reporting

I have an OTEL Collector instance deployed in Gateway mode. When I query the dropped-spans metric in Grafana Explore, it reports nothing dropped, even though spans were dropped at that exact timestamp. I would like to alert on "dropped span" events, and I am starting with the following query:

otelcol_processor_dropped_spans_total{cluster_name="orion", service_name="otelcol-contrib"} @1721123498

but the query returns a count of 0:

otelcol_processor_dropped_spans_total{cluster_name="orion",instance=":8888",job="otel-agent",processor="memory_limiter",service_instance_id="6b4xxxxx-fxxx-4xxx-axxx-e1fxxxxxxxxx",service_name="otelcol-contrib",service_version="0.104.0"} 0

The OTEL Gateway receives spans, logs and metrics exported by agents running on multiple K8s clusters. On one of the K8s clusters I experienced data loss on a traces pipeline. Using LogQL I can confirm the dropped spans as below:

{namespace="monitoring", app="opentelemetry-collector", cluster_name="orion"} | json | level=~"error|warn" | ts=~"^1721123498.*"

and the output:

{"level":"error","ts":1721123498.01548,"caller":"exporterhelper/queue_sender.go:90","msg":"Exporting failed. Dropping data.","kind":"exporter","data_type":"traces","name":"zipkin/tempo","error":"no more retries left: failed the request with status code 413","dropped_items":2393,"stacktrace":"go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1\n\tgo.opentelemetry.io/collector/[email protected]/exporterhelper/queue_sender.go:90\ngo.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume\n\tgo.opentelemetry.io/collector/[email protected]/internal/queue/bounded_memory_queue.go:52\ngo.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1\n\tgo.opentelemetry.io/collector/[email protected]/internal/queue/consumers.go:43"}
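Because the drop is visible in the log rather than in the processor metric, one way to quantify it is to read `dropped_items` directly from the log records. This is a minimal sketch, assuming the collector logs are JSON lines shaped like the record above. The helper name `dropped_by_exporter` is hypothetical. The first sample line is a trimmed copy of that record with the stack trace omitted, and the second is a placeholder non-JSON line added only to show that such lines are skipped.

```python
import json
from collections import Counter

# Message text taken from the log record shown in the issue.
DROP_MSG = "Exporting failed. Dropping data."


def dropped_by_exporter(lines):
    """Sum dropped_items per (exporter name, data type) from JSON log lines."""
    totals = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict) or record.get("msg") != DROP_MSG:
            continue  # skip records that do not report dropped data
        key = (record.get("name"), record.get("data_type"))
        totals[key] += int(record.get("dropped_items", 0))
    return totals


# Trimmed copy of the record above (stack trace omitted), plus a placeholder line.
sample = [
    '{"level":"error","ts":1721123498.01548,"msg":"Exporting failed. Dropping data.",'
    '"kind":"exporter","data_type":"traces","name":"zipkin/tempo","dropped_items":2393}',
    "not a json line",
]

print(dropped_by_exporter(sample))
```

Grouping by exporter name and data type keeps drops from different pipelines separate, which matters when one gateway serves traces, logs and metrics.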

What's strange is that when I use the otelcol_processor_refused_spans_total metric:

otelcol_processor_refused_spans_total{cluster_name="orion", service_name="otelcol-contrib"} @1721123498

I get some results:

otelcol_processor_refused_spans_total{cluster_name="orion",instance=":8888",job="otel-agent",processor="memory_limiter",service_instance_id="6bXXXXXX-fXXX-4XXX-aXXX-e1fXXXXXXXXX",service_name="otelcol-contrib",service_version="0.104.0"} 38111

Although this metric may work for alerting, I would expect the more specific otelcol_processor_dropped_spans_total metric to report the drop.

What am I missing?
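The comparison between the two metrics can also be scripted against the Prometheus HTTP API, whose instant-query endpoint `/api/v1/query` accepts `query` and `time` parameters. The sketch below only builds the request URLs and sends nothing. The base URL `http://prometheus.example:9090` and the helper `instant_query_url` are hypothetical. The third metric, `otelcol_exporter_send_failed_spans_total`, is an assumption: the log record is emitted by the exporter's queue sender, so an exporter-side counter may be where this drop is recorded, but the exact name should be checked against the collector's own `:8888` metrics endpoint.

```python
from urllib.parse import urlencode

BASE_URL = "http://prometheus.example:9090"  # hypothetical Prometheus address
SELECTOR = '{cluster_name="orion", service_name="otelcol-contrib"}'
TIMESTAMP = 1721123498  # evaluation time used in the queries from the issue

METRICS = [
    "otelcol_processor_dropped_spans_total",
    "otelcol_processor_refused_spans_total",
    # Assumption: exporter-side counter, not mentioned in the issue.
    "otelcol_exporter_send_failed_spans_total",
]


def instant_query_url(metric, selector=SELECTOR, ts=TIMESTAMP, base_url=BASE_URL):
    """Build an instant-query URL that evaluates one metric at a fixed time."""
    params = urlencode({"query": metric + selector, "time": ts})
    return f"{base_url}/api/v1/query?{params}"


for metric in METRICS:
    print(instant_query_url(metric))
```

Passing the timestamp through the `time` parameter plays the same role as the `@` modifier in the queries above, so each metric is evaluated at the moment the log reports the drop.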

@github-actions

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

@atoulme

atoulme commented Oct 12, 2024

The status code returned is the clue here: HTTP 413 means the request entity was too large. The spans were explicitly refused. This is the correct behavior.

@atoulme removed the "needs triage" label Oct 12, 2024
@github-actions bot removed the "Stale" label Oct 12, 2024

@github-actions bot added the "Stale" label Dec 12, 2024