tail_sampling drops otel.library.name span attribute #13642

Closed
ryepup opened this issue Aug 26, 2022 · 18 comments
Labels: bug (Something isn't working), discussion needed (Community discussion needed), help wanted (Extra attention is needed), processor/tailsampling (Tail sampling processor)

Comments


ryepup commented Aug 26, 2022

Describe the bug

Spans that pass through a tail_sampling processor (even with an always_sample policy) have their otel.library.name attribute removed.

Steps to reproduce

Create an otel-collector configured with two pipelines, one that does tail_sampling and one that doesn't. Send a trace to each, using go.opentelemetry.io/otel. I've packaged this up with docker-compose in a gist.

  1. download files from https://gist.github.com/ryepup/d70eebd8b0e6bb1a4578ac7c06a842cf
  2. docker-compose up
  3. open Grafana at http://localhost:3000/explore
  4. go to the Search tab, then hit "Run query"; you should see two traces

What did you expect to see?

Both traces have an otel.library.name attribute that matches the name given to TracerProvider.Tracer.
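
The name in question comes from the tracer name set in the Go SDK. A minimal sketch of what the instrumented side looks like (illustrative names, not the exact gist contents; provider and exporter setup omitted):

package main

import (
    "context"

    "go.opentelemetry.io/otel"
)

func main() {
    // The string passed to Tracer() becomes the instrumentation scope
    // (library) name, which shows up downstream as otel.library.name.
    // "demo-tracer" and "demo-span" are illustrative values.
    tracer := otel.Tracer("demo-tracer")
    _, span := tracer.Start(context.Background(), "demo-span")
    span.End()
}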

What did you see instead?

The trace that went through a pipeline with tail_sampling did not have an otel.library.name attribute.

What version did you use?

dockerized otel/opentelemetry-collector-contrib:0.58.0

What config did you use?

https://gist.github.com/ryepup/d70eebd8b0e6bb1a4578ac7c06a842cf#file-otel-yaml

Environment

Seen in dockerized local development on macOS and in Linux-based Kubernetes on Azure.

Additional context

My production code passes through https://github.com/grafana/agent, but the problem is reproducible without that layer.

Other span attributes seem unaffected, even when added via another processor.

ryepup added the bug label on Aug 26, 2022
@jpkrohling (Member)

This is tricky: we are re-assembling the spans for a trace into a new ResourceSpans object, and those spans likely came from different sources with differing instrumentation libraries. We could reorganize things into a single pdata.Traces with several ResourceSpans inside it, one for each unique Resource+InstrumentationLibrary, but I feel like this would be highly inefficient.

I'm leaving this open so that we can hear other people's opinions.

@ryepup, would you be willing to perform the mentioned changes and check the performance difference?
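
For illustration, the regrouping described above could look roughly like this with the collector's pdata API; this is a hedged sketch, not the processor's actual code, and the group/regroup names are hypothetical:

package sketch

import (
    "go.opentelemetry.io/collector/pdata/pcommon"
    "go.opentelemetry.io/collector/pdata/ptrace"
)

// group holds the sampled spans for one unique Resource + InstrumentationScope pair.
type group struct {
    resource pcommon.Resource
    scope    pcommon.InstrumentationScope
    spans    []ptrace.Span
}

// regroup rebuilds a ptrace.Traces with one ResourceSpans/ScopeSpans per
// unique Resource + scope pair, instead of flattening all sampled spans
// into a single anonymous ResourceSpans (which is what loses otel.library.name).
func regroup(groups []group) ptrace.Traces {
    out := ptrace.NewTraces()
    for _, g := range groups {
        rs := out.ResourceSpans().AppendEmpty()
        g.resource.CopyTo(rs.Resource())
        ss := rs.ScopeSpans().AppendEmpty()
        g.scope.CopyTo(ss.Scope())
        for _, s := range g.spans {
            s.CopyTo(ss.Spans().AppendEmpty())
        }
    }
    return out
}

The performance concern above comes from the extra copying and per-trace bookkeeping this implies.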

jpkrohling added the processor/tailsampling label on Aug 29, 2022
codeboten added the discussion needed label and removed the needs-discussion label on Sep 9, 2022

paychex-ssmithrand commented Sep 13, 2022

@jpkrohling As a workaround, is there a processor that can access the InstrumentationScope information and set it as an attribute on each individual Span before the span gets to the Tail Sampling Processor? We, for example, are reliant on Instrumentation Library information in order to classify the Span in a custom exporter component we've developed.

@jpkrohling (Member)

Could you check whether the transform processor can read the instrumentation scope?


parkedwards commented Oct 19, 2022

> Could you check whether the transform processor can read the instrumentation scope?

@jpkrohling if I'm not mistaken, it looks like this was recently updated to have access to this scope, at least via OTTL?
#14892

Update: looks like this isn't yet in a released version (currently at 0.62.1)

@Aneurysm9 (Member)

> This is tricky: we are re-assembling the spans for a trace into a new ResourceSpans object, and those spans likely came from different sources with differing instrumentation libraries. We could reorganize things into a single pdata.Traces with several ResourceSpans inside it, one for each unique Resource+InstrumentationLibrary, but I feel like this would be highly inefficient.

Isn't this precisely what the component should be doing? If a sampler is losing information on spans that are not dropped, that seems like a problem.


parkedwards commented Oct 20, 2022

I'm also wondering if this is less that the span attribute is dropped, and more that the telemetry loses its view of the instrumentation scope itself when it is tail-sampled.

This is a tough tradeoff for us: we'd like to do some trace-specific sampling at the tail, but it means we lose instrumentation granularity when the traces are reassembled.


jpkrohling commented Oct 20, 2022

> If a sampler is losing information on spans that are not dropped, that seems like a problem.

That's true. I'm not sure what the original engineer had in mind when this component was written, but I believe they decided to gain some performance by avoiding the extra processing of reorganizing this data.

I would be open to reviewing a PR with a proposal to change the current behavior. Bonus points if it could include a benchmark so that we can understand what we are giving up in terms of performance.

@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Dec 20, 2022
fatsheep9146 added the help wanted label and removed the Stale label on Dec 20, 2022
@roryharness

This caused major confusion and issues exporting traces & APM data to Datadog.

It appears that operation names and the trace.* APM metrics use the library name, following the pattern trace.<otel.library>.<metric>. Examples below:
trace.opentelemetry_ecto.client => trace.opentelemetry.client
trace.opentelemetry_tesla.client => trace.opentelemetry.client
trace.opentelemetry_redix.client => trace.opentelemetry.client
This was not only confusing, but it ultimately combined metrics from all clients, making the APM metrics inaccurate.

We were able to work around this by implementing the DatadogProcessor, but I would love to see this fixed so that the operation names in traces and metrics line up again.

@jpkrohling (Member)

@roryharness, agreed, this would be great to have. From my previous message:

> I would be open to reviewing a PR with a proposal to change the current behavior. Bonus points if it could include a benchmark so that we can understand what we are giving up in terms of performance.

@parkedwards

@roryharness out of curiosity - what was your setup to retain the otel library name while tail sampling?

@paychex-ssmithrand

> @roryharness out of curiosity - what was your setup to retain the otel library name while tail sampling?

The workaround we implemented was to add a custom span processor that the SDK uses in code. This span processor takes the instrumentation library name and sets it as an attribute on each span:

// In a custom SpanProcessor registered with the SDK TracerProvider:
onStart(span, parentContext) {
  // copy the instrumentation library name onto the span as a regular attribute
  span.setAttribute("instrumentationLibrary.name", span.instrumentationLibrary.name);
}

It is not elegant, but it works for now if you really, really need the data to be there by the time it gets to tail_sampling.
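
For Go services (as in the original reproduction), a roughly equivalent workaround is a custom SpanProcessor in the Go SDK. This is a hedged sketch with an illustrative attribute key, not code from this thread; note that InstrumentationScope() is the accessor in recent SDK versions (older versions expose InstrumentationLibrary() instead):

package tracing

import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// scopeAttrProcessor copies the instrumentation scope (library) name onto
// every span as a plain attribute so it survives tail sampling.
type scopeAttrProcessor struct{}

func (scopeAttrProcessor) OnStart(_ context.Context, s sdktrace.ReadWriteSpan) {
    s.SetAttributes(attribute.String("instrumentationLibrary.name", s.InstrumentationScope().Name))
}

func (scopeAttrProcessor) OnEnd(sdktrace.ReadOnlySpan)      {}
func (scopeAttrProcessor) Shutdown(context.Context) error   { return nil }
func (scopeAttrProcessor) ForceFlush(context.Context) error { return nil }

// Register it when building the provider, for example:
//   tp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(scopeAttrProcessor{}))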


parkedwards commented May 12, 2023

Thanks @paychex-ssmithrand - I was able to do the same, but I'm uncovering some other interesting behaviors through this experiment.

For context, I'm using the tail sampling processor similarly to @roryharness: I'm collecting spans (from the Python OTel SDK) through the Collector and passing them into a pipeline of processor/datadog (for accurate APM stats) -> processor/tailsampling -> exporter/datadog. The processor/datadog step solves the metric name misalignment that @roryharness mentioned.

However, the spans themselves appear to be missing some span-level metadata after passing through processor/tailsampling. It's unclear to me whether it's span.name (this is obscured from me in the Datadog layer), but the end result is that spans in the vendor are labelled generically, with some default value instead of what we set at the library level. I've ruled out that it's only otel.library.name, because I added a custom span processor at the SDK level to explicitly set that property and I do see it on the vendor side, though with the same default span name behavior.

# SDK level
name = "my-custom-span-name"
with tracer.start_as_current_span(name):
  ...

# After tail sampling, the span name in the vendor shows as:
opentelemetry.internal

Obviously this is dependent on the way that Datadog (the vendor) parses these spans, but I'm trying to better understand which span attributes/metadata are being lost in processor/tailsampling other than otel.library.name. After that, I may be able to dig into the processor code to see how to rectify this.

@github-actions (bot)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jul 12, 2023
@jpkrohling (Member)

> but I'm uncovering some other interesting behaviors through this experiment

Can you open an issue for that? I could probably look into that myself.


siriusfreak commented Aug 11, 2023

Hello everyone! I have prepared a fix for this problem. The fix has already been tested in a production environment under a load of 1k rps and everything looks good.

Please review and add comments! I will be happy to contribute to this project!

PR: #25117

@jpkrohling (Member)

I reviewed this and it looks good to me, but you might need to sort out the CLA check.

TylerHelmuth pushed a commit that referenced this issue Aug 31, 2023
…info (#25117)

**Description:** Added saving of instrumentation library (scope) information.
Fixes issue:
#13642

How it was solved:
1. Save information about the instrumentation scope for buffered traces (processor/tailsamplingprocessor/processor.go:307).
2. Sampled spans are now released under their original instrumentation scope (processor/tailsamplingprocessor/processor.go:435).

**Testing:** in production under high load; unit tests in progress.

**Documentation:** no documentation changes.

songy23 commented Aug 31, 2023

Fixed by #25117
