
[processor/tailsamplingprocessor] and policy appears to ignore string_attribute match results #9553

Closed
tillig opened this issue Apr 26, 2022 · 6 comments · Fixed by #9768
Labels
bug Something isn't working

Comments

@tillig (Contributor) commented Apr 26, 2022

Describe the bug

Given a tailsamplingprocessor with an and policy that has both string_attribute and probabilistic sampling, the and policy appears to ignore the results of the string_attribute match and falls back on only the probabilistic results.

I have a service in Kubernetes that is getting scraped with Prometheus. I am trying to configure tail sampling to:

  • Keep all "slow" calls (those that take > 5s)
  • Keep all error calls (either the trace status is ERROR or the response code >= 500)
  • For the remaining calls...
    • Drop all the "noise" - /metrics, /health, GRPC calls to export traces (where that value is in the http.url attribute)
    • Sample 10% of the calls that aren't noise and keep those

The policies to keep slow and error calls work perfectly. However, I still get ~10% of the calls to /metrics despite having an and policy set up to combine string_attribute with probabilistic sampling.

Steps to reproduce

In Kubernetes:

  • Deploy one instance of OpenTelemetry Collector via the OpenTelemetry Operator (a sketch of the operator resource follows after this list). Use the otel/opentelemetry-collector-contrib container. Relay traces to an endpoint like Jaeger or Dynatrace (I'm using Dynatrace).
  • Deploy a test service that will get scraped with Prometheus and also offer endpoints for general consumption.
  • Cause an error - note the trace gets forwarded to the endpoint as expected.
  • Wait while Prometheus scrapes the metrics, about 10 - 15 minutes. You'll see traces from /metrics show up in the trace UI, which indicates the tailsamplingprocessor was not properly dropping the matching traces.

See config section below for configuration.
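
For reference, a minimal sketch of how that deployment might look as an OpenTelemetry Operator resource (the resource name is a placeholder, and the embedded config is the one shown in the config section below):

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: tail-sampling-poc   # placeholder name
spec:
  mode: deployment
  image: otel/opentelemetry-collector-contrib:0.48.0
  config: |
    # The full collector configuration from the "What config did you use?"
    # section goes here verbatim.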

What did you expect to see?

I expect to see no traces from /metrics because the string_attribute policy should filter them out.

What did you see instead?

I see periodic traces from /metrics coming through, with approximately the 10% sampling expected from the probabilistic policy.

Here is a screenshot of one of the traces that snuck through OpenTelemetry Collector and into Dynatrace. You can see here that:

  • The operation completed in 1.58ms, far lower than the 5000ms threshold for the latency policy in my config.
  • The trace was a success, so it wasn't captured based on the status_code policy.
  • The http.url trace attribute is clearly present and there is only one span in the whole trace, so it wasn't a problem with the tail sampling decision_wait being too short or with the attribute being missing.

[Screenshot: example trace that got through the tailsamplingprocessor]

What version did you use?

Version: 0.48.0 of the otel/opentelemetry-collector-contrib container.

What config did you use?

extensions:
  memory_ballast:
    size_mib: 683
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  otlphttp/dynatrace:
    endpoint: "${DYNATRACE_ENDPOINT}/api/v2/otlp"
    headers:
      Authorization: "Api-Token ${DYNATRACE_API_TOKEN}"
processors:
  batch/trace:
  memory_limiter/trace:
    limit_mib: 1500
    spike_limit_mib: 512
    check_interval: 5s
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: slow_calls
        type: latency
        latency:
          threshold_ms: 5000
      - name: error_otel_status
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: error_http_status
        type: numeric_attribute
        numeric_attribute:
            key: http.status_code
            min_value: 500
      - name: keep_sampled_success
        type: and
        and:
          and_sub_policy:
            - name: drop_noisy_traces_url
              type: string_attribute
              string_attribute:
                key: http.url
                values:
                  - \/metrics
                  - opentelemetry\.proto
                  - favicon\.ico
                  - \/health
                enabled_regex_matching: true
                invert_match: true
            - name: keep_percentage
              type: probabilistic
              probabilistic:
                sampling_percentage: 10
service:
  extensions:
    - memory_ballast
  pipelines:
    traces:
      receivers:
        - otlp
      processors:
        - memory_limiter/trace
        - tail_sampling
        - batch/trace
      exporters:
        - otlphttp/dynatrace

Environment

OS: Ubuntu 18.04.6 (the Kubernetes node OS)
Docker container: otel/opentelemetry-collector-contrib 0.48.0

Additional context

There really isn't any good mechanism for debugging the processor. I did turn the logging up to debug level to see what got logged, but all I saw were counts of trace spans getting through. I did see the string_attribute policy executing, but it didn't seem to make any difference.
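
For anyone reproducing this: debug logging is turned on through the collector's own telemetry settings; a minimal sketch (standard collector configuration, nothing specific to this issue):

service:
  telemetry:
    logs:
      level: debug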

Here are the logs from one of the /metrics calls that got through. Nothing else at all was running through the system (it's a standalone POC), so I know this is the /metrics trace. I can see it passes through latency.go, status_code.go, string_tag_filter.go, then probabilistic.go, and finally through otlphttpexporter/otlp.go to get to Dynatrace. To be honest, I didn't expect to see it make it through string_tag_filter.go.

2022-04-26T23:17:48.552Z	debug	[email protected]/processor.go:160	Sampling Policy Evaluation ticked	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces"}
2022-04-26T23:17:48.552Z	debug	sampling/latency.go:39	Evaluating spans in latency filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces"}
2022-04-26T23:17:48.552Z	debug	sampling/status_code.go:62	Evaluating spans in status code filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces"}
2022-04-26T23:17:48.552Z	debug	sampling/string_tag_filter.go:103	Evaluting spans in string-tag filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces"}
2022-04-26T23:17:48.552Z	debug	sampling/probabilistic.go:55	Evaluating spans in probabilistic filter	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces"}
2022-04-26T23:17:48.552Z	debug	[email protected]/processor.go:198	Sampling policy evaluation completed	{"kind": "processor", "name": "tail_sampling", "pipeline": "traces", "batch.len": 1, "sampled": 4, "notSampled": 0, "droppedPriorToEvaluation": 0, "policyEvaluationErrors": 0}
2022-04-26T23:17:48.634Z	debug	otlphttpexporter/otlp.go:127	Preparing to make HTTP request	{"kind": "exporter", "name": "otlphttp/dynatrace", "url": "https://bmb58337.live.dynatrace.com/api/v2/otlp/v1/traces"}
2022-04-26T23:17:49.540Z	debug	memorylimiterprocessor/memorylimiter.go:288	Currently used memory.	{"kind": "processor", "name": "memory_limiter/trace", "pipeline": "traces", "cur_mem_mib": 13}

I do see the example configurations for the tailsamplingprocessor use a mix of YAML block style and a JSON-looking flow style; I tested the policy both ways in case that mattered, and it made no difference.
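
For illustration (this is just YAML, not anything specific to the processor), the two notations are equivalent:

# Block style
string_attribute:
  key: http.url
  values:
    - \/metrics
  enabled_regex_matching: true

# Flow style, which looks like JSON
string_attribute: { key: http.url, values: [\/metrics], enabled_regex_matching: true }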

I tried reducing the number of values to be only \/metrics but that didn't stop it from getting through, either. I tried switching to use the http.target attribute (since that's also present) but it didn't change anything.

I could be misunderstanding how the and policy works, so let me know if I am. Perhaps there's some other way to configure effectively the same thing, like with the composite policy, but I'm not entirely sure how.

tillig added the bug (Something isn't working) label on Apr 26, 2022
@jpkrohling (Member) commented:
@mottibec, I'm not able to take a look at this for a couple of months; would you be able to take a look?

@mottibec (Contributor) commented:
Hey @tillig, unfortunately the and policy cannot support the invert match at the moment. As a quick workaround, you could implement the same functionality using a negative-match regex; for example, the rule for metrics would change to something like ^(?!metrics).*$

@tillig (Contributor, Author) commented Apr 27, 2022

OOOOOH, that makes sense then. It might be good to include that in the README somewhere; it had me totally stumped. I'll switch to this and see how it goes. Thanks!

@tillig (Contributor, Author) commented Apr 27, 2022

It appears negative lookahead is not supported, either. I tried using (?!\/metrics) as the expression (to ensure it's /metrics and not /some-metrics-thing). The pod won't start and the log says:

panic: regexp: Compile(`(?!\/metrics)`): error parsing regexp: invalid or unsupported Perl syntax: `(?!`

Looks like I may be stuck until the and policy supports the inverted match, unless there's something I'm misunderstanding about how to make this regex compile. For example, I've seen that you have to write a literal $ as $$ in some places because $ is a meta-character in the collector config (see the note after the config below). Is there something I'm not escaping correctly?

        - name: keep_sampled_success
          type: and
          and:
            and_sub_policy:
              - name: drop_noisy_traces_url
                type: string_attribute
                string_attribute:
                  key: http.url
                  values:
                    - (?!\/metrics)
                  enabled_regex_matching: true
              - name: keep_percentage
                type: probabilistic
                probabilistic:
                  sampling_percentage: 10
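
On the $$ question specifically: the collector expands $ for environment variables in its configuration, so a literal dollar sign in a regex (for example an end-of-string anchor) has to be written as $$. A small sketch of what that would look like here:

string_attribute:
  key: http.url
  values:
    # "$" triggers environment variable expansion in collector config,
    # so a literal end-of-string anchor is written "$$".
    - \/metrics$$
  enabled_regex_matching: true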

@mottibec (Contributor) commented May 2, 2022

Sadly, the Go regex engine does not support negative lookahead.
I will try to get to this next week, but if you want to work on it yourself, you are very welcome to submit a PR.

😄
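
As a standalone illustration (plain Go, not collector code) of the RE2 limitation and of the match-then-negate approach that an invert option effectively provides:

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Go's regexp package uses RE2 syntax, which rejects lookaround constructs.
	if _, err := regexp.Compile(`(?!\/metrics)`); err != nil {
		fmt.Println("compile failed:", err)
		// prints: compile failed: error parsing regexp: invalid or unsupported Perl syntax: `(?!`
	}

	// RE2-friendly alternative: match positively, then negate the result in code,
	// which is what an invert option in the policy would do for you.
	metrics := regexp.MustCompile(`/metrics`)
	fmt.Println(!metrics.MatchString("/api/orders")) // true  -> keep
	fmt.Println(!metrics.MatchString("/metrics"))    // false -> drop
}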

@tillig (Contributor, Author) commented May 6, 2022

@mottibec I put in PR #9768 to add invert support. It... seemed really simple, like, one-liner simple. So... yay! 🎉
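
For context, a very rough Go sketch of the idea behind the fix (hypothetical type and names, not the actual diff from #9768): an and evaluator has to treat an inverted "drop" decision from any sub-policy as a veto, the same way it treats a plain NotSampled:

package main

import "fmt"

// Decision is a hypothetical stand-in for the sampling package's decision type.
type Decision int

const (
	NotSampled Decision = iota
	Sampled
	InvertNotSampled // an inverted match said "drop"
	InvertSampled    // an inverted match said "keep"
)

// evaluateAnd keeps a trace only if no sub-policy votes to drop it.
func evaluateAnd(subDecisions []Decision) Decision {
	for _, d := range subDecisions {
		// Both the plain and the inverted "drop" outcomes must veto the AND.
		if d == NotSampled || d == InvertNotSampled {
			return NotSampled
		}
	}
	return Sampled
}

func main() {
	// string_attribute with invert_match says drop, probabilistic says keep -> drop overall.
	fmt.Println(evaluateAnd([]Decision{InvertNotSampled, Sampled})) // 0 (NotSampled)
}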

tillig added a commit to tillig/opentelemetry-collector-contrib that referenced this issue May 26, 2022
djaglowski pushed a commit that referenced this issue May 26, 2022
…ch (#9768)

* [processor/tailsamplingprocessor] and support for string invert

* Add #9553 to changelog.