Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore (sync service): probablistic sampling of transaction spans #2054

Merged
merged 2 commits into from
Dec 2, 2024

Conversation

kevin-dp
Copy link
Contributor

@kevin-dp kevin-dp commented Nov 27, 2024

Part of #2032.

We noticed that most spans are descendants of the pg_txn.replication_client.process_x_log_data root span.
Therefore, we decided to only sample a portion of those spans.

This PR introduces a custom sampler that works as follows:

  • Samples a configurable ratio of pg_txn.replication_client.process_x_log_data root spans
  • Samples all other root spans
  • Child spans are sampled if their parent is sampled

Problem

We would like to sample all errors.
To do this we need to make the sampling decision at the end of the span when we have all attributes and events because errors are recorded using an "exception" event, this is known as tail sampling. However, the Erlang opentelemetry library only seems to support head sampling:

Sampling is performed at span creation time by the Sampler configured on the Tracer

cf. https://docs.honeycomb.io/manage-data-volume/sample/techniques/ if you're not familiar with head vs tail sampling.

EDIT: solving this problem may require using HoneyComb's "Refinery" mechanism. So Electric would sample all traces and we would setup Refinery with custom tail sampling logic. However, this requires extra infrastructure to set up.

@balegas
Copy link
Contributor

balegas commented Nov 27, 2024

Too bad that erlang otel doesn't support it yet. Do they have plans to support it?

@kevin-dp
Copy link
Contributor Author

kevin-dp commented Nov 27, 2024

Comments from a discussion with @alco:

Sampling is performed in :otel_span_utils:start_span() in the opentelemetry-erlang library. There's also the end_span() function to go with it. It will take a bit of hacking but I think we can make tail sampling work there.

It might also be possible to avoid modifying the library and instead manipulate span contexts in our application. This is dependent on opentelemetry-erlang's current implementation but a potential solution could look as follows:

  • write a sampler that returns the ?RECORD_ONLY constant for samples that should be dropped
  • modify our OpenTelemetry.record_exception() function to overwrite trace_flags on the current span's context and mark it as sampled regardless of the sampling decision made when the span was started

The above is based on the observation that opentelemetry-erlang only checks trace_flags when a span ends to decide whether it needs to be exported or dropped. We'd likely need to walk the span hierarchy to update trace_flags on all parent spans of the current one.

An alternative route to implementing tail sampling would be to run an OT collector as a proxy between Electric and Honeycomb: we could require the use of an OT Collector in case one wants to have advanced sampling that always exports errors. But out of the box, if you're running Electric without an OT collector, you'd only get head sampling.

Note: Honeycomb has its own application called Refinery that serves as a tail sampling proxy for traces. But it is specific to Honeycomb and doesn't have other more general functions that the OT Collector provides.

@alco
Copy link
Member

alco commented Dec 2, 2024

To add to the PR description: seeing lots of spans rooted in pg_txn.replication_client.process_x_log_data means the underlying database is experiencing a high write rate. It's perfectly fine to sample those spans since we don't have any error recording logic in them.

We only record errors for the shape_get.plug.serve_snapshot and shape_get.plug.serve_shape_log spans right now. We can use conditional sampling to only sample the spans that cover the replication stream and exclude shape serving.

@kevin-dp
Copy link
Contributor Author

kevin-dp commented Dec 2, 2024

To add to the PR description: seeing lots of spans rooted in pg_txn.replication_client.process_x_log_data means the underlying database is experiencing a high write rate. It's perfectly fine to sample those spans since we don't have any error recording logic in them.

We only record errors for the shape_get.plug.serve_snapshot and shape_get.plug.serve_shape_log spans right now. We can use conditional sampling to only sample the spans that cover the replication stream and exclude shape serving.

I'm going to merge this PR since this is already what this PR does 👍

@kevin-dp kevin-dp merged commit 1f2a5ff into main Dec 2, 2024
26 checks passed
@kevin-dp kevin-dp deleted the kevin/custom-sampler branch December 2, 2024 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants