chore (sync service): probablistic sampling of transaction spans #2054

kevin-dp · 2024-11-27T11:03:26Z

Part of #2032.

We noticed that most spans are descendants of the pg_txn.replication_client.process_x_log_data root span.
Therefore, we decided to only sample a portion of those spans.

This PR introduces a custom sampler that works as follows:

Samples a configurable ratio of pg_txn.replication_client.process_x_log_data root spans
Samples all other root spans
Child spans are sampled if their parent is sampled

Problem

We would like to sample all errors.
To do this we need to make the sampling decision at the end of the span when we have all attributes and events because errors are recorded using an "exception" event, this is known as tail sampling. However, the Erlang opentelemetry library only seems to support head sampling:

Sampling is performed at span creation time by the Sampler configured on the Tracer

cf. https://docs.honeycomb.io/manage-data-volume/sample/techniques/ if you're not familiar with head vs tail sampling.

EDIT: solving this problem may require using HoneyComb's "Refinery" mechanism. So Electric would sample all traces and we would setup Refinery with custom tail sampling logic. However, this requires extra infrastructure to set up.

balegas · 2024-11-27T11:47:44Z

Too bad that erlang otel doesn't support it yet. Do they have plans to support it?

kevin-dp · 2024-11-27T12:56:20Z

Comments from a discussion with @alco:

Sampling is performed in :otel_span_utils:start_span() in the opentelemetry-erlang library. There's also the end_span() function to go with it. It will take a bit of hacking but I think we can make tail sampling work there.

It might also be possible to avoid modifying the library and instead manipulate span contexts in our application. This is dependent on opentelemetry-erlang's current implementation but a potential solution could look as follows:

write a sampler that returns the ?RECORD_ONLY constant for samples that should be dropped
modify our OpenTelemetry.record_exception() function to overwrite trace_flags on the current span's context and mark it as sampled regardless of the sampling decision made when the span was started

The above is based on the observation that opentelemetry-erlang only checks trace_flags when a span ends to decide whether it needs to be exported or dropped. We'd likely need to walk the span hierarchy to update trace_flags on all parent spans of the current one.

An alternative route to implementing tail sampling would be to run an OT collector as a proxy between Electric and Honeycomb: we could require the use of an OT Collector in case one wants to have advanced sampling that always exports errors. But out of the box, if you're running Electric without an OT collector, you'd only get head sampling.

Note: Honeycomb has its own application called Refinery that serves as a tail sampling proxy for traces. But it is specific to Honeycomb and doesn't have other more general functions that the OT Collector provides.

alco · 2024-12-02T10:31:14Z

To add to the PR description: seeing lots of spans rooted in pg_txn.replication_client.process_x_log_data means the underlying database is experiencing a high write rate. It's perfectly fine to sample those spans since we don't have any error recording logic in them.

We only record errors for the shape_get.plug.serve_snapshot and shape_get.plug.serve_shape_log spans right now. We can use conditional sampling to only sample the spans that cover the replication stream and exclude shape serving.

kevin-dp · 2024-12-02T11:41:25Z

To add to the PR description: seeing lots of spans rooted in pg_txn.replication_client.process_x_log_data means the underlying database is experiencing a high write rate. It's perfectly fine to sample those spans since we don't have any error recording logic in them.

We only record errors for the shape_get.plug.serve_snapshot and shape_get.plug.serve_shape_log spans right now. We can use conditional sampling to only sample the spans that cover the replication stream and exclude shape serving.

I'm going to merge this PR since this is already what this PR does 👍

kevin-dp added 2 commits November 27, 2024 10:40

Use custom OTEL sampler

285e681

Probabilistic sampling of TX spans.

Loading
Loading status checks…

34c00dd

kevin-dp mentioned this pull request Dec 2, 2024

chore (sync service): remove redundant spans #2037

Merged

alco approved these changes Dec 2, 2024

View reviewed changes

kevin-dp merged commit 1f2a5ff into main Dec 2, 2024
26 checks passed

kevin-dp deleted the kevin/custom-sampler branch December 2, 2024 11:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore (sync service): probablistic sampling of transaction spans #2054

chore (sync service): probablistic sampling of transaction spans #2054

kevin-dp commented Nov 27, 2024 •

edited

Loading

balegas commented Nov 27, 2024

kevin-dp commented Nov 27, 2024 •

edited by alco

Loading

alco commented Dec 2, 2024

kevin-dp commented Dec 2, 2024

chore (sync service): probablistic sampling of transaction spans #2054

chore (sync service): probablistic sampling of transaction spans #2054

Conversation

kevin-dp commented Nov 27, 2024 • edited Loading

Problem

balegas commented Nov 27, 2024

kevin-dp commented Nov 27, 2024 • edited by alco Loading

alco commented Dec 2, 2024

kevin-dp commented Dec 2, 2024

kevin-dp commented Nov 27, 2024 •

edited

Loading

kevin-dp commented Nov 27, 2024 •

edited by alco

Loading