
Better Tracing #2032

Open
balegas opened this issue Nov 22, 2024 · 4 comments

Comments

balegas (Contributor) commented Nov 22, 2024

We have set up basic tracing for Electric, but we want to continue improving it as we become more successful at investigating issues.

  • We're following our intuition a bit, and that will continue. Having real traces from issues we need to investigate helps us focus on the areas that need more observability
  • In some cases we're abusing the use of spans, which might lead to high volumes of traces (these are billed). We should try to replace some spans with attributes and reserve the finer details for a TRACE mode that we can activate to increase the level of detail when investigating incidents
balegas added this to the Production Readiness milestone Nov 22, 2024
balegas (Contributor, Author) commented Nov 24, 2024

note #1738

balegas (Contributor, Author) commented Nov 25, 2024

> In some cases we're abusing the use of spans, which might lead to high volumes of traces (these are billed). We should try to replace some spans with ...

[Screenshot, 2024-11-25: an example trace showing this level of span detail]

We want to be able to turn on this level of detail when we need to investigate an incident, but during normal execution we should zoom out on some of the details and move them into attributes. I'm not sure if there is an OTEL pattern for doing that; otherwise we could put these spans behind a flag.
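
One way to do the flag idea (a minimal sketch: the module, flag name, and helper are hypothetical, and the `OpenTelemetry.Tracer` calls are from the standard Elixir `opentelemetry_api` package rather than Electric's wrapper):

```ex
defmodule Electric.Telemetry.MaybeSpan do
  # Hypothetical helper: emit a fine-grained span only when a debug
  # flag is set; otherwise skip the span and attach its attributes to
  # the currently active (coarser) span instead.
  require OpenTelemetry.Tracer

  def maybe_with_span(name, attributes, fun) do
    if Application.get_env(:electric, :otel_debug_spans?, false) do
      # TRACE mode: full detail, as today.
      OpenTelemetry.Tracer.with_span name, %{attributes: attributes} do
        fun.()
      end
    else
      # Normal mode: zoomed out, with the detail preserved as
      # attributes on the enclosing span.
      OpenTelemetry.Tracer.set_attributes(attributes)
      fun.()
    end
  end
end
```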

For the top-level number of transactions we handle, the developer can set up sampling.

robacourt (Contributor) commented
> We should try to replace some spans with attributes and reserve the finer details for a TRACE mode that we can activate to increase the level of detail when investigating incidents

Beware of making the code complicated in order to achieve this!

A great way to reduce the number of traces is sampling: https://opentelemetry.io/docs/concepts/sampling/. You can reduce traffic 10- to 100-fold without losing detail.
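
For example, head sampling can be configured with the SDK's built-in samplers (a sketch; the 0.1 ratio is illustrative):

```ex
# config/runtime.exs (sketch): keep ~10% of traces. parent_based makes
# child spans follow their parent's decision, so the traces that are
# kept stay complete end-to-end.
config :opentelemetry,
  sampler: {:parent_based, %{root: {:trace_id_ratio_based, 0.1}}}
```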

kevin-dp added a commit that referenced this issue Dec 2, 2024
Part of #2032.
I removed 2 spans that seem redundant:

- `shape_write.log_collector.handle_txn`: this span wraps the
`handle_transaction` function. However, there is already a span
`pg_txn.replication_client.transaction_received` that in fact calls into
`handle_transaction`.
- `shape_write.log_collector.handle_relation`: this span wraps the
handling of relation messages. Similarly to transactions, there is a
`pg_txn.replication_client.relation_received` span that ends up calling
into the `handle_relation` function.

Here are the relevant code snippets:

`Electric.Postgres.ReplicationClient`:
```ex
{m, f, args} = state.transaction_received

OpenTelemetry.with_span(
  "pg_txn.replication_client.transaction_received",
  [num_changes: length(txn.changes), num_relations: MapSet.size(txn.affected_relations)],
  fn -> apply(m, f, [txn | args]) end
)
```

The call to `apply` is a chain of calls that eventually ends up calling
`handle_transaction` (and does nothing more):

`Electric.StackSupervisor`:
```ex
transaction_received: {Electric.Replication.ShapeLogCollector, :store_transaction, [shape_log_collector]}
```

`Electric.Replication.ShapeLogCollector`:
```ex
def store_transaction(%Transaction{} = txn, server) do
  ot_span_ctx = OpenTelemetry.get_current_context()
  GenStage.call(server, {:new_txn, txn, ot_span_ctx}, :infinity)
end

def handle_call({:new_txn, %Transaction{xid: xid, lsn: lsn} = txn, ot_span_ctx}, from, state) do
  OpenTelemetry.set_current_context(ot_span_ctx)
  Logger.info("Received transaction #{xid} from Postgres at #{lsn}")
  Logger.debug(fn -> "Txn received in ShapeLogCollector: #{inspect(txn)}" end)

  OpenTelemetry.with_span("shape_write.log_collector.handle_txn", [], fn ->
    handle_transaction(txn, from, state) 
  end)
end
```

### Question

I removed the spans from `Electric.Replication.ShapeLogCollector`; another option would be to remove the spans from `Electric.Postgres.ReplicationClient` instead. Any preference here?
kevin-dp added a commit that referenced this issue Dec 2, 2024
Part of #2032.

We noticed that most spans are descendants of the
`pg_txn.replication_client.process_x_log_data` root span.
Therefore, we decided to only sample a portion of those spans.

This PR introduces a custom sampler (sketched after this list) that works as follows:
- Samples a configurable ratio of
`pg_txn.replication_client.process_x_log_data` root spans
- Samples all other root spans
- Child spans are sampled if their parent is sampled
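
For reference, a minimal sketch of what such a sampler can look like using the opentelemetry-erlang `:otel_sampler` behaviour (the module name, option names, and the binary span-name match are illustrative, not the actual PR code):

```ex
defmodule Electric.Telemetry.RatioRootSampler do
  @behaviour :otel_sampler

  # Assumes span names arrive as binaries, as in the snippets above.
  @hot_span "pg_txn.replication_client.process_x_log_data"

  @impl :otel_sampler
  def setup(%{ratio: ratio}) do
    # Reuse the built-in trace-id ratio sampler for the hot root span.
    %{ratio_config: :otel_sampler_trace_id_ratio_based.setup(ratio)}
  end

  @impl :otel_sampler
  def description(_config), do: "RatioRootSampler"

  @impl :otel_sampler
  def should_sample(ctx, trace_id, links, @hot_span = name, kind, attrs, config) do
    # Ratio-sample the high-volume root span.
    :otel_sampler_trace_id_ratio_based.should_sample(
      ctx, trace_id, links, name, kind, attrs, config.ratio_config
    )
  end

  def should_sample(ctx, trace_id, links, name, kind, attrs, _config) do
    # Keep every other root span.
    :otel_sampler_always_on.should_sample(ctx, trace_id, links, name, kind, attrs, [])
  end
end
```

Wrapping it in the built-in `parent_based` sampler gives the "child spans are sampled if their parent is sampled" behavior:

```ex
config :opentelemetry,
  sampler: {:parent_based, %{root: {Electric.Telemetry.RatioRootSampler, %{ratio: 0.01}}}}
```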

### Problem

We would like to sample all errors. To do that, we need to make the sampling decision at the end of the span, when all attributes and events are available, because errors are recorded as an "exception" event on the span; this is known as tail sampling. However, the Erlang opentelemetry library only seems to support head sampling:

> Sampling is performed at span creation time by the Sampler configured on the Tracer

cf. https://docs.honeycomb.io/manage-data-volume/sample/techniques/ if you're not familiar with head vs tail sampling.

EDIT: solving this problem may require using Honeycomb's "Refinery" mechanism: Electric would sample all traces and we would set up Refinery with custom tail-sampling logic. However, that requires extra infrastructure.
KyleAMathews (Contributor) commented
@balegas more work needed here?
