Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore (sync service): remove redundant spans #2037

Merged
merged 2 commits into from
Dec 2, 2024

Conversation

kevin-dp
Copy link
Contributor

@kevin-dp kevin-dp commented Nov 25, 2024

Part of #2032.
I removed 2 spans that seem redundant:

  • shape_write.log_collector.handle_txn: this span wraps the handle_transaction function. However, there is already a span pg_txn.replication_client.transaction_received that in fact calls into handle_transaction.
  • shape_write.log_collector.handle_relation: this span wraps the handling of relation messages. Similarly to transactions, there is a pg_txn.replication_client.relation_received span that ends up calling into the handle_relation function.

Here are the relevant code snippets:

Electric.Postgres.ReplicationClient:

{m, f, args} = state.transaction_received

OpenTelemetry.with_span(
  "pg_txn.replication_client.transaction_received",
  [num_changes: length(txn.changes), num_relations: MapSet.size(txn.affected_relations)],
  fn -> apply(m, f, [txn | args]) end
)

The call to apply is a chain of calls that eventually ends up calling handle_transaction (and does nothing more):

Electric.StackSupervisor:

transaction_received: {Electric.Replication.ShapeLogCollector, :store_transaction, [shape_log_collector]}

Electric.Replication.ShapeLogCollector.ex:

def store_transaction(%Transaction{} = txn, server) do
  ot_span_ctx = OpenTelemetry.get_current_context()
  GenStage.call(server, {:new_txn, txn, ot_span_ctx}, :infinity)
end

def handle_call({:new_txn, %Transaction{xid: xid, lsn: lsn} = txn, ot_span_ctx}, from, state) do
  OpenTelemetry.set_current_context(ot_span_ctx)
  Logger.info("Received transaction #{xid} from Postgres at #{lsn}")
  Logger.debug(fn -> "Txn received in ShapeLogCollector: #{inspect(txn)}" end)

  OpenTelemetry.with_span("shape_write.log_collector.handle_txn", [], fn ->
    handle_transaction(txn, from, state) 
  end)
end

Question

I removed the spans from Electric.Replication.ShapeLogCollector, another option would be to remove the spans from Electric.Postgres.ReplicationClient, any preference here?

@kevin-dp kevin-dp requested a review from alco November 25, 2024 15:47
@alco
Copy link
Member

alco commented Nov 27, 2024

I wouldn't call those spans redundant because they cover code executed in different processes.

My original idea with these was to have a way to tell if there's any significant delay between receiving a transaction in the replication client process and storing it in shape log collectors. Given our architecture of running a separate log collector for each shape, a single transaction may be stored N times and the time profile of that could like like N parallel child spans or a waterfall of N spans. What I mean is that looking at a trace could lend us insights into potential inefficiencies in the transaction processing architecture.

I haven't spent much time testing this, though. If removing those spans still maintains a hierarchical relation between the root pg_txn.replication_client.transaction_received span and the leaf shape_write.consumer.handle_txns spans started in each shape consumer, then it's fine to remove this intermediate shape_write.log_collector.handle_txn span.

@alco
Copy link
Member

alco commented Dec 2, 2024

Nevermind my above comment. I've double-checked that the shape_write.consumer.handle_txns span doesn't have a hierarchical relation with the pg_txn.replication_client.transaction_received span. So removing the two intermediate spans here doesn't make anything worse in terms of observability for us.

Copy link
Member

@alco alco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@balegas
Copy link
Contributor

balegas commented Dec 2, 2024

My original idea with these was to have a way to tell if there's any significant delay between receiving a transaction in the replication client process and storing it in shape log collectors.

I think it's important to not lose visibility about scheduling delays, but would we be able to track the time without adding the span? For example, could we update metrics of p.50, .p99 latency on the parent span? i.e. update the latency attributes in transaction_received when the child process starts.

For context, it's almost half of the spans if we handle the top two.
Screenshot 2024-12-02 at 10 41 42

@kevin-dp
Copy link
Contributor Author

kevin-dp commented Dec 2, 2024

My original idea with these was to have a way to tell if there's any significant delay between receiving a transaction in the replication client process and storing it in shape log collectors.

I think it's important to not lose visibility about scheduling delays, but would we be able to track the time without adding the span? For example, could we update metrics of p.50, .p99 latency on the parent span? i.e. update the latency attributes in transaction_received when the child process starts.

For context, it's almost half of the spans if we handle the top two. Screenshot 2024-12-02 at 10 41 42

@balegas
This PR removes redundant spans. Don't think we have to bubble up anything here.
I guess your comment was meant for #2054 ?

@balegas
Copy link
Contributor

balegas commented Dec 2, 2024

This PR removes redundant spans. Don't think we have to bubble up anything here.
I guess your comment was meant for #2054

I don't think #2054 does what I'm asking, but you can clear it out to me. In #2054 you're defining a sampling strategy for root spans, which will apply for descendant spans. Which means that if we sample process_x_log_data we also sample decode_message, which is still double of the spans we need.

I think decode_message time should always coalesced into parents span

@kevin-dp
Copy link
Contributor Author

kevin-dp commented Dec 2, 2024

This PR removes redundant spans. Don't think we have to bubble up anything here.
I guess your comment was meant for #2054

I don't think #2054 does what I'm asking, but you can clear it out to me. In #2054 you're defining a sampling strategy for root spans, which will apply for descendant spans. Which means that if we sample process_x_log_data we also sample decode_message, which is still double of the spans we need.

I think decode_message time should always coalesced into parents span

Sure, we can do that in a separate PR

@kevin-dp kevin-dp merged commit 0dc844f into main Dec 2, 2024
26 checks passed
@kevin-dp kevin-dp deleted the kevin/rmv-redundant-spans branch December 2, 2024 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants