electric crashing/restarting - possibly due to consumer with an array where filter #1911

Closed
ericallam opened this issue Oct 31, 2024 · 8 comments

Comments

@ericallam commented Oct 31, 2024

Our 0.7.5 Electric server is crashing periodically, and I think it's related to a consumer with an array where clause: the crash seems to happen when that consumer encounters a change where the column is nil. Here are some logs showing what's happening:

23:03:47.770 [info] Received transaction 2755874852 from Postgres at 36E/8CAF49B0
23:03:47.770 [debug] Txn received in ShapeLogCollector: %Electric.Replication.Changes.Transaction{xid: 2755874852, commit_timestamp: ~U[2024-10-31 13:32:40.883363Z], lsn: #Lsn<36E/8CAF49B0>, last_log_offset: LogOffset.new(3773341583792, 0), changes: [%Electric.Replication.Changes.UpdatedRecord{relation: {"public", "TaskRun"}, old_record: %{}
23:03:47.773 [error] :gen_statem #PID<0.1224622.0> terminating
** (DBConnection.ConnectionError) ssl send: closed
    (stdlib 6.0.1) gen_statem.erl:3242: :gen_statem.loop_state_callback_result/11
    (stdlib 6.0.1) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Queue: [info: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.5282>, :tls_connection, :undefined}, [#PID<0.1222172.0>, #PID<0.1222144.0>]}, <<100, 0, 0, 13, 1, 119, 0, 0, 3, 110, 140, 11, 97, 160, 0, 0, 3, 110, 140, 11, 97, 160, 0, 2, 200, 204, 147, 101, 186, 240, 85, 0, 125, 140, 1, 79, 0, 52, 116, 0, 0, 0, 25, 99, 109, 49, ...>>}]
State: {:no_state, %Postgrex.ReplicationConnection{protocol: %Postgrex.Protocol{sock: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.5282>, :tls_connection, :undefined}, [#PID<0.1222172.0>, #PID<0.1222144.0>]}}, connection_id: 2405, connection_key: 1939667444, peer: {{172, 16, 14, 132}, 5432}, types: {Postgrex.DefaultTypes, #Reference<0.3188593347.4200988677.234352>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.3188593348.406061057.219093>, queries: #Reference<0.3188593348.406192129.219089>, postgres: :idle, transactions: :naive, buffer: :active_once, disconnect_on_error_codes: [], scram: %{iterations: 4096, auth_message: "n=,r=XFGH5yY3GLJoMCDm2/6tT3w6,r=XFGH5yY3GLJoMCDm2/6tT3w6//Wbfl3Fm5y9TyqtJoXpfMZe,s=7QZ2YWuj3HLZ02xJ27ECDw==,i=4096,c=biws,r=XFGH5yY3GLJoMCDm2/6tT3w6//Wbfl3Fm5y9TyqtJoXpfMZe", salt: <<237, 6, 118, 97, 107, 163, 220, 114, 217, 211, 108, 73, 219, 177, 2, 15>>}, disable_composite_types: false, messages: []}, state: {Electric.Postgres.ReplicationClient, %Electric.Postgres.ReplicationClient.State{transaction_received: {Electric.Replication.ShapeLogCollector, :store_transaction, [{:via, Registry, {Electric.Registry.Processes, {Electric.Replication.ShapeLogCollector, :default}}}]}, relation_received: {Electric.Replication.ShapeLogCollector, :handle_relation_msg, [{:via, Registry, {Electric.Registry.Processes, {Electric.Replication.ShapeLogCollector, :default}}}]}, publication_name: "electric_publication_trigger", try_creating_publication?: true, start_streaming?: false, slot_name: "electric_slot_trigger", display_settings: [], origin: "postgres", txn_collector: %Electric.Postgres.ReplicationClient.Collector{transaction: nil, tx_op_index: nil, relations: %{8227841 => %Electric.Postgres.LogicalReplication.Messages.Relation{id: 8227841, namespace: "public", name: "TaskRun", replica_identity: :all_columns, columns: [%Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "id", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "idempotencyKey", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "payload", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "payloadType", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "context", type_oid: 3802, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "runtimeEnvironmentId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "projectId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "createdAt", type_oid: 1114, type_modifier: 3}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "updatedAt", type_oid: 1114, type_modifier: 3}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "taskIdentifier", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "lockedAt", type_oid: 1114, type_modifier: 3}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "lockedById", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: 
"friendlyId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "lockedToVersionId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "traceContext", type_oid: 3802, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "spanId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "traceId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "concurrencyKey", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "queue", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "number", type_oid: 23, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "isTest", type_oid: 16, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "status", type_oid: 11694477, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "scheduleId", type_oid: 25, ...}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "scheduleInstanceId", ...}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [...], ...}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{...}, ...]}}}, step: :streaming, applied_wal: 3773341583840}}, auto_reconnect: false, reconnect_backoff: 500, streaming: 500}}
Callback mode: &Postgrex.ReplicationConnection.handle_event/4, state_enter: false
23:03:47.774 [notice] GenStage consumer #PID<0.1221695.0> is stopping after receiving cancel from producer #PID<0.1219622.0> with reason: :shutdown
... many more of these
23:03:47.782 [debug] Starting Elixir.Electric.Shapes.ConsumerSupervisor
23:03:48.976 [debug] Starting consumer for 100199674-1730244984831
23:03:48.987 [debug] 1 consumers of replication stream
... many more of these
@KyleAMathews (Contributor)

I was helping @ericallam look at this. I found it odd too that all the consumers failed when it seems just one of them errored. This bug is stopping all shapes from updating. Even if there's a bug w/ where clause matching, ideally it'd just affect the one shape.

@ericallam (Author)

The array where clause looks like this:

WHERE "runTags" @> ARRAY['foobar']

@ericallam (Author)

Just tried to recreate this issue locally and I'm not able to. Looking back at the logs above, I wonder if the following line is the cause, or just a consequence of the issue:

** (DBConnection.ConnectionError) ssl send: closed

@balegas (Contributor) commented Nov 1, 2024

Even if there's a bug w/ where clause matching, ideally it'd just affect the one shape.

If one consumer fails, we don't move forward with the other consumers, to ensure the transaction is applied across all shape logs. It may not be necessary to restart all consumers, but we don't allow the replication stream to advance.

From the logs, it seems that the consumers have shut down and are trying to re-establish the connection. It also looks like they are not succeeding because the database connection is gone.

We'll try simulating some crash scenarios to see if we can reproduce the error.

@KyleAMathews (Contributor)

If one consumer fails, we don't move forward with the other consumers, to ensure the transaction is applied across all shape logs. It may not be necessary to restart all consumers, but we don't allow the replication stream to advance.

Why would we want that? I'm struggling to think of a scenario where this would be a good thing...

If a shape consumer fails, my assumption is there's a bug of some sort. If we have 99 good shapes and 1 bad shape, why not just kill the bad shape and let the others continue to update?

The problem here is that there's nothing Eric or any other Electric user can do to resolve an issue with a shape consumer. There are no alerts, and there's no way to remediate it. Eric only heard about the problem because a customer complained. One bug broke every other unrelated user on the system, and then there was nothing he could do other than kill the server.

How I imagine this would work is that a failing shape consumer would spit out a big warning with diagnostic info and then be removed. If the same shape definition is then recreated and fails, say, 2 more times, we'd permanently block it from being created again until the bug is fixed and the server has been upgraded.

We still don't know what happened here, of course, so I'm not saying the above is the fix for this exact issue. But in general, we want graceful degradation, keeping as many things going as fast as possible, rather than halt-the-world approaches, which are far more disruptive.

@ericallam (Author)

@KyleAMathews @balegas One other consideration is that, from our perspective, pretty much the worst thing the Electric server could do is fail to keep up with the WAL and let the replication slot get behind. Eventually (in a few hours) our entire database will blow up, because we run a sequential scan fairly frequently (yay queuing in Postgres!), and sequential scan + xmin horizon lag + dead tuples not being vacuumed = database go 💥
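
A quick way to watch for this (a standard Postgres catalog query, not an Electric feature) is to check how much WAL each replication slot is holding back:

-- Retained WAL per replication slot (Postgres 10+ function names).
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

If retained_wal keeps growing for the Electric slot, that slot is falling behind and is a likely culprit for the vacuum and xmin-horizon concerns described above.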

@balegas (Contributor) commented Nov 5, 2024

Removing the problematic shapes is a good idea; that way we preserve integrity across the healthy shapes and have a better chance of not failing overall.

Absolutely, Electric needs to be able to keep up with the load from Postgres and recover from errors without lagging for too long. Note that you can safeguard against unexpected WAL growth by tuning the Postgres WAL retention limits.
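
One concrete knob for that (a standard Postgres 13+ setting, not anything Electric-specific; the 10GB value below is only an example) is max_slot_wal_keep_size, which caps how much WAL a replication slot may retain:

-- Cap WAL retained by replication slots; beyond this the slot is invalidated
-- instead of letting WAL fill the disk. Applied with a config reload.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

The trade-off is that once the cap is hit the slot is invalidated and Electric would have to re-sync its shapes from scratch, so this protects the database rather than the replication stream.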

@balegas (Contributor) commented Nov 29, 2024

Closing this issue, as we now prevent issues from a single shape from propagating up the supervision tree: #2009

@balegas closed this as completed Nov 29, 2024