electric crashing/restarting - possibly due to consumer with an array where filter #1911

Closed
ericallam opened this issue Oct 31, 2024 · 8 comments

Comments

@ericallam commented Oct 31, 2024

Our 0.7.5 Electric server is crashing periodically, and I think it's related to a consumer with an array where clause: the crash seems to happen when that consumer encounters a change where the column is nil. Here are some logs showing what's happening:

23:03:47.770 [info] Received transaction 2755874852 from Postgres at 36E/8CAF49B0
23:03:47.770 [debug] Txn received in ShapeLogCollector: %Electric.Replication.Changes.Transaction{xid: 2755874852, commit_timestamp: ~U[2024-10-31 13:32:40.883363Z], lsn: #Lsn<36E/8CAF49B0>, last_log_offset: LogOffset.new(3773341583792, 0), changes: [%Electric.Replication.Changes.UpdatedRecord{relation: {"public", "TaskRun"}, old_record: %{}
23:03:47.773 [error] :gen_statem #PID<0.1224622.0> terminating
** (DBConnection.ConnectionError) ssl send: closed
    (stdlib 6.0.1) gen_statem.erl:3242: :gen_statem.loop_state_callback_result/11
    (stdlib 6.0.1) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Queue: [info: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.5282>, :tls_connection, :undefined}, [#PID<0.1222172.0>, #PID<0.1222144.0>]}, <<100, 0, 0, 13, 1, 119, 0, 0, 3, 110, 140, 11, 97, 160, 0, 0, 3, 110, 140, 11, 97, 160, 0, 2, 200, 204, 147, 101, 186, 240, 85, 0, 125, 140, 1, 79, 0, 52, 116, 0, 0, 0, 25, 99, 109, 49, ...>>}]
State: {:no_state, %Postgrex.ReplicationConnection{protocol: %Postgrex.Protocol{sock: {:ssl, {:sslsocket, {:gen_tcp, #Port<0.5282>, :tls_connection, :undefined}, [#PID<0.1222172.0>, #PID<0.1222144.0>]}}, connection_id: 2405, connection_key: 1939667444, peer: {{172, 16, 14, 132}, 5432}, types: {Postgrex.DefaultTypes, #Reference<0.3188593347.4200988677.234352>}, null: nil, timeout: 15000, ping_timeout: 15000, parameters: #Reference<0.3188593348.406061057.219093>, queries: #Reference<0.3188593348.406192129.219089>, postgres: :idle, transactions: :naive, buffer: :active_once, disconnect_on_error_codes: [], scram: %{iterations: 4096, auth_message: "n=,r=XFGH5yY3GLJoMCDm2/6tT3w6,r=XFGH5yY3GLJoMCDm2/6tT3w6//Wbfl3Fm5y9TyqtJoXpfMZe,s=7QZ2YWuj3HLZ02xJ27ECDw==,i=4096,c=biws,r=XFGH5yY3GLJoMCDm2/6tT3w6//Wbfl3Fm5y9TyqtJoXpfMZe", salt: <<237, 6, 118, 97, 107, 163, 220, 114, 217, 211, 108, 73, 219, 177, 2, 15>>}, disable_composite_types: false, messages: []}, state: {Electric.Postgres.ReplicationClient, %Electric.Postgres.ReplicationClient.State{transaction_received: {Electric.Replication.ShapeLogCollector, :store_transaction, [{:via, Registry, {Electric.Registry.Processes, {Electric.Replication.ShapeLogCollector, :default}}}]}, relation_received: {Electric.Replication.ShapeLogCollector, :handle_relation_msg, [{:via, Registry, {Electric.Registry.Processes, {Electric.Replication.ShapeLogCollector, :default}}}]}, publication_name: "electric_publication_trigger", try_creating_publication?: true, start_streaming?: false, slot_name: "electric_slot_trigger", display_settings: [], origin: "postgres", txn_collector: %Electric.Postgres.ReplicationClient.Collector{transaction: nil, tx_op_index: nil, relations: %{8227841 => %Electric.Postgres.LogicalReplication.Messages.Relation{id: 8227841, namespace: "public", name: "TaskRun", replica_identity: :all_columns, columns: [%Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "id", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "idempotencyKey", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "payload", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "payloadType", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "context", type_oid: 3802, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "runtimeEnvironmentId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "projectId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "createdAt", type_oid: 1114, type_modifier: 3}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "updatedAt", type_oid: 1114, type_modifier: 3}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "taskIdentifier", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "lockedAt", type_oid: 1114, type_modifier: 3}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "lockedById", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: 
"friendlyId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "lockedToVersionId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "traceContext", type_oid: 3802, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "spanId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "traceId", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "concurrencyKey", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "queue", type_oid: 25, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "number", type_oid: 23, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "isTest", type_oid: 16, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "status", type_oid: 11694477, type_modifier: -1}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "scheduleId", type_oid: 25, ...}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [:key], name: "scheduleInstanceId", ...}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{flags: [...], ...}, %Electric.Postgres.LogicalReplication.Messages.Relation.Column{...}, ...]}}}, step: :streaming, applied_wal: 3773341583840}}, auto_reconnect: false, reconnect_backoff: 500, streaming: 500}}
Callback mode: &Postgrex.ReplicationConnection.handle_event/4, state_enter: false
23:03:47.774 [notice] GenStage consumer #PID<0.1221695.0> is stopping after receiving cancel from producer #PID<0.1219622.0> with reason: :shutdown
... many more of these
23:03:47.782 [debug] Starting Elixir.Electric.Shapes.ConsumerSupervisor
23:03:48.976 [debug] Starting consumer for 100199674-1730244984831
23:03:48.987 [debug] 1 consumers of replication stream
... many more of these
@KyleAMathews (Contributor)

I was helping @ericallam look at this. I found it odd too that all the consumers failed when it seems just one of them errored. This bug is stopping all shapes from updating. Even if there's a bug w/ where clause matching, ideally it'd just affect the one shape.

@ericallam (Author)

The array where clause looks like this:

WHERE "runTags" @> ARRAY['foobar']

@ericallam (Author)

Just tried to recreate this issue locally and I'm not able to. Looking back at the logs above, I wonder if the following line is the cause, or just a consequence of the issue:

** (DBConnection.ConnectionError) ssl send: closed

@balegas (Contributor) commented Nov 1, 2024

Even if there's a bug w/ where clause matching, ideally it'd just affect the one shape.

If one consumer fails, we don't move forward with the other consumers, to ensure the transaction is applied across all shape logs. It may not be necessary to restart all consumers, but we don't allow the replication stream to advance.

From the logs, it seems that the consumers have shut down and are trying to re-establish the connection. It also looks like they are not succeeding because the database connection is gone.

We'll try simulating some crash scenarios to see if we can reproduce the error.

@KyleAMathews (Contributor)

If one consumer fails, we don't move forward with the other consumers, to ensure the transaction is applied across all shape logs. It may not be necessary to restart all consumers, but we don't allow the replication stream to advance.

Why would we want that? I'm struggling to think of a scenario where this would be a good thing...

If a shape consumer fails, my assumption is there's a bug of some sort. If we have 99 good shapes and 1 bad shape, why not just kill the bad shape and let the others continue to update?

The problem here is that there's nothing Eric or any other Electric user can do to resolve an issue with a shape consumer. There are no alerts, and there's no way to remediate it. Eric only heard about the problem because a customer complained. One bug broke every other unrelated user on the system, and then there was nothing he could do other than kill the server.

How I imagine this would work is that a failing shape consumer would spit out a big warning with diagnostic info and then be removed. If the same shape definition is then recreated and fails, say, 2 more times, we'd permanently block it from being created again until the bug is fixed and the server has been upgraded.

We still don't know what happened here, of course, so I'm not saying the above is the fix for this exact issue. But in general, we want graceful degradation, keeping as many things going as fast as possible, rather than halt-the-world approaches, which are far more disruptive.

@ericallam (Author)

@KyleAMathews @balegas One other consideration is that, from our perspective, pretty much the worst thing the Electric server could do is fail to keep up with the WAL and let the replication slot get behind. Eventually (in a few hours) our entire database will blow up, because we run a sequential scan fairly frequently (yay queuing in Postgres!), and sequential scan + xmin horizon lag + dead tuples not being vacuumed = database go 💥
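
A quick way to watch for this (a standard Postgres catalog query, not an Electric feature) is to check how much WAL each replication slot is holding back:

-- Retained WAL per replication slot (Postgres 10+ function names).
SELECT
  slot_name,
  active,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

If retained_wal keeps growing for the Electric slot, that slot is falling behind and is a likely culprit for the vacuum and xmin-horizon concerns described above.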

@balegas (Contributor) commented Nov 5, 2024

Removing the problematic shapes is a good idea; that way we preserve integrity across the healthy shapes and have a better chance of not failing overall.

Absolutely, Electric needs to be able to keep up with the load from Postgres and recover from errors without lagging for too long. Note that you can safeguard against unexpected WAL growth by tuning the Postgres WAL retention limits.
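
One concrete knob for that (a standard Postgres 13+ setting, not anything Electric-specific; the 10GB value below is only an example) is max_slot_wal_keep_size, which caps how much WAL a replication slot may retain:

-- Cap WAL retained by replication slots; beyond this the slot is invalidated
-- instead of letting WAL fill the disk. Applied with a config reload.
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

The trade-off is that once the cap is hit the slot is invalidated and Electric would have to re-sync its shapes from scratch, so this protects the database rather than the replication stream.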

@balegas (Contributor) commented Nov 29, 2024

Closing this issue, as we now prevent issues from a single shape from propagating up the supervision tree: #2009

@balegas closed this as completed Nov 29, 2024