You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
With a sharded keyspace, when one shard's primary tablet crashes or is otherwise restarted/replaced, the vtgate buffering code treats that as a keyspace level resharding event (work for that was done in #8700).
This means that if e.g. you're restarting/replacing the primary tablet in each shard in succession, the buffering will not be unblocked until all of the shards are completed — which may then be beyond the buffer_max_failover_duration.
This results in a lot of unnecessary query delays and failures.
From reading #8700 we can see that treating the reshard event at the keyspace level was intentional. The real bug is that we're treating this particular case as a reshard event rather than a primary is not serving / primary promotion event — which is obviously per shard.
Version: 18.0.0-SNAPSHOT (Git revision e6df5d8b3eac8b54cd36c94a079f390ef0c96338 branch 'main') built on Fri Aug 25 09:36:47 EDT 2023 by [email protected] using go1.20.6 darwin/arm64
Operating System and Environment details
N/A
Log Fragments
I0825 09:44:31.138972 11587 flags.go:187] vtgate buffer enabled. PRIMARY requests will be buffered during detected failovers.
I0825 09:44:31.139187 11587 keyspace_events.go:90] started watching keyspace events in"zone1"
I0825 09:44:31.142645 11587 keyspace_events.go:537] created dedicated watcher for keyspace zone1/commerce
I0825 09:44:31.147711 11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:44:31.147713 11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:44:31.147728 11587 shard_buffer.go:485] disruption in shard commerce/80- resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:44:31.147742 11587 shard_buffer.go:485] disruption in shard commerce/-80 resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:44:42.124610 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:44:42.124624 11587 shard_buffer.go:284] Starting buffering for shard: commerce/-80 (window: 10s, size: 1000, max failover duration: 20s) (A failover was detected by this seen error: current keyspace is being resharded.)
I0825 09:44:42.397176 11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:44:42.397180 11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:44:42.397187 11587 shard_buffer.go:485] disruption in shard commerce/80- resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:44:42.397190 11587 shard_buffer.go:485] disruption in shard commerce/-80 resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:44:42.397196 11587 shard_buffer.go:590] Stopping buffering for shard: commerce/-80 after: 0.3 seconds due to: a primary promotion has been detected. Draining 1 buffered requests now.
I0825 09:44:42.397668 11587 shard_buffer.go:616] Draining finished for shard: commerce/-80 Took: 458.209µs for: 1 requests.
I0825 09:44:52.540470 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:44:52.540481 11587 shard_buffer.go:284] Starting buffering for shard: commerce/80- (window: 10s, size: 1000, max failover duration: 20s) (A failover was detected by this seen error: current keyspace is being resharded.)
I0825 09:45:12.541600 11587 shard_buffer.go:590] Stopping buffering forshard: commerce/80- after: 20.0 seconds due to: stopping buffering because failover did not finishintime (20s). Draining 1 buffered requests now.
I0825 09:45:12.544185 11587 shard_buffer.go:616] Draining finished for shard: commerce/80- Took: 2.539666ms for: 1 requests.
I0825 09:45:12.581370 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:45:12.581486 11587 shard_buffer.go:191] FailoverTooRecent-commerce/80-: NOT starting buffering for shard: commerce/80- because the last failover which triggered buffering is too recent (39.896ms < 1m0s). (A failover was detected by this seen error: Code: CLUSTER_EVENT
I0825 09:45:12.582033 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
...
I0825 09:45:17.550773 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:45:17.598480 11587 shard_buffer.go:191] FailoverTooRecent-commerce/80-: NOT starting buffering for shard: commerce/80- because the last failover which triggered buffering is too recent (5.056904291s < 1m0s). (A failover was detected by this seen error: Code: CLUSTER_EVENT
I0825 09:45:17.598605 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
...
I0825 09:45:22.603785 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:45:22.603796 11587 shard_buffer.go:191] FailoverTooRecent-commerce/80-: NOT starting buffering for shard: commerce/80- because the last failover which triggered buffering is too recent (10.062217583s < 1m0s). (A failover was detected by this seen error: Code: CLUSTER_EVENT
I0825 09:45:22.603925 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
...
I0825 09:45:22.984544 11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:45:25.371575 11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:45:25.371581 11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:45:25.371599 11587 shard_buffer.go:485] disruption in shard commerce/80- resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:45:25.371614 11587 shard_buffer.go:485] disruption in shard commerce/-80 resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
The text was updated successfully, but these errors were encountered:
Overview of the Issue
With a sharded keyspace, when one shard's primary tablet crashes or is otherwise restarted/replaced, the vtgate buffering code treats that as a keyspace level resharding event (work for that was done in #8700).
This means that if e.g. you're restarting/replacing the primary tablet in each shard in succession, the buffering will not be unblocked until all of the shards are completed — which may then be beyond the
buffer_max_failover_duration
.This results in a lot of unnecessary query delays and failures.
From reading #8700 we can see that treating the reshard event at the keyspace level was intentional. The real bug is that we're treating this particular case as a reshard event rather than a primary is not serving / primary promotion event — which is obviously per shard.
Reproduction Steps
See: https://gist.github.com/mattlord/ea49abc347b02b2ee22db4b776800a8d
Binary Version
Operating System and Environment details
Log Fragments
The text was updated successfully, but these errors were encountered: