Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: VTGate buffering treating sharded keyspace primary restart as reshard event #13854

Closed
mattlord opened this issue Aug 25, 2023 · 0 comments · Fixed by #13856
Closed

Comments

@mattlord
Copy link
Contributor

mattlord commented Aug 25, 2023

Overview of the Issue

With a sharded keyspace, when one shard's primary tablet crashes or is otherwise restarted/replaced, the vtgate buffering code treats that as a keyspace level resharding event (work for that was done in #8700).

This means that if e.g. you're restarting/replacing the primary tablet in each shard in succession, the buffering will not be unblocked until all of the shards are completed — which may then be beyond the buffer_max_failover_duration.

This results in a lot of unnecessary query delays and failures.

From reading #8700 we can see that treating the reshard event at the keyspace level was intentional. The real bug is that we're treating this particular case as a reshard event rather than a primary is not serving / primary promotion event — which is obviously per shard.

Reproduction Steps

See: https://gist.github.com/mattlord/ea49abc347b02b2ee22db4b776800a8d

Binary Version

Version: 18.0.0-SNAPSHOT (Git revision e6df5d8b3eac8b54cd36c94a079f390ef0c96338 branch 'main') built on Fri Aug 25 09:36:47 EDT 2023 by [email protected] using go1.20.6 darwin/arm64

Operating System and Environment details

N/A

Log Fragments

I0825 09:44:31.138972   11587 flags.go:187] vtgate buffer enabled. PRIMARY requests will be buffered during detected failovers.
I0825 09:44:31.139187   11587 keyspace_events.go:90] started watching keyspace events in "zone1"
I0825 09:44:31.142645   11587 keyspace_events.go:537] created dedicated watcher for keyspace zone1/commerce
I0825 09:44:31.147711   11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:44:31.147713   11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:44:31.147728   11587 shard_buffer.go:485] disruption in shard commerce/80- resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:44:31.147742   11587 shard_buffer.go:485] disruption in shard commerce/-80 resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:44:42.124610   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:44:42.124624   11587 shard_buffer.go:284] Starting buffering for shard: commerce/-80 (window: 10s, size: 1000, max failover duration: 20s) (A failover was detected by this seen error: current keyspace is being resharded.)
I0825 09:44:42.397176   11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:44:42.397180   11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:44:42.397187   11587 shard_buffer.go:485] disruption in shard commerce/80- resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:44:42.397190   11587 shard_buffer.go:485] disruption in shard commerce/-80 resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:44:42.397196   11587 shard_buffer.go:590] Stopping buffering for shard: commerce/-80 after: 0.3 seconds due to: a primary promotion has been detected. Draining 1 buffered requests now.
I0825 09:44:42.397668   11587 shard_buffer.go:616] Draining finished for shard: commerce/-80 Took: 458.209µs for: 1 requests.
I0825 09:44:52.540470   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:44:52.540481   11587 shard_buffer.go:284] Starting buffering for shard: commerce/80- (window: 10s, size: 1000, max failover duration: 20s) (A failover was detected by this seen error: current keyspace is being resharded.)
I0825 09:45:12.541600   11587 shard_buffer.go:590] Stopping buffering for shard: commerce/80- after: 20.0 seconds due to: stopping buffering because failover did not finish in time (20s). Draining 1 buffered requests now.
I0825 09:45:12.544185   11587 shard_buffer.go:616] Draining finished for shard: commerce/80- Took: 2.539666ms for: 1 requests.
I0825 09:45:12.581370   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:45:12.581486   11587 shard_buffer.go:191] FailoverTooRecent-commerce/80-: NOT starting buffering for shard: commerce/80- because the last failover which triggered buffering is too recent (39.896ms < 1m0s). (A failover was detected by this seen error: Code: CLUSTER_EVENT
I0825 09:45:12.582033   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
...
I0825 09:45:17.550773   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:45:17.598480   11587 shard_buffer.go:191] FailoverTooRecent-commerce/80-: NOT starting buffering for shard: commerce/80- because the last failover which triggered buffering is too recent (5.056904291s < 1m0s). (A failover was detected by this seen error: Code: CLUSTER_EVENT
I0825 09:45:17.598605   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
...
I0825 09:45:22.603785   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:45:22.603796   11587 shard_buffer.go:191] FailoverTooRecent-commerce/80-: NOT starting buffering for shard: commerce/80- because the last failover which triggered buffering is too recent (10.062217583s < 1m0s). (A failover was detected by this seen error: Code: CLUSTER_EVENT
I0825 09:45:22.603925   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
...
I0825 09:45:22.984544   11587 buffer.go:92] CausedByFailover signalling failover for reason: current keyspace is being resharded
I0825 09:45:25.371575   11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:45:25.371581   11587 keyspace_events.go:304] keyspace event resolved: commerce/commerce is now consistent (serving: true)
I0825 09:45:25.371599   11587 shard_buffer.go:485] disruption in shard commerce/80- resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
I0825 09:45:25.371614   11587 shard_buffer.go:485] disruption in shard commerce/-80 resolved (serving: true), movetable state discovery.MoveTablesState{Typ:0, State:0}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
1 participant