I'm indirectly using `ShardReplicationStatuses` via `BackupShard`, and I found a case where it hangs. I spun up some `rdonly` tablets temporarily, but when I deleted them, their tablet records were not removed. That left references with bad host names. I would expect the call to return an error saying it couldn't contact one of the tablets. Instead, it hangs until the context timeout hits, which is a couple of hours by default for the backup call.
The symptom looks like the actual backup is timing out, which led me to set `action_timeout` to 6 hours, and that made the problem harder to diagnose. It would have been much nicer to see an error about not being able to connect to a tablet to get the replication status. I haven't dug deeper yet into the `SlaveStatus` interface call to see where that is set.
I'm not sure if tweaking the behavior of that call is appropriate or if it will have more cascading effects elsewhere that expect it to wait forever.
Referenced code: vitess/go/vt/wrangler/reparent.go, lines 78 to 90 in 744b8e2