failed to send RPC: sending to all replicas failed; last error: unable to dial n6: breaker open #68489
Labels
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
O-community: Originated from the community
T-kv: KV Team
X-blathers-triaged: blathers was able to find an owner
Describe the problem
We have a CockroachDB cluster that contains 9 nodes. Those CockroachDB nodes are divided across 3 different Kubernetes clusters (3 nodes per Kubernetes cluster). For simplicity's sake I'll refer to these Kubernetes clusters as "cluster A", "cluster B" and "cluster C".
Today we had an incident that resulted in Kubernetes cluster C having an outage. The problems experienced were down to network issues, and requests going to that cluster likely ranged from being slow to outright failing.
That left us with 6 healthy nodes across Kubernetes cluster A and Kubernetes cluster B, and they stayed that way for a short while. At around 13:22 UTC we lost 1 node in Kubernetes cluster A and 2 nodes in Kubernetes cluster B. This time it wasn't due to any apparent issue in those Kubernetes clusters; rather, the nodes moved into this state by themselves. The readiness probes started failing at that moment and the CockroachDB admin UI showed the nodes as dead. As a consequence, the entire CockroachDB cluster became unavailable, as we only had 3 of 9 nodes live (or at least only that many in a stable and healthy state).
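To make the "3 of 9 nodes live" arithmetic concrete, here is a minimal sketch of the majority-quorum rule that Raft-replicated ranges rely on. It assumes a replication factor of 3 and replicas placed uniformly at random across nodes with no locality constraints; neither is confirmed for our cluster, and the function names are hypothetical, so treat the numbers as illustrative only.

```python
from itertools import combinations


def has_quorum(replicas: int, live: int) -> bool:
    """Return True if `live` replicas form a strict majority of `replicas`."""
    if not 0 <= live <= replicas:
        raise ValueError("live must be between 0 and replicas")
    return live >= replicas // 2 + 1


def surviving_fraction(nodes: int, live_nodes: int, replication_factor: int = 3) -> float:
    """Fraction of possible replica placements that still have quorum.

    ASSUMPTION: every set of `replication_factor` distinct nodes is an
    equally likely placement (no locality or zone constraints).
    """
    # Under uniform placement it does not matter which nodes are live,
    # so treat the first `live_nodes` node ids as the live ones.
    live = set(range(live_nodes))
    placements = list(combinations(range(nodes), replication_factor))
    ok = sum(
        1
        for placement in placements
        if has_quorum(replication_factor, len(live.intersection(placement)))
    )
    return ok / len(placements)


# After losing Kubernetes cluster C (6 of 9 nodes live).
print(surviving_fraction(9, 6))
# After the 3 further nodes died (3 of 9 nodes live).
print(surviving_fraction(9, 3))
```

Under these assumptions most placements keep quorum with 6 live nodes, while only a minority do with 3 live nodes, which would be consistent with the cluster as a whole becoming unavailable.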
Logs at the time one of the nodes died: https://gist.github.com/nick-jones/72cd6d7fd8e8dc0ac2e97abada47597e
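One thing that caught my eye in particular was the error quoted in the issue title:
failed to send RPC: sending to all replicas failed; last error: unable to dial n6: breaker open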
n6 is one of the nodes running in cluster C which was potentially unreachable at the time.
The 3 CockroachDB nodes that died all logged this message; the other 3 nodes that remained up did not.
To Reproduce
We're going to try to reproduce this in one of our lower environments, but the circumstances we dealt with today may be difficult to recreate.
Expected behavior
Healthy cluster
Additional data / screenshots
Environment:
Additional context