
failed to send RPC: sending to all replicas failed; last error: unable to dial n6: breaker open #68489

Closed
nick-jones opened this issue Aug 5, 2021 · 4 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-community Originated from the community T-kv KV Team X-blathers-triaged blathers was able to find an owner

Comments


nick-jones commented Aug 5, 2021

Describe the problem

We have a CockroachDB cluster that contains 9 nodes. Those nodes are divided across 3 different Kubernetes clusters (3 nodes per Kubernetes cluster). For simplicity's sake I'll refer to these Kubernetes clusters as "cluster A", "cluster B" and "cluster C".

Today we had an incident that resulted in Kubernetes cluster C having an outage. The problems were down to network issues, and requests going to that cluster likely ranged from slow to outright failing.

So we were left with 6 healthy nodes across Kubernetes clusters A and B, and they stayed that way for a short while. At around 13:22 UTC we lost 1 node in cluster A and 2 nodes in cluster B. This time it wasn't due to any apparent issues in those Kubernetes clusters; the nodes moved into this state themselves. Their readiness probes started failing at that moment, and the CockroachDB admin UI showed them as dead. The consequence of those nodes going down was that the entire CockroachDB cluster became unavailable, as we only had 3 of 9 nodes live (or at least only that many in a stable and healthy position).

Logs at the time one of the nodes died: https://gist.github.com/nick-jones/72cd6d7fd8e8dc0ac2e97abada47597e

One thing that caught my eye in particular was...

failed to send RPC: sending to all replicas failed; last error: unable to dial n6: breaker open

n6 is one of the nodes running in cluster C, which was potentially unreachable at the time.

The 3 CockroachDB nodes that died all logged this message; the other 3 nodes that remained up did not.
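
For completeness, liveness at the time was observed via the admin UI and the readiness probes; a roughly equivalent check from the CLI on one of the surviving nodes would be something like the following (the certs path and host below are assumptions and will vary by deployment):

  # ask the cluster which nodes it currently considers live/available
  # (certs dir and host are placeholders for whatever the deployment uses)
  cockroach node status --certs-dir=/cockroach/cockroach-certs --host=localhost:26257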

To Reproduce

We're going to try to reproduce this in one of our lower environments, but the circumstances we dealt with today may be difficult to recreate.

Expected behavior

A healthy, available cluster: the 6 nodes outside the affected Kubernetes cluster staying up rather than 3 of them also going down.

Additional data / screenshots

(Screenshot attached: 2021-08-09 at 11:46:46)

Environment:

  • CockroachDB version: v21.1.6
  • Server OS: Linux

Additional context

@nick-jones nick-jones added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 5, 2021

blathers-crl bot commented Aug 5, 2021

Hello, I am Blathers. I am here to help you get the issue triaged.

Hoot - a bug! Though bugs are the bane of my existence, rest assured the wretched thing will get the best of care here.

I have CC'd a few people who may be able to assist you:

  • @cockroachdb/sql-observability (found keywords: admin UI)
  • @cockroachdb/kv (found keywords: kv,Raft,liveness)

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Aug 5, 2021
knz (Contributor) commented Aug 5, 2021

@nvanbenschoten do you think this is related to #68419?

@blathers-crl blathers-crl bot added the T-kv KV Team label Aug 5, 2021
nick-jones (Author) commented Aug 9, 2021

We tried to replicate this issue in one of our lower environments today but unfortunately were unable to do so. I've updated the ticket description with a link to more complete logs from last Thursday in case that is useful.

lunevalex (Collaborator) commented

@nick-jones we took a look at the logs and did not see anything conclusive that pointed to a problem. It's not clear what we could do next with such a limited set of information. If you see this problem again, it would help to pull a debug.zip and share it with us (if you are comfortable doing so), so we can take a deeper look into the problem. I am going to close this ticket for now, but please let us know if this re-occurs and we will take another look.
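
For reference, a rough sketch of gathering a debug.zip from a Kubernetes-based deployment like this one (the pod name, namespace, binary path and certs directory below are placeholders and will differ depending on the setup):

  # generate the debug bundle inside one of the CockroachDB pods
  # (pod, namespace and paths are assumptions; adjust to the actual deployment)
  kubectl exec -n crdb cockroachdb-0 -- \
    /cockroach/cockroach debug zip /tmp/debug.zip --certs-dir=/cockroach/cockroach-certs

  # copy the archive out of the pod so it can be shared
  kubectl cp crdb/cockroachdb-0:/tmp/debug.zip ./debug.zip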
