Deadlock situation when Cassandra disk is full #49

Open
etiennedi opened this issue Feb 23, 2019 · 5 comments

etiennedi (Member) commented Feb 23, 2019

Last night, the attached PVs on 3 of my 10 Cassandra pods ran completely full. Unfortunately, that left us in a bit of a deadlock situation:

  • Cassandra starts crashing when the disk is entirely full and doesn't come back up
  • We cannot add new Cassandra pods (to increase the overall disk space) because the StatefulSet is configured with the OrderedReady pod management policy. This means new pods won't be scheduled while existing pods are crashing (see the snippet below for how to confirm this).
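
For reference, this is roughly how the constraint can be confirmed. It is just a sketch; the StatefulSet name `cassandra` and the `app=cassandra` label are assumptions about our manifests:

```sh
# Show the pod management policy; "OrderedReady" means a new ordinal is not started
# while any existing pod is unready.
kubectl get statefulset cassandra -o jsonpath='{.spec.podManagementPolicy}{"\n"}'

# Confirm the stuck state: crashing pods and no new ordinals being scheduled.
kubectl get pods -l app=cassandra -o wide
```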

I think I solved this situation manually like this (the commands are sketched after the list):

  • Use the short window between container start and container crash to kubectl exec into the pod
  • Then manually delete the entire Cassandra commit log in /var/lib/cassandra/commitlog
  • Wait for the pod to come up healthy again
  • Repeat for the other affected pods until all are up again
  • Now new pods can be scheduled again
  • Hope that Cassandra redistributes the data so that usage is even across all pods again
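
Roughly what that boils down to in commands (a sketch of what I did; the pod name `cassandra-3`, the `app=cassandra` label, and the data path are specific to our setup):

```sh
# During the short window between container start and crash, exec into the affected pod.
kubectl exec -it cassandra-3 -- bash

# Inside the container: free up disk space by deleting the commit log.
# WARNING: this discards unflushed writes. Only acceptable here because the data was disposable.
rm -rf /var/lib/cassandra/commitlog/*

# Back outside the container: watch until the pod reports Ready, then repeat for the next pod.
kubectl get pods -l app=cassandra -w
```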

However, this is not something we can do in a production setting, because I literally just deleted 8GB of data per crashing pod. In real life we'd have no way of knowing which data was deleted and no way to import it again.

So I think for production we need either:

  • Very strict monitoring of the free space on the PVs (a quick manual check is sketched below), or
  • A mechanism to auto-scale the StatefulSet based on the available disk space on the attached PVs. Is there such a thing, @idcrosby?
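
For the monitoring option, even a crude per-pod check of PV usage would have caught this early. A minimal sketch, assuming the `app=cassandra` label and the /var/lib/cassandra mount path from our setup:

```sh
# Print data-volume usage for every Cassandra pod.
for pod in $(kubectl get pods -l app=cassandra -o jsonpath='{.items[*].metadata.name}'); do
  echo -n "$pod: "
  kubectl exec "$pod" -- df -h /var/lib/cassandra | tail -n 1
done
```
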
etiennedi changed the title from "Deadline situation when Cassandra disk is full" to "Deadlock situation when Cassandra disk is full" on Feb 23, 2019
etiennedi (Member, Author) commented Feb 23, 2019

A guideline is to keep the disk size per node at around 500GB (of which we can effectively use only about 250GB, since compaction needs free headroom roughly the size of the data it rewrites): https://wikitech.wikimedia.org/wiki/Cassandra/Hardware

bobvanluijt (Member) commented Feb 23, 2019 via email

etiennedi (Member, Author) commented:

New learning about scaling Cassandra clusters: when a node loses ownership of partitions (because a new node has joined), the space those partitions occupied is not freed up automatically. One has to run nodetool cleanup manually on all nodes. This process is considered so expensive that it is manual by design. In turn this means that scaling up a Cassandra cluster will always be a semi-manual process. (Which is OK, because so much can go wrong; trying to automate for every edge case would be an insane task.) A sketch of what that cleanup step could look like for us follows.
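
This is only a sketch under the assumption of the `app=cassandra` label from the earlier snippets; cleanup is I/O-heavy, so it is run one node at a time:

```sh
# After new nodes have joined and taken over token ranges, reclaim space on the existing nodes.
for pod in $(kubectl get pods -l app=cassandra -o jsonpath='{.items[*].metadata.name}'); do
  echo "Running nodetool cleanup on $pod"
  kubectl exec "$pod" -- nodetool cleanup
done
```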

idcrosby (Contributor) commented:

Interesting findings, @etiennedi. I agree with your recommendation: scaling Cassandra (or pretty much any database) should be a manual task. Beyond the complexity of auto-scaling a database, the point at which a database would trigger an autoscale is usually when it is busiest, and therefore the worst time to scale.

As you mention above, the crucial piece is having proper monitoring in place. If we know which metric to trigger on (disk space), it is straightforward to set up an alert at a certain value (e.g. 50%) so that someone can scale the cluster in time. A minimal version of such a check is sketched below.
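
This is just a sketch, assuming GNU df inside the container and the same label and mount path as in the earlier snippets:

```sh
# Print an alert whenever any Cassandra data volume crosses the 50% usage threshold.
THRESHOLD=50
for pod in $(kubectl get pods -l app=cassandra -o jsonpath='{.items[*].metadata.name}'); do
  used=$(kubectl exec "$pod" -- df --output=pcent /var/lib/cassandra | tail -n 1 | tr -dc '0-9')
  if [ "${used:-0}" -ge "$THRESHOLD" ]; then
    echo "ALERT: $pod data volume at ${used}% - time to scale the cluster"
  fi
done
```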
