beta: 20160421 #6120
Current status:
The beta cluster is in bad shape; the admin UI won't load. Logs suggest it's the same raft issues we've been watching. The gamma cluster is working, but it's slow and it looks like it might have been leaking memory (RSS is up to ~12GB on the live nodes). I'm going to restart with a new build soon, but I'll see what I can grab from the gamma cluster first.
The leak is in the C side (12GB RSS, 50MB active Go heap), so I don't think we're going to be able to get much out of this without jemalloc. I've got logs saved locally. I'm restarting the beta and gamma clusters with e6880ae. I'll leave rho alone.
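That 12GB-RSS-vs-50MB-Go-heap comparison is the whole diagnostic: if the Go runtime accounts for only a tiny fraction of what the OS charges to the process, the leak is in C-allocated memory that Go heap profiles won't show, hence the need for jemalloc profiling. A minimal, hypothetical sketch of that check (not code from the cockroach tree; Linux-only, file name made up):

```go
// rssvsgoheap.go: hypothetical sketch comparing the live Go heap against
// process RSS to see whether a leak is in Go-managed memory or on the C side.
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// procRSSKB reads VmRSS (in kB) from /proc/self/status. Linux only.
func procRSSKB() (int64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "VmRSS:") {
			fields := strings.Fields(scanner.Text()) // ["VmRSS:", "<n>", "kB"]
			if len(fields) >= 2 {
				return strconv.ParseInt(fields[1], 10, 64)
			}
		}
	}
	return 0, scanner.Err()
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	rssKB, err := procRSSKB()
	if err != nil {
		fmt.Fprintln(os.Stderr, "rss:", err)
		return
	}
	// If RSS dwarfs HeapAlloc (e.g. ~12GB vs ~50MB), the leak lives outside
	// the Go heap and pprof's heap profile won't see it.
	fmt.Printf("go heap alloc: %d MB, process RSS: %d MB\n",
		ms.HeapAlloc/(1<<20), rssKB/1024)
}
```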
sigkill may be the OOM killer, which sometimes gets triggered before we […]

The rho cluster doesn't have a very long lifetime; after a day or two, you […]

On Wed, Apr 20, 2016 at 7:21 PM, Ben Darnell [email protected] wrote: […]
We had a crash on the gamma cluster: #6190. I've restarted it. The beta cluster has recovered but gamma is now unhealthy (no graphs load in the UI). The logs aren't telling me much, although I'm seeing a lot of (small) snapshots being generated repeatedly.
I think we have two areas of concern at this point:
However, none of these are new. The two big crashes that were fixed since last beta are:
One of the beta nodes crashed by running out of file descriptors. From the stack traces, it had almost 2000 SQL connections when it died (and a ulimit of 2048 fds):
All currently-running nodes have ~70 file descriptors open. But based on the "IO wait" times shown in the stack traces, these connections accumulated over a period of hours; it wasn't a sudden burst of 2000 new connections. Anyway, we have some work to do here but it's not a new issue.
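For reference, a minimal hypothetical sketch (not cockroach code) of the kind of check involved here: compare the count of open descriptors against RLIMIT_NOFILE (the 2048-fd ulimit above), assuming Linux and the syscall package:

```go
// fdcheck.go: hypothetical sketch, not cockroach code.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		return
	}

	// On Linux, each entry in /proc/self/fd is one open descriptor
	// (reading the directory itself briefly uses one more).
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readdir:", err)
		return
	}

	fmt.Printf("open fds: %d, soft limit: %d, hard limit: %d\n",
		len(entries), rlim.Cur, rlim.Max)
	if uint64(len(entries)) > rlim.Cur*9/10 {
		fmt.Println("warning: within 10% of the fd limit")
	}
}
```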
I was just looking at that. The block_writer seems pretty eager about trying new connections:
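As an aside (a sketch only, not block_writer's actual code): Go's database/sql pool dials a new connection whenever every existing one is busy, so long IO waits translate directly into more connections unless the pool is capped; SetMaxOpenConns is the standard guard. The driver, URL, and numbers below are placeholders:

```go
// poolcap.go: illustrative sketch only, not block_writer's actual code.
package main

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq" // hypothetical choice of pg-wire driver
)

func main() {
	// Placeholder URL; the load generators in these tests run against localhost.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Without these caps, database/sql opens a brand-new connection whenever
	// all existing ones are checked out, so stalled queries (the long IO
	// waits in the stack traces) become an ever-growing pile of connections
	// and, eventually, exhausted file descriptors.
	db.SetMaxOpenConns(64) // arbitrary illustrative cap
	db.SetMaxIdleConns(16)

	var wg sync.WaitGroup
	for i := 0; i < 200; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := db.Exec("SELECT 1"); err != nil {
				log.Print(err)
			}
		}()
	}
	wg.Wait()
}
```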
Currently running sha: 94360e2

- beta cluster: 5 nodes upgraded from existing cluster
- gamma cluster: 6 nodes from scratch
- rho cluster: 4 nodes with race-enabled build

All have photos/block_writer running against localhost.
No particular new issues since last release, but "llrb: inverted range" (#6027) is a pain.