
beta: 20160421 #6120

Closed

mberhault opened this issue Apr 18, 2016 · 9 comments

@mberhault
Contributor

Currently running sha: 94360e2

  • beta cluster: 5 nodes upgraded from existing cluster
  • gamma cluster: 6 nodes from scratch
  • rho cluster: 4 nodes with race-enabled build

All have photos/block_writer running against localhost.

No particular new issues since the last release, but "llrb: inverted range" #6027 is a pain.

@bdarnell
Contributor

Current status:

  • cockroach-rho-2 filled up its disk, while the other nodes in the rho cluster are at 18% used.
  • cockroach-gamma-5 is down with no panics in its logs. The supervisor log shows that it was terminated with SIGKILL. Two of the rho nodes were also SIGKILLed. @mberhault, was that you?

The beta cluster is in bad shape; the admin UI won't load. Logs suggest it's the same raft issues we've been watching. The gamma cluster is working, but it's slow and it looks like it might have been leaking memory (RSS is up to ~12GB on the live nodes). I'm going to restart with a new build soon, but I'll see what I can grab from the gamma cluster first.

@bdarnell
Contributor

The leak is in the C side (12GB RSS, 50MB active go heap), so I don't think we're going to be able to get much out of this without jemalloc. I've got logs saved locally.
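
For reference, a minimal sketch (not the actual tooling used here, and Linux-specific because of the /proc parsing) of how memory can be attributed to the C side from inside a Go process: compare the runtime's view of its heap against the kernel's RSS figure. A large gap points at cgo/RocksDB allocations, which Go's heap profiles cannot see.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// rssBytes returns the process's resident set size by parsing VmRSS from
// /proc/self/status (Linux only).
func rssBytes() (int64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// The line looks like: "VmRSS:  123456 kB".
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	rss, err := rssBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Whatever RSS holds beyond the Go heap is (roughly) C/cgo memory plus
	// runtime overhead such as goroutine stacks and GC metadata.
	fmt.Printf("go heap in use:   %d MB\n", ms.HeapAlloc>>20)
	fmt.Printf("process RSS:      %d MB\n", rss>>20)
	fmt.Printf("non-Go (approx.): %d MB\n", (rss-int64(ms.HeapAlloc))>>20)
}
```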

I'm restarting the beta and gamma clusters with e6880ae. I'll leave rho alone.

@mberhault
Contributor Author

The SIGKILL may be the OOM killer, which sometimes gets triggered before we fail to alloc. In that case the process logs won't show anything; you have to check dmesg.
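
For anyone following along, the check amounts to something like the sketch below. The exact kernel message wording varies by version, so the strings matched here are just a heuristic, not a guaranteed signature.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// dmesg dumps the kernel ring buffer; OOM kills usually leave lines
	// mentioning "Out of memory" and "Killed process".
	out, err := exec.Command("dmesg").Output()
	if err != nil {
		fmt.Println("dmesg failed:", err)
		return
	}
	sc := bufio.NewScanner(strings.NewReader(string(out)))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "Out of memory") || strings.Contains(line, "Killed process") {
			fmt.Println(line)
		}
	}
}
```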

The rho cluster doesn't have a very long lifetime: after a day or two you can expect at least one node to OOM due to the race detector overhead, even with 52GB of RAM.

@bdarnell
Contributor

We had a crash on the gamma cluster: #6190. I've restarted it.

The beta cluster has recovered but gamma is now unhealthy (no graphs load in the UI). The logs aren't telling me much, although I'm seeing a lot of (small) snapshots being generated repeatedly.

@mberhault
Contributor Author

beta and gamma: stuck with lots of "deadline exceeded" errors. Traces show long-running heartbeats, long-running txn coordinators, and long-running requests retrying after deadline exceeded. We need to tweak some things here to see if we can get them out of that state; gamma (and to some extent beta) tends to get into it quite a bit under load, especially after a node or two gets restarted.

rho: two machines OOMed, and one ran out of disk. The out-of-space one is because /dev/sdb wasn't mounted. This can happen when the machine locks up due to OOM and gets rebooted; I probably didn't notice and just restarted the process. I need to change my setup scripts to add the drive to /etc/fstab. The good news: no detected races in the logs (I have the race detector set to log, not crash).
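
As a stopgap until the fstab change lands, a pre-flight check along these lines could catch the missing mount before the process starts filling the root disk. The device is the /dev/sdb mentioned above; everything else here is illustrative, not the actual setup scripts.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isMounted reports whether the given device appears in /proc/mounts (Linux).
func isMounted(device string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line starts with "<device> <mountpoint> <fstype> ...".
		if strings.HasPrefix(sc.Text(), device+" ") {
			return true, nil
		}
	}
	return false, sc.Err()
}

func main() {
	const dataDevice = "/dev/sdb" // the drive the store lives on, per this thread
	ok, err := isMounted(dataDevice)
	if err != nil || !ok {
		fmt.Fprintf(os.Stderr, "%s not mounted (err=%v); refusing to start the server\n", dataDevice, err)
		os.Exit(1)
	}
	fmt.Println(dataDevice, "is mounted; safe to start")
}
```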

@mberhault
Contributor Author

I think we have two areas of concern at this point:

  • once we hit a decent number of "deadline exceeded" errors, we never seem to be able to catch up, even with restarts.
  • this state of affairs causes C memory to keep growing. I have constantly-running profiles on the Go side (see the sketch below), but nothing on the C side. I'll see if I can get jemalloc-enabled builds running on one of the test clusters.
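
The Go-side profiling amounts to something like the following sketch; the interval and file paths are illustrative rather than the actual cluster setup. Note that none of this sees C allocations, which is exactly why jemalloc profiling is the next step.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// On-demand profiles: curl http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Periodic snapshots on disk, so there is something to diff after a crash.
	for i := 0; ; i++ {
		time.Sleep(10 * time.Minute)
		f, err := os.Create(fmt.Sprintf("heap-%03d.pprof", i))
		if err != nil {
			log.Fatal(err)
		}
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Print(err)
		}
		f.Close()
	}
}
```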

@mberhault
Contributor Author

However, none of these are new. The two big crashes that were fixed since last beta are:

  • inverted llrb
  • concurrent accesses to the same error structures

@bdarnell
Contributor

One of the beta nodes crashed by running out of file descriptors. From the stack traces, it had almost 2000 SQL connections when it died (and a ulimit of 2048 fds):

1987: IO wait [7~307 minutes] [Created by util.ServeHandler.func3 @ net.go:127]
    net       netpoll.go:160               runtime_pollWait(*, 0x72, *)
    net       fd_poll_runtime.go:73        (*pollDesc).Wait(*, 0x72, 0, 0)
    net       fd_poll_runtime.go:78        (*pollDesc).WaitRead(*, 0, 0)
    net       fd_unix.go:250               (*netFD).Read(*, *, 0x400, 0x400, 0, #26878, #60)
    net       net.go:172                   (*conn).Read(*, *, 0x400, 0x400, 0, 0, 0)
    cmux      cmux.go:204                  (*MuxConn).Read(*, *, 0x400, 0x400, 0, 0, 0)
    tls       conn.go:460                  (*block).readFromUntil(*, #26868, *, 0x5, 0, 0)
    tls       conn.go:562                  (*Conn).readRecord(*, #24, 0, 0)
    tls       conn.go:939                  (*Conn).Read(*, *, 0x1000, 0x1000, 0, 0, 0)
    bufio     bufio.go:97                  (*Reader).fill(*)
    bufio     bufio.go:229                 (*Reader).ReadByte(*, 0x6, 0, 0)
    pgwire    encoding.go:93               (*readBuffer).readTypedMsg(*, #26818, *, 0, 0, 0, 0)
    pgwire    v3.go:246                    (*v3Conn).serve(*, *, 0, 0)
    pgwire    server.go:157                (*Server).ServeConn(#334, #26869, *, 0, 0)
    server    server.go:314                (*Server).Start.func6.1(#26867, *)
    util      net.go:125                   ServeHandler.func3.1(#474, #26867, *, #605)
8: IO wait [97~266 minutes] [Created by util.ServeHandler.func3 @ net.go:127]
    net       netpoll.go:160               runtime_pollWait(*, 0x72, *)
    net       fd_poll_runtime.go:73        (*pollDesc).Wait(*, 0x72, 0, 0)
    net       fd_poll_runtime.go:78        (*pollDesc).WaitRead(*, 0, 0)
    net       fd_unix.go:250               (*netFD).Read(*, *, 0x400, 0x400, 0, #26878, #60)
    net       net.go:172                   (*conn).Read(*, *, 0x400, 0x400, 0, 0, 0)
    cmux      cmux.go:204                  (*MuxConn).Read(*, *, 0x400, 0x400, *, 0, 0)
    tls       conn.go:460                  (*block).readFromUntil(*, #26868, *, 0x5, 0, 0)
    tls       conn.go:562                  (*Conn).readRecord(*, #24, 0, 0)
    tls       conn.go:939                  (*Conn).Read(*, *, 0x1000, 0x1000, 0, 0, 0)
    bufio     bufio.go:97                  (*Reader).fill(*)
    bufio     bufio.go:229                 (*Reader).ReadByte(*, 0x6, 0, 0)
    pgwire    encoding.go:93               (*readBuffer).readTypedMsg(*, #26818, *, 0, 0, 0, 0)
    pgwire    v3.go:246                    (*v3Conn).serve(*, *, 0, 0)
    pgwire    server.go:157                (*Server).ServeConn(#334, #26869, *, 0, 0)
    server    server.go:314                (*Server).Start.func6.1(#26867, *)
    util      net.go:125                   ServeHandler.func3.1(#474, #26867, *, #605)

All currently-running nodes have ~70 file descriptors open. But based on the "IO wait" times shown in the stack traces, these connections accumulated over a period of hours; it wasn't a sudden burst of 2000 new connections. Anyway, we have some work to do here but it's not a new issue.
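
As a note for that follow-up work, the fd budget is easy to surface at startup, along the lines of this sketch (the 10% warning threshold is arbitrary, and /proc/self/fd is Linux-specific):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		return
	}
	// On Linux, the entries in /proc/self/fd are the process's open descriptors.
	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readdir:", err)
		return
	}
	open := len(fds)
	fmt.Printf("open fds: %d / soft limit %d\n", open, lim.Cur)
	if uint64(open) > lim.Cur*9/10 {
		fmt.Println("warning: within 10% of the fd limit")
	}
}
```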

@mberhault
Contributor Author

I was just looking at that. block_writer seems pretty eager about opening new connections: there are 1.2M sql.Open calls in the block_writer logs (and the logs were truncated):

1221861: chan receive [0~3 minutes] [Created by sql.Open @ sql.go:493]
    sql sql.go:727 (*DB).connectionOpener(*)
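
That goroutine is the connection opener that database/sql starts for every pool, so 1.2M sql.Open calls means 1.2M pools, each holding on to connections until closed. A hedged sketch of the client-side fix, sharing one pool and capping it; the DSN, driver import, and statement are placeholders, not block_writer's actual code:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres wire protocol driver (assumed here)
)

func main() {
	// One pool for the whole process; sql.Open does not dial anything itself,
	// connections are created lazily as queries run.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/photos?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep the server-side connection count well under the node's fd limit.
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(50)

	for i := 0; i < 1000; i++ {
		// Placeholder write; block_writer's real statement differs.
		if _, err := db.Exec(`INSERT INTO blocks (id) VALUES ($1)`, i); err != nil {
			log.Print(err)
		}
	}
}
```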
