
beta: 20160421 #6120

Closed

mberhault opened this issue Apr 18, 2016 · 9 comments

@mberhault
Contributor

Currently running sha: 94360e2

  • beta cluster: 5 nodes upgraded from existing cluster
  • gamma cluster: 6 nodes from scratch
  • rho cluster: 4 nodes with race-enabled build

All have photos/block_writer running against localhost.

No particular new issues since the last release, but "llrb: inverted range" #6027 is a pain.

@bdarnell
Contributor

Current status:

  • cockroach-rho-2 filled up its disk, while the other nodes in the rho cluster are at 18% used.
  • cockroach-gamma-5 is down with no panics in its logs. The supervisor log shows that it was terminated with SIGKILL. Two of the rho nodes were also SIGKILLed. @mberhault, was that you?

The beta cluster is in bad shape; the admin UI won't load. Logs suggest it's the same raft issues we've been watching. The gamma cluster is working, but it's slow and it looks like it might have been leaking memory (RSS is up to ~12GB on the live nodes). I'm going to restart with a new build soon, but I'll see what I can grab from the gamma cluster first.

@bdarnell
Contributor

The leak is in the C side (12GB RSS, 50MB active go heap), so I don't think we're going to be able to get much out of this without jemalloc. I've got logs saved locally.
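
For reference, a minimal sketch (not the actual tooling used here, and Linux-specific because of the /proc parsing) of how memory can be attributed to the C side from inside a Go process: compare the runtime's view of its heap against the kernel's RSS figure. A large gap points at cgo/RocksDB allocations, which Go's heap profiles cannot see.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// rssBytes returns the process's resident set size by parsing VmRSS from
// /proc/self/status (Linux only).
func rssBytes() (int64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// The line looks like: "VmRSS:  123456 kB".
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	rss, err := rssBytes()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Whatever RSS holds beyond the Go heap is (roughly) C/cgo memory plus
	// runtime overhead such as goroutine stacks and GC metadata.
	fmt.Printf("go heap in use:   %d MB\n", ms.HeapAlloc>>20)
	fmt.Printf("process RSS:      %d MB\n", rss>>20)
	fmt.Printf("non-Go (approx.): %d MB\n", (rss-int64(ms.HeapAlloc))>>20)
}
```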

I'm restarting the beta and gamma clusters with e6880ae. I'll leave rho alone.

@mberhault
Contributor Author

The SIGKILL may be the OOM killer, which sometimes gets triggered before we fail to alloc. In that case the process logs won't show anything; you have to check dmesg.
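
For anyone following along, the check amounts to something like the sketch below. The exact kernel message wording varies by version, so the strings matched here are just a heuristic, not a guaranteed signature.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// dmesg dumps the kernel ring buffer; OOM kills usually leave lines
	// mentioning "Out of memory" and "Killed process".
	out, err := exec.Command("dmesg").Output()
	if err != nil {
		fmt.Println("dmesg failed:", err)
		return
	}
	sc := bufio.NewScanner(strings.NewReader(string(out)))
	for sc.Scan() {
		line := sc.Text()
		if strings.Contains(line, "Out of memory") || strings.Contains(line, "Killed process") {
			fmt.Println(line)
		}
	}
}
```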

The rho cluster doesn't have a very long lifetime: after a day or two you can expect at least one node to OOM due to the race detector overhead, even with 52GB of RAM.

@bdarnell
Contributor

We had a crash on the gamma cluster: #6190. I've restarted it.

The beta cluster has recovered but gamma is now unhealthy (no graphs load in the UI). The logs aren't telling me much, although I'm seeing a lot of (small) snapshots being generated repeatedly.

@mberhault
Contributor Author

beta and gamma: stuck with lots of "deadline exceeded" errors. Traces show long-running heartbeats, long-running txn coordinators, and long-running requests retrying after deadline exceeded. We need to tweak some things here to see if we can get them out of that state; gamma (and to some extent beta) tends to get into it quite a bit under load, especially after a node or two gets restarted.

rho: two machines OOMed, and one ran out of disk. The out-of-space one is because /dev/sdb wasn't mounted. This can happen when the machine locks up due to OOM and gets rebooted; I probably didn't notice and just restarted the process. I need to change my setup scripts to add the drive to /etc/fstab. The good news: no detected races in the logs (I have the race detector set to log, not crash).
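
As a stopgap until the fstab change lands, a pre-flight check along these lines could catch the missing mount before the process starts filling the root disk. The device is the /dev/sdb mentioned above; everything else here is illustrative, not the actual setup scripts.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isMounted reports whether the given device appears in /proc/mounts (Linux).
func isMounted(device string) (bool, error) {
	f, err := os.Open("/proc/mounts")
	if err != nil {
		return false, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line starts with "<device> <mountpoint> <fstype> ...".
		if strings.HasPrefix(sc.Text(), device+" ") {
			return true, nil
		}
	}
	return false, sc.Err()
}

func main() {
	const dataDevice = "/dev/sdb" // the drive the store lives on, per this thread
	ok, err := isMounted(dataDevice)
	if err != nil || !ok {
		fmt.Fprintf(os.Stderr, "%s not mounted (err=%v); refusing to start the server\n", dataDevice, err)
		os.Exit(1)
	}
	fmt.Println(dataDevice, "is mounted; safe to start")
}
```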

@mberhault
Contributor Author

I think we have two areas of concern at this point:

  • once we hit a decent number of "deadline exceeded" errors, we never seem to be able to catch up, even with restarts.
  • this state of affairs causes C memory to keep growing. I have constantly-running profiles on the Go side (see the sketch below), but nothing on the C side. I'll see if I can get jemalloc-enabled builds running on one of the test clusters.
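
The Go-side profiling amounts to something like the following sketch; the interval and file paths are illustrative rather than the actual cluster setup. Note that none of this sees C allocations, which is exactly why jemalloc profiling is the next step.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof handlers
	"os"
	"runtime/pprof"
	"time"
)

func main() {
	// On-demand profiles: curl http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// Periodic snapshots on disk, so there is something to diff after a crash.
	for i := 0; ; i++ {
		time.Sleep(10 * time.Minute)
		f, err := os.Create(fmt.Sprintf("heap-%03d.pprof", i))
		if err != nil {
			log.Fatal(err)
		}
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Print(err)
		}
		f.Close()
	}
}
```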

@mberhault
Contributor Author

However, none of these are new. The two big crashes that were fixed since last beta are:

  • inverted llrb
  • concurrent accesses to the same error structures

@bdarnell
Contributor

One of the beta nodes crashed by running out of file descriptors. From the stack traces, it had almost 2000 SQL connections when it died (and a ulimit of 2048 fds):

1987: IO wait [7~307 minutes] [Created by util.ServeHandler.func3 @ net.go:127]
    net       netpoll.go:160               runtime_pollWait(*, 0x72, *)
    net       fd_poll_runtime.go:73        (*pollDesc).Wait(*, 0x72, 0, 0)
    net       fd_poll_runtime.go:78        (*pollDesc).WaitRead(*, 0, 0)
    net       fd_unix.go:250               (*netFD).Read(*, *, 0x400, 0x400, 0, #26878, #60)
    net       net.go:172                   (*conn).Read(*, *, 0x400, 0x400, 0, 0, 0)
    cmux      cmux.go:204                  (*MuxConn).Read(*, *, 0x400, 0x400, 0, 0, 0)
    tls       conn.go:460                  (*block).readFromUntil(*, #26868, *, 0x5, 0, 0)
    tls       conn.go:562                  (*Conn).readRecord(*, #24, 0, 0)
    tls       conn.go:939                  (*Conn).Read(*, *, 0x1000, 0x1000, 0, 0, 0)
    bufio     bufio.go:97                  (*Reader).fill(*)
    bufio     bufio.go:229                 (*Reader).ReadByte(*, 0x6, 0, 0)
    pgwire    encoding.go:93               (*readBuffer).readTypedMsg(*, #26818, *, 0, 0, 0, 0)
    pgwire    v3.go:246                    (*v3Conn).serve(*, *, 0, 0)
    pgwire    server.go:157                (*Server).ServeConn(#334, #26869, *, 0, 0)
    server    server.go:314                (*Server).Start.func6.1(#26867, *)
    util      net.go:125                   ServeHandler.func3.1(#474, #26867, *, #605)
8: IO wait [97~266 minutes] [Created by util.ServeHandler.func3 @ net.go:127]
    net       netpoll.go:160               runtime_pollWait(*, 0x72, *)
    net       fd_poll_runtime.go:73        (*pollDesc).Wait(*, 0x72, 0, 0)
    net       fd_poll_runtime.go:78        (*pollDesc).WaitRead(*, 0, 0)
    net       fd_unix.go:250               (*netFD).Read(*, *, 0x400, 0x400, 0, #26878, #60)
    net       net.go:172                   (*conn).Read(*, *, 0x400, 0x400, 0, 0, 0)
    cmux      cmux.go:204                  (*MuxConn).Read(*, *, 0x400, 0x400, *, 0, 0)
    tls       conn.go:460                  (*block).readFromUntil(*, #26868, *, 0x5, 0, 0)
    tls       conn.go:562                  (*Conn).readRecord(*, #24, 0, 0)
    tls       conn.go:939                  (*Conn).Read(*, *, 0x1000, 0x1000, 0, 0, 0)
    bufio     bufio.go:97                  (*Reader).fill(*)
    bufio     bufio.go:229                 (*Reader).ReadByte(*, 0x6, 0, 0)
    pgwire    encoding.go:93               (*readBuffer).readTypedMsg(*, #26818, *, 0, 0, 0, 0)
    pgwire    v3.go:246                    (*v3Conn).serve(*, *, 0, 0)
    pgwire    server.go:157                (*Server).ServeConn(#334, #26869, *, 0, 0)
    server    server.go:314                (*Server).Start.func6.1(#26867, *)
    util      net.go:125                   ServeHandler.func3.1(#474, #26867, *, #605)

All currently-running nodes have ~70 file descriptors open. But based on the "IO wait" times shown in the stack traces, these connections accumulated over a period of hours; it wasn't a sudden burst of 2000 new connections. Anyway, we have some work to do here but it's not a new issue.
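
As a note for that follow-up work, the fd budget is easy to surface at startup, along the lines of this sketch (the 10% warning threshold is arbitrary, and /proc/self/fd is Linux-specific):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	var lim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		fmt.Fprintln(os.Stderr, "getrlimit:", err)
		return
	}
	// On Linux, the entries in /proc/self/fd are the process's open descriptors.
	fds, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readdir:", err)
		return
	}
	open := len(fds)
	fmt.Printf("open fds: %d / soft limit %d\n", open, lim.Cur)
	if uint64(open) > lim.Cur*9/10 {
		fmt.Println("warning: within 10% of the fd limit")
	}
}
```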

@mberhault
Contributor Author

I was just looking at that. block_writer seems pretty eager about opening new connections: there are 1.2M sql.Open calls in the block_writer logs (and the logs were truncated):

1221861: chan receive [0~3 minutes] [Created by sql.Open @ sql.go:493]
    sql sql.go:727 (*DB).connectionOpener(*)
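
That goroutine is the connection opener that database/sql starts for every pool, so 1.2M sql.Open calls means 1.2M pools, each holding on to connections until closed. A hedged sketch of the client-side fix, sharing one pool and capping it; the DSN, driver import, and statement are placeholders, not block_writer's actual code:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres wire protocol driver (assumed here)
)

func main() {
	// One pool for the whole process; sql.Open does not dial anything itself,
	// connections are created lazily as queries run.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/photos?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep the server-side connection count well under the node's fd limit.
	db.SetMaxOpenConns(50)
	db.SetMaxIdleConns(50)

	for i := 0; i < 1000; i++ {
		// Placeholder write; block_writer's real statement differs.
		if _, err := db.Exec(`INSERT INTO blocks (id) VALUES ($1)`, i); err != nil {
			log.Print(err)
		}
	}
}
```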
