OOM panic during fixtures import #41009

Closed
awoods187 opened this issue Sep 23, 2019 · 14 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@awoods187
Contributor

While importing, I observed a dead node:
[screenshot: dead node]
From the logs on the node:

runtime stack:
runtime.throw(0x3f8471c, 0x16)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0x7f9c64bfd818 sp=0x7f9c64bfd7e8 pc=0x78d4c2
runtime.sysMap(0xc248000000, 0x14000000, 0x72f0d18)
	/usr/local/go/src/runtime/mem_linux.go:170 +0xc7 fp=0x7f9c64bfd858 sp=0x7f9c64bfd818 pc=0x777b17
runtime.(*mheap).sysAlloc(0x72ae560, 0x13478000, 0x72ae570, 0x9a3c)
	/usr/local/go/src/runtime/malloc.go:633 +0x1cd fp=0x7f9c64bfd900 sp=0x7f9c64bfd858 pc=0x76aa9d
runtime.(*mheap).grow(0x72ae560, 0x9a3c, 0x0)
	/usr/local/go/src/runtime/mheap.go:1222 +0x42 fp=0x7f9c64bfd958 sp=0x7f9c64bfd900 pc=0x784e32
runtime.(*mheap).allocSpanLocked(0x72ae560, 0x9a3c, 0x72f0d28, 0x7f9c5c7a5808)
	/usr/local/go/src/runtime/mheap.go:1150 +0x37f fp=0x7f9c64bfd990 sp=0x7f9c64bfd958 pc=0x784d1f
runtime.(*mheap).alloc_m(0x72ae560, 0x9a3c, 0x7f9c5ae50101, 0x72ae570)
	/usr/local/go/src/runtime/mheap.go:977 +0xc2 fp=0x7f9c64bfd9e0 sp=0x7f9c64bfd990 pc=0x784372
runtime.(*mheap).alloc.func1()
	/usr/local/go/src/runtime/mheap.go:1048 +0x4c fp=0x7f9c64bfda18 sp=0x7f9c64bfd9e0 pc=0x7b8afc
runtime.(*mheap).alloc(0x72ae560, 0x9a3c, 0x7f9c64000101, 0x7f9c5ae55d08)
	/usr/local/go/src/runtime/mheap.go:1047 +0x8a fp=0x7f9c64bfda68 sp=0x7f9c64bfda18 pc=0x78464a
runtime.largeAlloc(0x13478000, 0x7b0100, 0x7f9c5ae55d08)
	/usr/local/go/src/runtime/malloc.go:1055 +0x99 fp=0x7f9c64bfdaa8 sp=0x7f9c64bfda68 pc=0x76beb9
runtime.mallocgc.func1()
	/usr/local/go/src/runtime/malloc.go:950 +0x46 fp=0x7f9c64bfdad8 sp=0x7f9c64bfdaa8 pc=0x7b7a86
runtime.systemstack(0x7f9c6b497c50)
	/usr/local/go/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f9c64bfdae0 sp=0x7f9c64bfdad8 pc=0x7bae36
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1153 fp=0x7f9c64bfdae8 sp=0x7f9c64bfdae0 pc=0x791a70

cockroach.log

Repro steps:
roachprod create $CLUSTER -n 31 --clouds=aws --aws-machine-type-ssd=c5d.2xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-30 cockroach
roachprod stage $CLUSTER:31 workload
roachprod start $CLUSTER:1-30 --racks=10 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
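
(For the heap profiles discussed in the comments below: they can be pulled straight off a node with the standard Go pprof tooling, since cockroach serves the net/http/pprof handlers on its HTTP port. A minimal sketch, assuming the default HTTP port of 8080 and an insecure cluster; the node address is a placeholder:)

# Summarize the current Go heap allocations on node 1
go tool pprof -top http://<node-1-address>:8080/debug/pprof/heap
# Or save the raw profile for later comparison
curl -o heap.pprof http://<node-1-address>:8080/debug/pprof/heap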

awoods187 changed the title from "panic during fixtures import" to "OOM panic during fixtures import" on Sep 23, 2019
awoods187 added the C-bug and S-2-temp-unavailability labels on Sep 23, 2019
@awoods187
Contributor Author

Also, why is one range still under-replicated?

@dt
Member

dt commented Sep 24, 2019

How much total mem did these nodes have?

The heap profile does indicate that the buffering adders were indeed using a lot of memory -- as they are supposed to, since they use as much memory as the budget from --max-sql-memory provides them. They were using 2.3 GB in that profile.

There are some hefty amounts on the KV side as well -- half a GB just in unmarshalling SST KV commands -- which I'm wondering might have pushed it over the edge. I don't know what our memory accounting / control looks like in KV.

@awoods187
Contributor Author

[screenshot: c5d.2xlarge instance specs, 2019-09-24]

https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-instances-with-local-nvme-storage-c5d/
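
(For rough context against the numbers above: a c5d.2xlarge has 16 GiB of RAM. A back-of-the-envelope sketch, assuming --max-sql-memory resolved to 25% of RAM on these runs -- an assumption, since the repro steps don't show the fraction used:)

0.25 * 16 GiB = 4 GiB SQL memory budget per node
~2.3 GB of that budget observed in the buffering adders
~0.5 GB more spent unmarshalling SST KV commands on the KV side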

@dt
Member

dt commented Sep 24, 2019

Yeah, so 2.3 GB in the buffering adders doesn't seem like why they OOM'ed; the total Go profile above only accounts for <4 GB. If I had to guess, IMPORT's processor buffering contributed to memory pressure (but likely accounted for no more than 25% of it), and something else must have been what pushed it over the edge.
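
(Since the panic above is the Go runtime failing to map more memory, one way to see where the rest went is to compare the process RSS -- which includes RocksDB's C++ allocations that never appear in a Go heap profile -- against the Go profile, and to check the kernel log. A rough sketch reusing the roachprod setup from the repro steps:)

# The kernel's view of the cockroach process; RSS includes non-Go (RocksDB) memory
roachprod run $CLUSTER:1 -- "ps -o rss,vsz,cmd -C cockroach"
# Any kernel memory/OOM messages around the time of the crash
roachprod run $CLUSTER:1 -- "dmesg | grep -i -e 'out of memory' -e oom"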

@awoods187
Contributor Author

This repros. I ran the exact same setup today and observed a node failure:
[screenshot: dead node]

Cluster is up: andy-30node

@awoods187
Contributor Author

This worked in the run-up to 19.1, as evidenced in this issue. Do we need to fix this for 19.2?

@awoods187
Contributor Author

awoods187 commented Sep 27, 2019

I retried this today and only modified the start to:
roachprod start $CLUSTER:1-30 --racks=10 --args "--max-sql-memory=.20"
The import completed without crashing. The highest memory usage for any one node is 62%, although there is a large range, going as low as 38%.

[screenshot: per-node memory usage]

Should we lower the default memory usage in 19.2? I just realized this is for the entire cluster and we probably can't change it. Instead, I think we need to grant less memory to the import.
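
(A quick sketch of what that change works out to on these 16 GiB nodes, assuming the earlier runs were effectively at the commonly used 25% fraction -- again an assumption:)

--max-sql-memory=.25  ->  0.25 * 16 GiB = 4.0 GiB per node
--max-sql-memory=.20  ->  0.20 * 16 GiB = 3.2 GiB per node
i.e. roughly 800 MiB less handed to SQL (and to IMPORT's buffering adders) on each node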

@pbardea pbardea self-assigned this Oct 3, 2019
@pbardea
Contributor

pbardea commented Oct 3, 2019

Going to take a look to see if I can repro and get a profile for the import with and without direct ingestion.

@dt
Member

dt commented Oct 3, 2019

Note that @nvanbenschoten just found (and fixed) a significant memory leak in the RocksDB logging. I didn't see it in Andy's profile above, though.

@pbardea
Contributor

pbardea commented Oct 9, 2019

I've run this several times over the last week and could not discern any meaningful difference between heap profiles with and without direct ingestion. Both complete on my clusters when I run the above tests. @awoods187 Have you been able to repro this lately? It seems that a recent fix on master may have solved this (possibly the logging leak fix?).

@awoods187
Contributor Author

I haven't run this since I last posted. I can try again if you think it'd be helpful.

@pbardea
Contributor

pbardea commented Oct 9, 2019

The cluster where I completed the latest successful run is still up: paul-test (now expired).

If you could try to repro to see if you're still hitting it, that would be great. If the issue persists, feel free to ping this issue and I can take a look at the logs/profiles (I haven't been able to get an OOM in the 5 times I've run this).

@awoods187
Contributor Author

I ran today on v19.2.0-beta.20190930-234-g6d79c7a and was not able to reproduce. I'm a bit perplexed as to why it got fixed, as discussed in the release-blockers meeting, but I'm happy it seems to have been addressed.
