OOM panic during fixtures import #41009

Closed
awoods187 opened this issue Sep 23, 2019 · 14 comments
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting.

Comments

@awoods187
Contributor

While importing, I observed a dead node:
[screenshot: dead node]
From the logs on the node:

runtime stack:
runtime.throw(0x3f8471c, 0x16)
	/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0x7f9c64bfd818 sp=0x7f9c64bfd7e8 pc=0x78d4c2
runtime.sysMap(0xc248000000, 0x14000000, 0x72f0d18)
	/usr/local/go/src/runtime/mem_linux.go:170 +0xc7 fp=0x7f9c64bfd858 sp=0x7f9c64bfd818 pc=0x777b17
runtime.(*mheap).sysAlloc(0x72ae560, 0x13478000, 0x72ae570, 0x9a3c)
	/usr/local/go/src/runtime/malloc.go:633 +0x1cd fp=0x7f9c64bfd900 sp=0x7f9c64bfd858 pc=0x76aa9d
runtime.(*mheap).grow(0x72ae560, 0x9a3c, 0x0)
	/usr/local/go/src/runtime/mheap.go:1222 +0x42 fp=0x7f9c64bfd958 sp=0x7f9c64bfd900 pc=0x784e32
runtime.(*mheap).allocSpanLocked(0x72ae560, 0x9a3c, 0x72f0d28, 0x7f9c5c7a5808)
	/usr/local/go/src/runtime/mheap.go:1150 +0x37f fp=0x7f9c64bfd990 sp=0x7f9c64bfd958 pc=0x784d1f
runtime.(*mheap).alloc_m(0x72ae560, 0x9a3c, 0x7f9c5ae50101, 0x72ae570)
	/usr/local/go/src/runtime/mheap.go:977 +0xc2 fp=0x7f9c64bfd9e0 sp=0x7f9c64bfd990 pc=0x784372
runtime.(*mheap).alloc.func1()
	/usr/local/go/src/runtime/mheap.go:1048 +0x4c fp=0x7f9c64bfda18 sp=0x7f9c64bfd9e0 pc=0x7b8afc
runtime.(*mheap).alloc(0x72ae560, 0x9a3c, 0x7f9c64000101, 0x7f9c5ae55d08)
	/usr/local/go/src/runtime/mheap.go:1047 +0x8a fp=0x7f9c64bfda68 sp=0x7f9c64bfda18 pc=0x78464a
runtime.largeAlloc(0x13478000, 0x7b0100, 0x7f9c5ae55d08)
	/usr/local/go/src/runtime/malloc.go:1055 +0x99 fp=0x7f9c64bfdaa8 sp=0x7f9c64bfda68 pc=0x76beb9
runtime.mallocgc.func1()
	/usr/local/go/src/runtime/malloc.go:950 +0x46 fp=0x7f9c64bfdad8 sp=0x7f9c64bfdaa8 pc=0x7b7a86
runtime.systemstack(0x7f9c6b497c50)
	/usr/local/go/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f9c64bfdae0 sp=0x7f9c64bfdad8 pc=0x7bae36
runtime.mstart()
	/usr/local/go/src/runtime/proc.go:1153 fp=0x7f9c64bfdae8 sp=0x7f9c64bfdae0 pc=0x791a70

cockroach.log

Repro steps:
roachprod create $CLUSTER -n 31 --clouds=aws --aws-machine-type-ssd=c5d.2xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-30 cockroach
roachprod stage $CLUSTER:31 workload
roachprod start $CLUSTER:1-30 --racks=10 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
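
(For the heap profiles discussed in the comments below: they can be pulled straight off a node with the standard Go pprof tooling, since cockroach serves the net/http/pprof handlers on its HTTP port. A minimal sketch, assuming the default HTTP port of 8080 and an insecure cluster; the node address is a placeholder:)

# Summarize the current Go heap allocations on node 1
go tool pprof -top http://<node-1-address>:8080/debug/pprof/heap
# Or save the raw profile for later comparison
curl -o heap.pprof http://<node-1-address>:8080/debug/pprof/heap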

awoods187 changed the title from "panic during fixtures import" to "OOM panic during fixtures import" on Sep 23, 2019
awoods187 added the C-bug and S-2-temp-unavailability labels on Sep 23, 2019
@awoods187
Contributor Author

Also, why is one range still under-replicated?

@dt
Member

dt commented Sep 24, 2019

How much total mem did these nodes have?

The heap profile does indicate that the buffering adders were indeed using a lot of memory -- as they are supposed to, since they use as much memory as the budget from --max-sql-memory provides them. They were using 2.3 GB in that profile.

There are some hefty amounts on the KV side as well -- half a GB just in unmarshalling SST KV commands -- which I'm wondering might have pushed it over the edge. I don't know what our memory accounting / control looks like in KV.

@awoods187
Contributor Author

[screenshot: c5d.2xlarge instance specs, 2019-09-24]

https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-instances-with-local-nvme-storage-c5d/
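
(For rough context against the numbers above: a c5d.2xlarge has 16 GiB of RAM. A back-of-the-envelope sketch, assuming --max-sql-memory resolved to 25% of RAM on these runs -- an assumption, since the repro steps don't show the fraction used:)

0.25 * 16 GiB = 4 GiB SQL memory budget per node
~2.3 GB of that budget observed in the buffering adders
~0.5 GB more spent unmarshalling SST KV commands on the KV side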

@dt
Member

dt commented Sep 24, 2019

Yeah, so 2.3 GB in the buffering adders doesn't seem like why they OOM'ed; the total Go profile above only accounts for <4 GB. If I had to guess, IMPORT's processor buffering contributed to memory pressure (but likely accounted for no more than 25% of it), and something else must have been what pushed it over the edge.
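
(Since the panic above is the Go runtime failing to map more memory, one way to see where the rest went is to compare the process RSS -- which includes RocksDB's C++ allocations that never appear in a Go heap profile -- against the Go profile, and to check the kernel log. A rough sketch reusing the roachprod setup from the repro steps:)

# The kernel's view of the cockroach process; RSS includes non-Go (RocksDB) memory
roachprod run $CLUSTER:1 -- "ps -o rss,vsz,cmd -C cockroach"
# Any kernel memory/OOM messages around the time of the crash
roachprod run $CLUSTER:1 -- "dmesg | grep -i -e 'out of memory' -e oom"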

@awoods187
Contributor Author

This repros. I ran the exact same setup today and observed a node failure:
[screenshot: dead node]

Cluster is up: andy-30node

@awoods187
Contributor Author

This worked in the run-up to 19.1, as evidenced in this issue. Do we need to fix this for 19.2?

@awoods187
Contributor Author

awoods187 commented Sep 27, 2019

I retried this today and only modified the start to:
roachprod start $CLUSTER:1-30 --racks=10 --args "--max-sql-memory=.20"
The import completed without crashing. The highest memory usage for any one node is 62%, although there is a large range, going as low as 38%.

[screenshot: per-node memory usage]

Should we lower the default memory usage in 19.2? I just realized this is for the entire cluster and we probably can't change it. Instead, I think we need to grant less memory to the import.
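
(A quick sketch of what that change works out to on these 16 GiB nodes, assuming the earlier runs were effectively at the commonly used 25% fraction -- again an assumption:)

--max-sql-memory=.25  ->  0.25 * 16 GiB = 4.0 GiB per node
--max-sql-memory=.20  ->  0.20 * 16 GiB = 3.2 GiB per node
i.e. roughly 800 MiB less handed to SQL (and to IMPORT's buffering adders) on each node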

@pbardea pbardea self-assigned this Oct 3, 2019
@pbardea
Contributor

pbardea commented Oct 3, 2019

Going to take a look to see if I can repro and get a profile for the import with and without direct ingestion.

@dt
Member

dt commented Oct 3, 2019

Note that @nvanbenschoten just found (and fixed) a significant memory leak in the RocksDB logging. I didn't see it in Andy's profile above, though.

@pbardea
Contributor

pbardea commented Oct 9, 2019

I've run this several times over the last week and could not discern any meaningful difference between heap profiles with and without direct ingestion. Both complete on my clusters when I run the above tests. @awoods187 Have you been able to repro this lately? It seems that a recent fix on master may have solved this (possibly the logging leak fix?).

@awoods187
Contributor Author

I haven't run this since I last posted. I can try again if you think it'd be helpful.

@pbardea
Contributor

pbardea commented Oct 9, 2019

The cluster where I completed the latest successful run is still up: paul-test (now expired).

If you could try to repro to see if you're still hitting it, that would be great. If the issue persists, feel free to ping this issue and I can take a look at the logs/profiles (I haven't been able to get an OOM in the 5 times I've run this).

@awoods187
Contributor Author

I ran today on v19.2.0-beta.20190930-234-g6d79c7a and was not able to reproduce. I'm a bit perplexed as to why it got fixed, as discussed in the release-blockers meeting, but I'm happy it seems to have been addressed.
