OOM panic during fixtures import #41009
Comments
Also, why is one range still under-replicated?
How much total mem did these nodes have? The heap profile does indicate that the buffering adders were indeed using a lot of memory -- as they are supposed to, since they use as much as their budget allows. There are some hefty amounts on the kv side as well -- half a GB just in unmarshalling SST KV commands -- which I wonder might have pushed it over the edge. I don't know what our mem accounting / control looks like in kv.
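For anyone poking at a profile like this, a quick way to see where the in-use space is going (a sketch, assuming the profile above was saved as a pprof file; `heap.pb.gz` is a made-up name):

```
# Rank allocations by in-use space to spot the biggest consumers directly.
go tool pprof -top heap.pb.gz
# Sort by cumulative space instead, to surface callers such as the buffering adders.
go tool pprof -top -cum heap.pb.gz
```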
Yeah, so 2.3 GB in the buffering adders doesn't seem like why they OOM'ed; the total Go profile above only accounts for <4 GB. If I had to guess, IMPORT's processor buffering contributed to memory pressure (but likely accounted for no more than 25% of it), and something else seems to be what actually pushed it over the edge.
This worked in the run-up to 19.1, as evidenced in this issue. Do we need to fix this for 19.2?
I retried this today and only modified the start command to: Should we lower the default memory usage in 19.2? I just realized that setting is for the entire cluster, so we probably can't change it. Instead we need to grant less memory to the import, I think.
Going to take a look to see if I can repro and get a profile for the import with and without direct ingestion.
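For reference, capturing comparable heap profiles for the two runs might look something like this (a sketch: it assumes the nodes expose Go's pprof handlers on their HTTP port, 8080 by default, that the cluster is running in insecure mode, and the output file names are made up):

```
# Grab a heap profile from one node during an import run with direct ingestion on...
curl -s http://<node-addr>:8080/debug/pprof/heap > heap_direct.pb.gz
# ...and again during a run with it off.
curl -s http://<node-addr>:8080/debug/pprof/heap > heap_buffered.pb.gz
# Diff the two, treating the non-direct run as the baseline.
go tool pprof -top -base heap_buffered.pb.gz heap_direct.pb.gz
```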
Note that @nvanbenschoten just found (and fixed) a significant memory leak in the RocksDB logging. I didn't see it in Andy's profile above, though.
I've run this several times over the last week and could not discern any meaningful difference between the heap profiles with and without direct ingestion. Both runs complete on my clusters with the tests above. @awoods187 Have you been able to repro this lately? It seems a recent fix on master may have solved this (possibly the logging leak fix?).
I haven't run this since I last posted. I can try again if you think it'd be helpful.
The cluster where I completed the latest successful run is still up: If you could try to repro and see whether you're still hitting it, that would be great. If the issue persists, feel free to ping this issue and I can try to look at the logs/profiles (I haven't been able to get an OOM out of the 5 times I've run this).
I ran today on v19.2.0-beta.20190930-234-g6d79c7a and was not able to reproduce. I'm a bit perplexed as to why it got fixed, as discussed in the release blockers meeting, but happy it seems to have been addressed.
While importing, I observed a dead node:
![image](https://user-images.githubusercontent.com/22278911/65469630-f60cdf00-de1c-11e9-8b25-d3925cba8859.png)
From the logs on the node:
cockroach.log
Repro steps:
roachprod create $CLUSTER -n 31 --clouds=aws --aws-machine-type-ssd=c5d.2xlarge
roachprod run $CLUSTER -- "DEV=$(mount | grep /mnt/data1 | awk '{print $1}'); sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier ${DEV} /mnt/data1/; mount | grep /mnt/data1"
roachprod stage $CLUSTER:1-30 cockroach
roachprod stage $CLUSTER:31 workload
roachprod start $CLUSTER:1-30 --racks=10 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
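Not part of the original repro, but while the import is in flight it can help to watch per-node memory to see which node is getting close to its limit (a sketch using standard tools, in the same style as the steps above):

```
# Print memory usage on every cockroach node; repeat periodically during the import.
roachprod run $CLUSTER:1-30 -- "free -m"
```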