Hung import during tpc-c #34499
Grabbed the goroutine dump from the stuck node: https://gist.github.com/awoods187/4e2b4716a0a4bfcdd02e840de527a5f4
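For reference, a dump like this can be pulled straight from the node's pprof debug endpoint (assuming the default admin HTTP port 8080):
curl 'http://<node-address>:8080/debug/pprof/goroutine?debug=2' > goroutines.txt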
The huge number of goroutines stuck like this is why we dumped the goroutines:
Seems like a deadlock somewhere.
Cc @petermattis. I'm also planning to take a look, but it won't be before next week.
Andy, can you check that you're not simply running out of IOPS quota or the like on AWS? The goroutine dump unfortunately doesn't help much if the problem is really that something is stuck at the RocksDB level. We'd need to catch this in action and then jump on it.
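A quick sanity check for disk saturation on the nodes would be something like the following (assuming iostat from the sysstat package is available on those machines):
roachprod run $CLUSTER:1-15 -- 'iostat -dx 1 5'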
I don't think I ran out of anything here because it was SSD on AWS, which scales IOPS accordingly (on
Seems to be stuckage deep inside RocksDB:
The goroutine above is the one blocking all other RocksDB batch commits.
So 4 different goroutines stuck in RocksDB. @awoods187 How reproducible is this? I think I'd need to inspect a running cluster to see what is going on. Ideally I'd get a core dump which I can spelunk with a debugger.
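If it shows up again, grabbing a core from the stuck process would let us poke at the RocksDB threads offline; roughly something like this (gcore ships with gdb, and this assumes a single cockroach process on the node; the node placeholder is hypothetical):
roachprod run $CLUSTER:<stuck-node> -- 'sudo gcore -o /mnt/data1/cockroach.core $(pgrep -x cockroach)'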
I've only seen it once, so it's not very reproducible. I can let you know if I see it again, or we can try on that particular SHA and setup a few times.
Describe the problem
A node crashed during the import, but the import is hung instead of failing.
To Reproduce
export CLUSTER=andy-nolbs
roachprod create $CLUSTER -n 16 --clouds=aws --aws-machine-type-ssd=c5d.4xlarge
roachprod run $CLUSTER:1-15 -- 'sudo umount /mnt/data1; sudo mount -o discard,defaults,nobarrier /dev/nvme1n1 /mnt/data1/; mount | grep /mnt/data1'
roachprod stage $CLUSTER:1-15 cockroach
roachprod stage $CLUSTER:16 workload
roachprod start $CLUSTER:1-15 --racks=5 -e COCKROACH_ENGINE_MAX_SYNC_DURATION=24h
Turn off load-based splitting before loading fixtures:
roachprod sql $CLUSTER:1
SET CLUSTER SETTING kv.range_split.by_load_enabled = false
roachprod adminurl --open $CLUSTER:1
roachprod run $CLUSTER:1 -- "./cockroach workload fixtures import tpcc --warehouses=10000 --db=tpcc"
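To confirm the import is hung rather than just slow, the job's progress can be checked from SQL on any node (SHOW JOBS reports the IMPORT job's status and fraction completed):
roachprod sql $CLUSTER:1
SHOW JOBS;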
Environment:
v2.2.0-alpha.20181217-988-g63f8ee7