roachtest: kv/splits/nodes=3/quiesce=true failed #93579

Closed
cockroach-teamcity opened this issue Dec 14, 2022 · 19 comments · Fixed by #93823
Assignees
Labels
A-kv-observability branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Dec 14, 2022

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 052acc88ad9d7296ce6b8b441627fb469cc74d95:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(test_impl.go:291).Fatal: output in run_062628.605496985_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-22415

Epic CRDB-18656

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 14, 2022
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Dec 14, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 14, 2022
@kvoli
Collaborator

kvoli commented Dec 14, 2022

Node 1 OOM'd here. This is most likely due to #92858.

The additional per-replica memory overhead introduced in that patch caused the node to OOM before it could reach 300k replicas.

I'll look into a fix tomorrow. This wasn't unexpected, but it motivates pursuing stat consolidation (#87187) more seriously now.

@kvoli kvoli self-assigned this Dec 14, 2022
kvoli added a commit to kvoli/cockroach that referenced this issue Dec 14, 2022
wip

resolves cockroachdb#87187
resolves cockroachdb#93579

Release note: None
@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-observability labels Dec 14, 2022
@pav-kv
Collaborator

pav-kv commented Dec 14, 2022

@kvoli How significant is this memory overhead? Was this test already close to the OOM threshold?

I.e. should we leave the release-blocker label?

@kvoli
Collaborator

kvoli commented Dec 14, 2022

It was already close to the threshold. The patch added approximately 2 KB per replica, so about 600 MB on this test.
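
The 600 MB figure follows from multiplying the per-replica overhead by the replica count. A minimal sketch of that arithmetic, assuming decimal units and roughly 300k replicas on the node (taken from the `--splits=300000` flag; the helper name is hypothetical):

```python
# Back-of-the-envelope check of the quoted overhead; nothing here is measured.
KB = 1000          # assumption: decimal units
MB = 1000 * KB


def added_overhead_bytes(per_replica_bytes: int, replicas: int) -> int:
    """Total extra memory when every replica carries the same added cost."""
    return per_replica_bytes * replicas


overhead = added_overhead_bytes(2 * KB, 300_000)
print(overhead / MB)  # 600.0
```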

@pav-kv
Collaborator

pav-kv commented Dec 14, 2022

Sounds non-trivial to me, although I don't know a typical Replica footprint. Do you know what percentage that is?

@pav-kv pav-kv removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Dec 14, 2022
@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ d6f98e90684894fd36f53596e6aac355676d232e:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(test_impl.go:291).Fatal: output in run_062230.511601923_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@kvoli
Collaborator

kvoli commented Dec 15, 2022

The test fails at 298k ranges (out of 300k).

image

> Sounds non-trivial to me, although I don't know a typical Replica footprint. Do you know what percentage that is?

In the most recent failed test, the heap pprof for the OOM'd node showed that 33% of total heap memory was used for replica stats (KV only, I believe).

image

This is out of the 72% of total heap associated with replicas:

image

From the heap pprof, it looks like about 15 KB per replica, of which 8 KB is attributable to replicastats.

I'm going to work on a patch now to reduce this by a constant factor per replica. Longer term, I think this needs a lower variable cost (< 1 KB per replica) and a fixed cost per store.
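
As a rough cross-check of the numbers above (not a measurement), the replicastats share can be computed two ways: from the heap percentages and from the per-replica estimates. The only inputs are the figures quoted in this comment; the variable names are illustrative.

```python
# Figures quoted in the comment; nothing here is measured.
replica_share_of_heap = 0.72   # heap associated with replicas
stats_share_of_heap = 0.33     # heap used for replica stats
per_replica_kb = 15            # estimated heap per replica
stats_per_replica_kb = 8       # portion attributed to replicastats

# Share of replica-associated heap taken by replica stats, computed two ways.
share_from_percentages = stats_share_of_heap / replica_share_of_heap
share_from_per_replica = stats_per_replica_kb / per_replica_kb

print(round(share_from_percentages, 2))  # 0.46
print(round(share_from_per_replica, 2))  # 0.53
```

Both estimates put replica stats at roughly half of the replica-associated heap.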

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 93ed65565357538c9048ff45c878a493f2ed9b45:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(test_impl.go:291).Fatal: output in run_062255.852512248_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@kvoli
Collaborator

kvoli commented Dec 19, 2022

I have a patch, #93823, that is a medium-term fix for this. It reduces the per-replica memory footprint of load stats by 50% (about 4 GB of headroom now).

Longer term, it would be ideal to maintain only counters on each replica and track a fixed number of replicas using more memory-intensive moving-average strategies.
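
To illustrate the longer-term idea, here is a minimal sketch, not the actual CockroachDB design: every replica keeps only a cheap counter, while a bounded set of the busiest replicas gets a heavier moving average. All class names, the `max_tracked` parameter, and the choice of an exponentially weighted average are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReplicaCounter:
    """Cheap per-replica state: one running request count (hypothetical)."""
    requests: int = 0


@dataclass
class MovingAverage:
    """Heavier state kept only for tracked replicas (EWMA is an assumption)."""
    alpha: float = 0.5
    value: float = 0.0

    def update(self, sample: float) -> None:
        self.value = self.alpha * sample + (1 - self.alpha) * self.value


class StoreLoadTracker:
    """Variable cost: one counter per replica. Fixed cost: max_tracked averages."""

    def __init__(self, max_tracked: int):
        self.max_tracked = max_tracked
        self.counters = {}  # range ID -> ReplicaCounter
        self.tracked = {}   # range ID -> MovingAverage, at most max_tracked entries

    def record(self, range_id: int, n: int = 1) -> None:
        self.counters.setdefault(range_id, ReplicaCounter()).requests += n

    def tick(self) -> None:
        """Once per interval: refresh the hot set, update averages, reset counters."""
        hottest = sorted(
            self.counters, key=lambda r: self.counters[r].requests, reverse=True
        )[: self.max_tracked]
        # Replicas that fell out of the hot set lose their moving average,
        # which keeps the expensive state bounded.
        self.tracked = {r: self.tracked.get(r, MovingAverage()) for r in hottest}
        for r in hottest:
            self.tracked[r].update(self.counters[r].requests)
        for counter in self.counters.values():
            counter.requests = 0


tracker = StoreLoadTracker(max_tracked=2)
tracker.record(1, 10)
tracker.record(2, 5)
tracker.record(3, 1)
tracker.tick()
print(sorted(tracker.tracked))  # [1, 2]
```

The point of the split is that per-replica memory stays small and constant, while the cost of the richer statistics depends only on `max_tracked`, not on the number of replicas.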

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 15bc0c4780681d95e705c22dbe2e7d8d190054ef:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(test_impl.go:291).Fatal: output in run_062300.618391764_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 4232883add85a151c423c45904ac4096d04656c5:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(test_impl.go:291).Fatal: output in run_062210.708400693_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ c4bde8b72cdd4016845ae70ef5162b3f11fab1fb:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(test_impl.go:291).Fatal: output in run_062239.427583472_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 3: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ c4bde8b72cdd4016845ae70ef5162b3f11fab1fb:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(test_impl.go:291).Fatal: output in run_062255.354302199_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 9c5375f6a7375724cdbcbaa0029ed97a230d7abe:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(test_impl.go:291).Fatal: output in run_062722.926130212_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@kvoli
Collaborator

kvoli commented Dec 28, 2022

#93823 should resolve this failing test. It would be nice to enforce a lower per-replica memory budget by increasing the number of splits, so that this test failing would catch creeping heap allocations earlier.

We could increase the splits in this test (or add another test) to 375-400k, given the 14 GB of available memory (4 GB of headroom with that patch).
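
A rough projection of what a higher split count would leave as headroom. This is a sketch under loud assumptions: memory is taken to scale linearly with the number of splits, and the 4 GB of headroom is read as roughly 10 GB used out of 14 GB at 300k splits. Any fixed baseline memory would make the real per-split cost lower, so this estimate is on the pessimistic side.

```python
GB = 1000 ** 3                 # assumption: decimal units
available = 14 * GB            # available memory quoted above
headroom_at_baseline = 4 * GB  # headroom quoted for the patched build
baseline_splits = 300_000

# Assumption: all used memory is proportional to the number of splits.
per_split_bytes = (available - headroom_at_baseline) / baseline_splits


def projected_headroom_gb(splits: int) -> float:
    """Headroom left at a given split count under the linear assumption."""
    return (available - per_split_bytes * splits) / GB


for splits in (300_000, 375_000, 400_000):
    print(splits, round(projected_headroom_gb(splits), 2))
```

Under these assumptions, the proposed range leaves well under 2 GB of headroom, so a modest per-replica regression would tip the test over, which is the intent.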

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 64e4fc9faa4e0ab19fe5ba78f053bc2b1390cb5e:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(cluster.go:1933).Run: output in run_062310.145278928_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 250238cd29102391dddbc8cc71380090c49ce509:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(cluster.go:1933).Run: output in run_062649.065426369_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 0725273ac7f789ba8ed78aacaf73cc953ca47fe8:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(cluster.go:1933).Run: output in run_062201.133474248_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 1d7bd69205c2197ccac33df9e2e6d4ff8c0fdbcf:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(cluster.go:1933).Run: output in run_062226.772540067_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 2: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

@cockroach-teamcity
Member Author

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 2c3d75f1ce31024d7ffe530f91f22162c053abcd:

test artifacts and logs in: /artifacts/kv/splits/nodes=3/quiesce=true/run_1
(cluster.go:1933).Run: output in run_062148.310928853_n4_workload_run_kv: ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0
