roachtest: kv/splits/nodes=3/quiesce=true failed #93579
Node 1 OOM'd here. This is most likely due to #92858: the additional per-replica memory overhead introduced in that patch caused the node to OOM before it could reach 300k replicas. I'll look into a fix tomorrow. This wasn't unexpected, but it motivates pursuing stat consolidation (#87187) more seriously now.
wip resolves cockroachdb#87187 resolves cockroachdb#93579 Release note: None
@kvoli How significant is this memory overhead? Was this test already close to the OOM threshold? I.e. should we leave the |
It was already close to the threshold. The patch added approximately 2 KB per replica, so about 600 MB on this test.
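As a quick check on that figure, the arithmetic can be sketched as below. This assumes roughly 300k replicas on the node (the test's target) and decimal units; the function name is illustrative only and is not part of the test or of CockroachDB.

```python
def added_overhead_mb(per_replica_kb: float, replicas: int) -> float:
    """Total extra memory in MB for a fixed per-replica overhead given in KB.

    Uses decimal units (1 MB = 1000 KB), which is an assumption.
    """
    return per_replica_kb * replicas / 1000


# ~2 KB per replica at the 300k-replica target of this test
print(added_overhead_mb(2, 300_000))  # 600.0
```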
Sounds non-trivial to me, although I don't know a typical Replica footprint. Do you know what percentage that is? |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ d6f98e90684894fd36f53596e6aac355676d232e:
Parameters: |
The test fails at 298k ranges (out of the 300k target).
In the most recent failed test, the heap pprof for the OOM'd node showed that 33% of total heap memory was used for replica stats (KV only, I believe). This is out of the 72% of total heap associated with replicas. From the heap pprof, it looks like about 15 KB per replica, of which 8 KB is attributable to replica stats. I'm going to work on a patch now to reduce this by a constant factor per replica. Longer term, I think this needs a lower variable cost (< 1 KB per replica) plus a fixed cost per store.
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 93ed65565357538c9048ff45c878a493f2ed9b45:
Parameters: |
I have a patch, #93823, that is a medium-term fix for this. It reduces the per-replica memory footprint of load stats by 50% (leaving about 4 GB of available memory). Longer term, it would be ideal to maintain only counters on each replica and to track a fixed number of replicas using more memory-intensive moving-average strategies.
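To make the longer-term idea concrete, here is a minimal sketch, not the actual CockroachDB design: every replica keeps only a cheap counter, and a store-level tracker keeps exponentially weighted moving averages for a bounded number of the busiest replicas. The class name, the `max_tracked` and `alpha` parameters, and the choice of an EWMA are all assumptions made for illustration.

```python
import heapq


class ReplicaLoadTracker:
    """Hypothetical store-level tracker (illustrative only).

    All replicas get a plain integer counter; only the `max_tracked`
    busiest replicas per interval carry moving-average state.
    """

    def __init__(self, max_tracked: int, alpha: float = 0.5):
        self.max_tracked = max_tracked
        self.alpha = alpha
        self.counters = {}  # range_id -> requests seen in the current interval
        self.ewma = {}      # range_id -> smoothed rate; at most max_tracked entries

    def record(self, range_id: int, n: int = 1) -> None:
        """Count n requests against a replica; O(1) and one int per replica."""
        self.counters[range_id] = self.counters.get(range_id, 0) + n

    def tick(self) -> None:
        """Close the interval: update EWMAs for the busiest replicas only."""
        busiest = heapq.nlargest(
            self.max_tracked, self.counters.items(), key=lambda kv: kv[1]
        )
        updated = {}
        for range_id, count in busiest:
            # A replica entering the tracked set starts from its current count.
            prev = self.ewma.get(range_id, float(count))
            updated[range_id] = self.alpha * count + (1 - self.alpha) * prev
        self.ewma = updated
        # Reset counters for the next interval, keeping one entry per replica.
        self.counters = dict.fromkeys(self.counters, 0)


tracker = ReplicaLoadTracker(max_tracked=2)
tracker.record(1, 10)
tracker.record(2, 5)
tracker.record(3, 1)
tracker.tick()
print(sorted(tracker.ewma))  # [1, 2]
```

In this sketch the variable cost per replica is a single counter, while the heavier moving-average state is capped at `max_tracked` entries per store regardless of how many replicas exist.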
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 15bc0c4780681d95e705c22dbe2e7d8d190054ef:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 4232883add85a151c423c45904ac4096d04656c5:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ c4bde8b72cdd4016845ae70ef5162b3f11fab1fb:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ c4bde8b72cdd4016845ae70ef5162b3f11fab1fb:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 9c5375f6a7375724cdbcbaa0029ed97a230d7abe:
Parameters: |
#93823 should resolve this failing test. It would be nice to enforce a lower per-replica memory budget by increasing the number of splits, so that this test fails earlier and catches creeping heap allocations. We could increase the splits in this test (or add another test) to 375-400k with the 14 GB of available memory (4 GB of headroom with that patch).
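A rough sanity check of the 375-400k range is sketched below. It assumes that the 4 GB of headroom is measured at 300k replicas and that memory use scales linearly with replica count with no fixed component; any fixed per-node cost would make the real usage lower than this projection. The names are illustrative only.

```python
AVAILABLE_GB = 14.0          # available memory, from the comment above
HEADROOM_GB = 4.0            # headroom with the patch, assumed to be at 300k replicas
BASELINE_REPLICAS = 300_000  # current split target of the test


def projected_usage_gb(replicas: int) -> float:
    """Linearly extrapolate memory use from the 300k-replica baseline."""
    used_at_baseline = AVAILABLE_GB - HEADROOM_GB
    return used_at_baseline * replicas / BASELINE_REPLICAS


for target in (375_000, 400_000):
    usage = projected_usage_gb(target)
    print(f"{target} replicas: ~{usage:.1f} GB, fits={usage < AVAILABLE_GB}")
```

Under these assumptions, 375k replicas would use about 12.5 GB and 400k about 13.3 GB, so both targets fit within 14 GB while leaving little room for new per-replica allocations.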
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 64e4fc9faa4e0ab19fe5ba78f053bc2b1390cb5e:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 250238cd29102391dddbc8cc71380090c49ce509:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 0725273ac7f789ba8ed78aacaf73cc953ca47fe8:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 1d7bd69205c2197ccac33df9e2e6d4ff8c0fdbcf:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 2c3d75f1ce31024d7ffe530f91f22162c053abcd:
Parameters: |
roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 052acc88ad9d7296ce6b8b441627fb469cc74d95:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_encrypted=false, ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-22415
Epic CRDB-18656