
roachtest: splits/largerange/size=32GiB,nodes=6 failed [OOM] #94371

Closed
cockroach-teamcity opened this issue Dec 28, 2022 · 12 comments
Assignees
Labels
branch-master (Failures and bugs on the master branch) · C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior) · C-test-failure (Broken test, automatically or manually discovered) · no-test-failure-activity · O-roachtest · O-robot (Originated from a bot) · release-blocker (Indicates a release-blocker; use with branch-release-2x.x label to denote which branch is blocked) · T-kv (KV Team)
Milestone

Comments

@cockroach-teamcity

cockroach-teamcity commented Dec 28, 2022

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 6723e00a46aaa3ea575093bd82a02b7d6f6b131b:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(test_impl.go:291).Fatal: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-22859

Epic CRDB-18656

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 28, 2022
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Dec 28, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Dec 28, 2022
@cockroach-teamcity

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 64e4fc9faa4e0ab19fe5ba78f053bc2b1390cb5e:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0


@cockroach-teamcity

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 250238cd29102391dddbc8cc71380090c49ce509:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(sql_runner.go:194).Query: error executing '
select concat('r', range_id::string) as range, voting_replicas
from [ SHOW RANGES FROM DATABASE bank ] where cardinality(voting_replicas) >= $1;': read tcp 172.17.0.3:59572 -> 34.23.31.153:26257: read: connection reset by peer
(monitor.go:127).Wait: monitor failure: monitor task failed: t.Fatal() was called

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0


@cockroach-teamcity

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 0725273ac7f789ba8ed78aacaf73cc953ca47fe8:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0


@cockroach-teamcity

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 0725273ac7f789ba8ed78aacaf73cc953ca47fe8:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0


@cockroach-teamcity

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ 2c3d75f1ce31024d7ffe530f91f22162c053abcd:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(sql_runner.go:194).Query: error executing '
select concat('r', range_id::string) as range, voting_replicas
from [ SHOW RANGES FROM DATABASE bank ] where cardinality(voting_replicas) >= $1;': dial tcp 34.148.222.162:26257: connect: connection refused
(monitor.go:127).Wait: monitor failure: monitor task failed: t.Fatal() was called

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0


@erikgrinaker

n1 OOM-killed. Pretty sure this is fallout from #92858, which also caused OOMs in #93579, and is possibly fixed by #93823. Reassigning to @kvoli to confirm.

[ 8137.222561] cockroach invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 8137.222566] CPU: 1 PID: 39023 Comm: cockroach Not tainted 5.4.0-1044-gcp #47-Ubuntu
[ 8137.222568] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
[ 8137.222569] Call Trace:
[ 8137.222579]  dump_stack+0x6d/0x8b
[ 8137.222583]  dump_header+0x4f/0x1eb
[ 8137.222584]  oom_kill_process.cold+0xb/0x10
[ 8137.222588]  out_of_memory.part.0+0x1df/0x3d0
[ 8137.222589]  out_of_memory+0x6d/0xd0
[ 8137.222592]  mem_cgroup_out_of_memory+0xbd/0xe0
[ 8137.222595]  try_charge+0x77c/0x810
[ 8137.222597]  mem_cgroup_try_charge+0x71/0x190
[ 8137.222599]  __add_to_page_cache_locked+0x265/0x340
[ 8137.222602]  ? scan_shadow_nodes+0x30/0x30
[ 8137.222603]  add_to_page_cache_lru+0x4d/0xd0
[ 8137.222604]  pagecache_get_page+0x101/0x300
[ 8137.222606]  filemap_fault+0x6b2/0xa50
[ 8137.222607]  ? unlock_page_memcg+0x12/0x20
[ 8137.222610]  ? page_add_file_rmap+0xff/0x1a0
[ 8137.222611]  ? xas_load+0xd/0x80
[ 8137.222612]  ? xas_find+0x17f/0x1c0
[ 8137.222614]  ? filemap_map_pages+0x24c/0x380
[ 8137.222617]  ext4_filemap_fault+0x32/0x50
[ 8137.222618]  __do_fault+0x3c/0x130
[ 8137.222620]  do_fault+0x24b/0x640
[ 8137.222622]  __handle_mm_fault+0x4c5/0x7a0
[ 8137.222624]  handle_mm_fault+0xca/0x200
[ 8137.222627]  do_user_addr_fault+0x1f9/0x450
[ 8137.222628]  __do_page_fault+0x58/0x90
[ 8137.222629]  do_page_fault+0x2c/0xe0
[ 8137.222633]  page_fault+0x34/0x40
[ 8137.222635] RIP: 0033:0x67c52e
[ 8137.222637] Code: c0 4d 3b 66 10 0f 86 d1 0d 00 00 48 81 ec c0 00 00 00 48 89 ac 24 b8 00 00 00 48 8d ac 24 b8 00 00 00 48 89 84 24 c8 00 00 00 <4c> 8b 0d 63 3a 2c 09 4c 8b 15 54 3a 2c 09 4c 8b 1d f5 39 2c 09 4c
[ 8137.222638] RSP: 002b:000000c005893230 EFLAGS: 00010206
[ 8137.222639] RAX: 000000c07f6663c0 RBX: 0000000000000007 RCX: 0000000000000040
[ 8137.222640] RDX: 000000c07f6663c0 RSI: 0000000000000000 RDI: 0000000000000000
[ 8137.222641] RBP: 000000c0058932e8 R08: 0000000000000000 R09: 0000000000000000
[ 8137.222642] R10: 00007f0e0cbad3d8 R11: 0000000000000040 R12: 000000c0058932b0
[ 8137.222642] R13: 0000000000000000 R14: 000000c0060b49c0 R15: ffffffffffffffff
[ 8137.222645] memory: usage 14594444kB, limit 14594444kB, failcnt 4617261
[ 8137.222645] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 8137.222646] kmem: usage 51880kB, limit 9007199254740988kB, failcnt 0
[ 8137.222647] Memory cgroup stats for /system.slice/cockroach.service:
[ 8137.222657] anon 14890971136
               file 253952
               kernel_stack 184320
               slab 21319680
               sock 0
               shmem 0
               file_mapped 0
               file_dirty 135168
               file_writeback 270336
               anon_thp 117440512
               inactive_anon 0
               active_anon 14890999808
               inactive_file 0
               active_file 217088
               unevictable 0
               slab_reclaimable 20283392
               slab_unreclaimable 1036288
               pgfault 20381229
               pgmajfault 990
               workingset_refault 2023098
               workingset_activate 49962
               workingset_nodereclaim 0
               pgrefill 18433
               pgscan 8838721
               pgsteal 8814947
               pgactivate 12408
               pgdeactivate 16641
               pglazyfree 0
               pglazyfreed 0
               thp_fault_alloc 66
               thp_collapse_alloc 0
[ 8137.222658] Tasks state (memory values in pages):
[ 8137.222658] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 8137.222661] [  38401]  1000 38401     2154      618    57344        0             0 bash
[ 8137.222662] [  38406]  1000 38406  4099404  3643144 31518720        0             0 cockroach
[ 8137.222664] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/system.slice/cockroach.service,task_memcg=/system.slice/cockroach.service,task=cockroach,pid=38406,uid=1000
[ 8137.222716] Memory cgroup out of memory: Killed process 38406 (cockroach) total-vm:16397616kB, anon-rss:14540508kB, file-rss:32068kB, shmem-rss:0kB, UID:1000 pgtables:30780kB oom_score_adj:0
[ 8137.992502] oom_reaper: reaped process 38406 (cockroach), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@erikgrinaker erikgrinaker added the X-duplicate Closed as a duplicate of another issue. label Jan 4, 2023
@erikgrinaker erikgrinaker changed the title roachtest: splits/largerange/size=32GiB,nodes=6 failed roachtest: splits/largerange/size=32GiB,nodes=6 failed [OOM] Jan 4, 2023
@cockroach-teamcity

roachtest.splits/largerange/size=32GiB,nodes=6 failed with artifacts on master @ d1233022fc2f67a3901c0be08ec71abb13b1e8fe:

test artifacts and logs in: /artifacts/splits/largerange/size=32GiB_nodes=6/run_1
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_ssd=0


@kvoli

kvoli commented Jan 5, 2023

I ran this locally and wasn't able to reproduce it with a recent commit off master. Looking at the logs, the OOM occurs right after the last node joins the cluster. However, RSS for the OOM'd node (n1) was already at 13 GiB before that, so the problem seems to predate the join and is merely pushed over the edge by the additional metadata from the other nodes joining the cluster.

Jan 05 11:12:44 teamcity-8192493-1672899458-47-n6cpu4-0001 kernel: memory: usage 14594444kB, limit 14594444kB, failcnt 5004882

Rough timeline (the stats are from n1, which OOM'd):

11:12:32 runtime stats: 13 GiB RSS, 333 goroutines (stacks: 4.3 MiB), 5.4 GiB/9.5 GiB Go alloc/total(stale) (heap fragmentation: 166 MiB, heap reserved: 3.5 GiB, heap released: 139 MiB), 3.0 GiB/3.7 GiB CGO alloc/total (16092.2 CGO/sec), 333.6/26.8 %(u/s)time, 0.0 %gc (382x), 85 MiB/85 MiB (r/w)net

11:12:33 workload/workloadsql/dataload.go:147 [T1] 1 imported bank (33m42s, 132152839 rows)
11:12:33 cluster.go:678: test status: starting nodes :2-6
11:12:35 33 1@cli/start.go:1145 ⋮ [T1,n2] 59  node startup completed:
11:12:37 66 1@cli/start.go:1145 ⋮ [T1,n3] 62  node startup completed:
11:12:39 66 1@cli/start.go:1145 ⋮ [T1,n4] 63  node startup completed:
11:12:41 66 1@cli/start.go:1145 ⋮ [T1,n5] 68  node startup completed:

11:12:42 runtime stats: 13 GiB RSS, 485 goroutines (stacks: 6.3 MiB), 6.5 GiB/9.4 GiB Go alloc/total(stale) (heap fragmentation: 121 MiB, heap reserved: 2.3 GiB, heap released: 313 MiB), 3.2 GiB/3.8 GiB CGO alloc/total (25626.9 CGO/sec), 153.2/32.6 %(u/s)time, 0.0 %gc (382x), 8.9 MiB/502 MiB (r/w)net

I230105 11:12:43.591230 65 1@cli/start.go:1145 ⋮ [T1,n6] 69  node startup completed:
11:12:43 split.go:348: waiting for range with >= 3 replicas:
11:12:44 journalctl OOM
11:12:46 test_impl.go:347: test failure #1: (monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)

The last memory profile captured was from 10:56:55, when the most recent health log shows an RSS only 1 GiB below the RSS reported just before the failure. This makes me lean towards a SQL-related issue, given the profile below:

I230105 10:56:55 runtime stats: 12 GiB RSS, 355 goroutines (stacks: 5.6 MiB), 8.4 GiB/9.4 GiB Go alloc/total(stale) (heap fragmentation: 18 MiB, heap reserved: 496 MiB, heap released: 1.4 MiB), 2.0 GiB/2.4 GiB CGO alloc/total (7559.8 CGO/sec), 346.4/15.1 %(u/s)time, 0.0 %gc (265x), 112 MiB/112 MiB (r/w)net

(screenshot: heap profile, with large allocations attributed to toActiveQuery)

I think that corresponds to this line here

placeholders[i] = tree.AsStringWithFlags(q.placeholders.Values[i], tree.FmtSimple)

Does the above profile look normal, given a 132,152,839-row import (inserts) running on a single-node cluster w/ schema?

cc @cockroachdb/sql-queries @DrewKimball — if you have a moment, could you take a look at the heap pprof above and let me know whether that alloc, specifically in toActiveQuery(), looks normal?

@yuzefovich

This looks like the recently introduced active statements cache. cc @matthewtodd

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. and removed X-duplicate Closed as a duplicate of another issue. labels Jan 5, 2023
@kvoli

kvoli commented Jan 5, 2023

x-ref #94205 which also crashes due to OOM.

@matthewtodd

I've just sent out #94793 to revert the offending commit. We'll get it tightened up before reintroducing.

craig bot pushed a commit that referenced this issue Jan 6, 2023
94793: sql, server: revert new recent statements cache r=matthewtodd a=matthewtodd

This reverts #93270.

We've seen a number of roachtest and roachperf failures as a result of this work, so let's revert until we can address them.

Part of #86955

Addresses #94205
Addresses #94371
Addresses #94676

Release note: None

Co-authored-by: Matthew Todd <[email protected]>
@github-actions

github-actions bot commented Feb 6, 2023

We have marked this test failure issue as stale because it has been
inactive for 1 month. If this failure is still relevant, removing the
stale label or adding a comment will keep it active. Otherwise,
we'll close it in 5 days to keep the test failure queue tidy.
