
spl_kmem_cache() spinning #1454

Closed
behlendorf opened this issue May 9, 2013 · 6 comments

@behlendorf
Contributor

Observed while running mds-survey.

 file_count=1000000 dir_count=1 thrlo=128 thrhi=512 mds-survey
BUG: soft lockup - CPU#0 stuck for 67s! [spl_kmem_cache/:6268]
Pid: 6268, comm: spl_kmem_cache/ Tainted: P           ---------------    2.6.32-358.6.1.2chaos.ch5.1.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffff81284e2d>]  [<ffffffff81284e2d>] __bitmap_empty+0x3d/0x90
RSP: 0018:ffff880325a9b4b0  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff880325a9b4b0 RCX: 0000000000000018
RDX: 0000000000000000 RSI: 0000000000000018 RDI: ffffffff81e22898
RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
R10: ffff880028402700 R11: 0000000000000000 R12: 000000000044c000
R13: ffffffff8104c658 R14: ffff880325a9b460 R15: ffffffff810339ae
FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaaf5c750 CR3: 000000036626d000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process spl_kmem_cache/ (pid: 6268, threadinfo ffff880325a9a000, task ffff880325a7c040)
Stack:
 ffff880325a9b500 ffffffff8104c658 0000000000430000 ffffffff81e22880
<d> ffffffff81aa9100 ffff880638a60e40 0000000000430000 ffff880638a61108
<d> ffffea000b45f5d0 ffff880325a9b66c ffff880325a9b530 ffffffff8104c6d6
Call Trace:
 [<ffffffff8104c658>] ? flush_tlb_others_ipi+0x128/0x130
 [<ffffffff8104c6d6>] ? native_flush_tlb_others+0x76/0x90
 [<ffffffff8104c7fe>] ? flush_tlb_page+0x5e/0xb0
 [<ffffffff8104b740>] ? ptep_clear_flush_young+0x50/0x70
 [<ffffffff8114d134>] ? page_referenced_one+0xa4/0x1e0
 [<ffffffff8127b8e6>] ? prio_tree_next+0x216/0x250
 [<ffffffff8114db38>] ? page_referenced+0x148/0x360
 [<ffffffff8113201d>] ? shrink_page_list.clone.3+0x17d/0x650
 [<ffffffff81132ed3>] ? shrink_inactive_list+0x343/0x830
 [<ffffffff810572f0>] ? __dequeue_entity+0x30/0x50
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffffa08b5259>] ? cl_env_hops_keycmp+0x19/0x70 [obdclass]
 [<ffffffff8113376e>] ? shrink_mem_cgroup_zone+0x3ae/0x610
 [<ffffffff8117216d>] ? mem_cgroup_iter+0xfd/0x280
 [<ffffffff81133a33>] ? shrink_zone+0x63/0xb0
 [<ffffffff81133b95>] ? do_try_to_free_pages+0x115/0x610
 [<ffffffff81134262>] ? try_to_free_pages+0x92/0x120
 [<ffffffff8112bab8>] ? __alloc_pages_nodemask+0x478/0x8d0
 [<ffffffff8116054a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff811298ce>] ? __get_free_pages+0xe/0x50
 [<ffffffffa03b19f7>] ? kv_alloc+0x37/0x60 [spl]
 [<ffffffffa03b1a59>] ? spl_cache_grow_work+0x39/0x2d0 [spl]
 [<ffffffff81055ab3>] ? __wake_up+0x53/0x70 
 [<ffffffffa03b3277>] ? taskq_thread+0x1e7/0x3f0 [spl]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa03b3090>] ? taskq_thread+0x0/0x3f0 [spl]
 [<ffffffff81096c76>] ? kthread+0x96/0xa0 
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff81096be0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
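
To make the trace easier to read: the spl_kmem_cache taskq thread is growing a slab (spl_cache_grow_work -> kv_alloc -> __get_free_pages), and that page allocation has fallen into direct reclaim, where it spins in page_referenced()/TLB-flush IPIs long enough to trip the soft-lockup watchdog. A minimal sketch of that allocation shape, assuming a GFP mask that permits blocking and direct reclaim (the function name is hypothetical, not the actual SPL source):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /*
     * Hypothetical sketch, not the actual SPL kv_alloc() implementation.
     * GFP_KERNEL permits __alloc_pages_nodemask() to enter direct reclaim
     * (try_to_free_pages -> shrink_zone -> page_referenced), which is the
     * loop the soft-lockup trace above is stuck in.
     */
    static void *grow_slab_pages(size_t size)
    {
        return (void *)__get_free_pages(GFP_KERNEL, get_order(size));
    }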
@ryao
Contributor

ryao commented Jun 6, 2013

Did your system have no swap, swap on a zvol, or swap on a non-zvol device?

If this occurred when using swap on a zvol, ryao/spl@5717902 might fix this.

@behlendorf
Contributor Author

@ryao This occurred on the MDS in a Lustre+ZFS configuration. The node was running diskless and there was no swap device on the system (zvol or otherwise). I suspect Lustre may be to blame here by progressively consuming more memory and never providing a shrinker hook to release it. But I haven't investigated that yet.
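
For reference, the usual way a kernel-side cache advertises reclaimable memory is to register a shrinker. A minimal sketch of such a hook, with hypothetical lustre_* names (the callback signature shown is the single-callback shrink_control form used by roughly 3.0-3.11 kernels; the 2.6.32-era kernel in the trace above uses an older signature):

    #include <linux/mm.h>

    /* Hypothetical example; not actual Lustre code. */
    static int lustre_cache_shrink(struct shrinker *s,
                                   struct shrink_control *sc)
    {
        if (sc->nr_to_scan)
            lustre_cache_free(sc->nr_to_scan);  /* hypothetical helper */

        /* Return an estimate of how many objects remain freeable. */
        return lustre_cache_count();            /* hypothetical helper */
    }

    static struct shrinker lustre_shrinker = {
        .shrink = lustre_cache_shrink,
        .seeks  = DEFAULT_SEEKS,
    };

    static int __init lustre_cache_init(void)
    {
        register_shrinker(&lustre_shrinker);
        return 0;
    }

    static void __exit lustre_cache_exit(void)
    {
        unregister_shrinker(&lustre_shrinker);
    }

With a hook like this in place, direct reclaim (as in the trace above) could ask Lustre to give memory back instead of looping over pages it cannot free.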

Has your ryao/spl@5717902 patch been observed to improve things on systems with zvol swap? I can see how it might, but as you point out there is also a distinct possibility the system will stall.

@ryao
Contributor

ryao commented Jun 6, 2013

I did not control variables tightly enough when testing ryao/spl@5717902 earlier; it does not appear to have any effect. The additional issue I spotted might still be happening, but confirming that requires writing a patch for it and doing more testing.

@ryao
Contributor

ryao commented Jul 11, 2014

@behlendorf If you can still reproduce this, it would be helpful to profile this using perf and generate a flame graph like I described here:

#2240 (comment)

@ryao
Contributor

ryao commented Jul 16, 2014

@behlendorf I wrote some notes on how to use flame graphs on the Gentoo Wiki:

https://wiki.gentoo.org/wiki/ZFSOnLinux_Development_Guide#Flame_Graphs

@behlendorf
Contributor Author

I've been unable to reproduce this issue with the latest code. I'm going to close it out.
