
spl_kmem_cache() spinning #1454

Closed
behlendorf opened this issue May 9, 2013 · 6 comments

@behlendorf
Contributor

Observed while running mds-survey.

 file_count=1000000 dir_count=1 thrlo=128 thrhi=512 mds-survey
BUG: soft lockup - CPU#0 stuck for 67s! [spl_kmem_cache/:6268]
Pid: 6268, comm: spl_kmem_cache/ Tainted: P           ---------------    2.6.32-358.6.1.2chaos.ch5.1.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[<ffffffff81284e2d>]  [<ffffffff81284e2d>] __bitmap_empty+0x3d/0x90
RSP: 0018:ffff880325a9b4b0  EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff880325a9b4b0 RCX: 0000000000000018
RDX: 0000000000000000 RSI: 0000000000000018 RDI: ffffffff81e22898
RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
R10: ffff880028402700 R11: 0000000000000000 R12: 000000000044c000
R13: ffffffff8104c658 R14: ffff880325a9b460 R15: ffffffff810339ae
FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaaf5c750 CR3: 000000036626d000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process spl_kmem_cache/ (pid: 6268, threadinfo ffff880325a9a000, task ffff880325a7c040)
Stack:
 ffff880325a9b500 ffffffff8104c658 0000000000430000 ffffffff81e22880
<d> ffffffff81aa9100 ffff880638a60e40 0000000000430000 ffff880638a61108
<d> ffffea000b45f5d0 ffff880325a9b66c ffff880325a9b530 ffffffff8104c6d6
Call Trace:
 [<ffffffff8104c658>] ? flush_tlb_others_ipi+0x128/0x130
 [<ffffffff8104c6d6>] ? native_flush_tlb_others+0x76/0x90
 [<ffffffff8104c7fe>] ? flush_tlb_page+0x5e/0xb0
 [<ffffffff8104b740>] ? ptep_clear_flush_young+0x50/0x70
 [<ffffffff8114d134>] ? page_referenced_one+0xa4/0x1e0
 [<ffffffff8127b8e6>] ? prio_tree_next+0x216/0x250
 [<ffffffff8114db38>] ? page_referenced+0x148/0x360
 [<ffffffff8113201d>] ? shrink_page_list.clone.3+0x17d/0x650
 [<ffffffff81132ed3>] ? shrink_inactive_list+0x343/0x830
 [<ffffffff810572f0>] ? __dequeue_entity+0x30/0x50
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffffa08b5259>] ? cl_env_hops_keycmp+0x19/0x70 [obdclass]
 [<ffffffff8113376e>] ? shrink_mem_cgroup_zone+0x3ae/0x610
 [<ffffffff8117216d>] ? mem_cgroup_iter+0xfd/0x280
 [<ffffffff81133a33>] ? shrink_zone+0x63/0xb0
 [<ffffffff81133b95>] ? do_try_to_free_pages+0x115/0x610
 [<ffffffff81134262>] ? try_to_free_pages+0x92/0x120
 [<ffffffff8112bab8>] ? __alloc_pages_nodemask+0x478/0x8d0
 [<ffffffff8116054a>] ? alloc_pages_current+0xaa/0x110
 [<ffffffff811298ce>] ? __get_free_pages+0xe/0x50
 [<ffffffffa03b19f7>] ? kv_alloc+0x37/0x60 [spl]
 [<ffffffffa03b1a59>] ? spl_cache_grow_work+0x39/0x2d0 [spl]
 [<ffffffff81055ab3>] ? __wake_up+0x53/0x70 
 [<ffffffffa03b3277>] ? taskq_thread+0x1e7/0x3f0 [spl]
 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
 [<ffffffffa03b3090>] ? taskq_thread+0x0/0x3f0 [spl]
 [<ffffffff81096c76>] ? kthread+0x96/0xa0 
 [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
 [<ffffffff81096be0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
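
To make the trace easier to read: the spl_kmem_cache taskq thread is growing a slab (spl_cache_grow_work -> kv_alloc -> __get_free_pages), and that page allocation has fallen into direct reclaim, where it spins in page_referenced()/TLB-flush IPIs long enough to trip the soft-lockup watchdog. A minimal sketch of that allocation shape, assuming a GFP mask that permits blocking and direct reclaim (the function name is hypothetical, not the actual SPL source):

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /*
     * Hypothetical sketch, not the actual SPL kv_alloc() implementation.
     * GFP_KERNEL permits __alloc_pages_nodemask() to enter direct reclaim
     * (try_to_free_pages -> shrink_zone -> page_referenced), which is the
     * loop the soft-lockup trace above is stuck in.
     */
    static void *grow_slab_pages(size_t size)
    {
        return (void *)__get_free_pages(GFP_KERNEL, get_order(size));
    }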
@ryao
Contributor

ryao commented Jun 6, 2013

Did your system have no swap, swap on a zvol, or swap on a non-zvol device?

If this occurred when using swap on a zvol, ryao/spl@5717902 might fix this.

@behlendorf
Contributor Author

@ryao This occurred on the MDS in a Lustre+ZFS configuration. The node was running diskless and there was no swap device on the system (zvol or otherwise). I suspect Lustre may be to blame here by progressively consuming more memory and never providing a shrinker hook to release it. But I haven't investigated that yet.
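
For reference, the usual way a kernel-side cache advertises reclaimable memory is to register a shrinker. A minimal sketch of such a hook, with hypothetical lustre_* names (the callback signature shown is the single-callback shrink_control form used by roughly 3.0-3.11 kernels; the 2.6.32-era kernel in the trace above uses an older signature):

    #include <linux/mm.h>

    /* Hypothetical example; not actual Lustre code. */
    static int lustre_cache_shrink(struct shrinker *s,
                                   struct shrink_control *sc)
    {
        if (sc->nr_to_scan)
            lustre_cache_free(sc->nr_to_scan);  /* hypothetical helper */

        /* Return an estimate of how many objects remain freeable. */
        return lustre_cache_count();            /* hypothetical helper */
    }

    static struct shrinker lustre_shrinker = {
        .shrink = lustre_cache_shrink,
        .seeks  = DEFAULT_SEEKS,
    };

    static int __init lustre_cache_init(void)
    {
        register_shrinker(&lustre_shrinker);
        return 0;
    }

    static void __exit lustre_cache_exit(void)
    {
        unregister_shrinker(&lustre_shrinker);
    }

With a hook like this in place, direct reclaim (as in the trace above) could ask Lustre to give memory back instead of looping over pages it cannot free.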

Has your ryao/spl@5717902 patch been observed to improve things on systems with zvol swap? I can see how it might, but as you point out there is also a distinct possibility the system will stall.

@ryao
Contributor

ryao commented Jun 6, 2013

I did not control variables tightly enough when testing ryao/spl@5717902 earlier; it does not appear to have any effect. The additional issue I spotted might still be happening, but confirming that requires writing a patch for it and doing more testing.

@ryao
Contributor

ryao commented Jul 11, 2014

@behlendorf If you can still reproduce this, it would be helpful to profile this using perf and generate a flame graph like I described here:

#2240 (comment)

@ryao
Contributor

ryao commented Jul 16, 2014

@behlendorf I wrote some notes on how to use flame graphs on the Gentoo Wiki:

https://wiki.gentoo.org/wiki/ZFSOnLinux_Development_Guide#Flame_Graphs

@behlendorf
Contributor Author

I've been unable to reproduce this issue with the latest code. I'm going to close it out.
