ZFS locks system under memory pressure #266

Closed
kkoning opened this issue Jun 8, 2011 · 14 comments

@kkoning

kkoning commented Jun 8, 2011

Under memory pressure, with no swap configured (24G physical RAM):

Jun 7 14:54:50 atlantis kernel: [838087.621465] INFO: task kswapd0:94 blocked for more than 120 seconds.
Jun 7 14:54:50 atlantis kernel: [838087.621482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 7 14:54:50 atlantis kernel: [838087.621488] kswapd0 D 0000000000000001 0 94 2 0x00000000
Jun 7 14:54:50 atlantis kernel: [838087.621490] ffff8803242dd7b0 0000000000000046 ffff8803242ddfd8 ffff8803242dc000
Jun 7 14:54:50 atlantis kernel: [838087.621493] 0000000000013d00 ffff8803242e03b8 ffff8803242ddfd8 0000000000013d00
Jun 7 14:54:50 atlantis kernel: [838087.621495] ffff8803270a5b80 ffff8803242e0000 ffff8803242dd7c0 ffff88063fff87b8
Jun 7 14:54:50 atlantis kernel: [838087.621497] Call Trace:
Jun 7 14:54:50 atlantis kernel: [838087.621504] [] __wait_on_freeing_inode+0x9d/0xd0
Jun 7 14:54:50 atlantis kernel: [838087.621508] [] ? wake_bit_function+0x0/0x50
Jun 7 14:54:50 atlantis kernel: [838087.621511] [] ? default_spin_lock_flags+0x9/0x10
Jun 7 14:54:50 atlantis kernel: [838087.621513] [] find_inode_fast+0x2b/0x90
Jun 7 14:54:50 atlantis kernel: [838087.621514] [] ifind_fast+0x3c/0xa0
Jun 7 14:54:50 atlantis kernel: [838087.621516] [] ilookup+0x52/0x60
Jun 7 14:54:50 atlantis kernel: [838087.621540] [] zfs_zget+0xbb/0x200 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621554] [] ? txg_list_add+0x5d/0x70 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621567] [] zfs_purgedir+0xc4/0x220 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621570] [] ? __mod_zone_page_state+0x49/0x50
Jun 7 14:54:50 atlantis kernel: [838087.621572] [] ? __free_pages+0x2d/0x40
Jun 7 14:54:50 atlantis kernel: [838087.621575] [] ? __free_slab+0xd5/0x190
Jun 7 14:54:50 atlantis kernel: [838087.621577] [] ? flush_tlb_page+0x48/0xb0
Jun 7 14:54:50 atlantis kernel: [838087.621580] [] ? cpumask_next_and+0x36/0x50
Jun 7 14:54:50 atlantis kernel: [838087.621588] [] ? dbuf_rele_and_unlock+0x1a9/0x260 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621596] [] ? dmu_buf_rele+0x30/0x40 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621606] [] ? dnode_rele+0x80/0x90 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621620] [] zfs_rmnode+0x202/0x310 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621634] [] zfs_zinactive+0xa4/0x120 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621647] [] zfs_inactive+0x5e/0x1b0 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621661] [] zpl_evict_inode+0x2f/0x40 [zfs]
Jun 7 14:54:50 atlantis kernel: [838087.621663] [] evict+0x24/0xc0
Jun 7 14:54:50 atlantis kernel: [838087.621664] [] dispose_list+0x59/0xf0
Jun 7 14:54:50 atlantis kernel: [838087.621666] [] prune_icache+0x150/0x2f0
Jun 7 14:54:50 atlantis kernel: [838087.621667] [] ? prune_dcache+0x159/0x190
Jun 7 14:54:50 atlantis kernel: [838087.621669] [] shrink_icache_memory+0x55/0x60
Jun 7 14:54:50 atlantis kernel: [838087.621673] [] shrink_slab+0x11c/0x180
Jun 7 14:54:50 atlantis kernel: [838087.621674] [] balance_pgdat+0x2d6/0x6d0
Jun 7 14:54:50 atlantis kernel: [838087.621676] [] kswapd+0x143/0x1b0
Jun 7 14:54:50 atlantis kernel: [838087.621678] [] ? kswapd+0x0/0x1b0
Jun 7 14:54:50 atlantis kernel: [838087.621679] [] kthread+0x96/0xa0
Jun 7 14:54:50 atlantis kernel: [838087.621683] [] kernel_thread_helper+0x4/0x10
Jun 7 14:54:50 atlantis kernel: [838087.621684] [] ? kthread+0x0/0xa0
Jun 7 14:54:50 atlantis kernel: [838087.621686] [] ? kernel_thread_helper+0x0/0x10
Jun 7 14:54:50 atlantis kernel: [838087.621687] INFO: task kswapd1:95 blocked for more than 120 seconds.
...

@behlendorf
Contributor

Interesting stack! Are you able to consistently reproduce this, or has it been a one-time event? On the surface this looks like a pretty unlikely (but clearly possible) deadlock.

@kkoning
Author

kkoning commented Jun 8, 2011

I was able to obtain similar behavior again under the same conditions (processes freezing, ssh sessions unresponsive for >60s, kernel traces...), but the exact messages in syslog were different. It's a little longer this time, so here's an edited syslog.

Jun 8 15:37:04 atlantis kernel: [76792.638451] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9220082879129529129
Jun 8 15:37:15 atlantis kernel: [76793.638038] BUG: unable to handle kernel NULL pointer dereference at (null)
Jun 8 15:37:15 atlantis kernel: [76793.638057] IP: [] balance_pgdat+0x2c6/0x6d0
Jun 8 15:37:15 atlantis kernel: [76793.638068] PGD 0
Jun 8 15:37:15 atlantis kernel: [76793.638072] Oops: 0002 [#1] SMP
Jun 8 15:37:15 atlantis kernel: [76793.638078] last sysfs file: /sys/devices/system/cpu/cpu23/cache/index2/shared_cpu_map
Jun 8 15:37:15 atlantis kernel: [76793.638085] CPU 0
Jun 8 15:37:15 atlantis kernel: [76793.638087] Modules linked in: binfmt_misc zfs(P) zcommon(P) znvpair(P) zavl(P) zunicode(P) spl zlib_deflate ip6table_filter ip6_tables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp kvm_intel kvm nouveau ttm i7core_edac edac_core drm_kms_helper joydev psmouse serio_raw drm i2c_algo_bit video xhci_hcd f71882fg coretemp lp parport mptsas mptscsih mptbase hid_microsoft usbhid hid ahci pata_jmicron scsi_transport_sas sky2 libahci
Jun 8 15:37:15 atlantis kernel: [76793.638194]
Jun 8 15:37:15 atlantis kernel: [76793.638197] Pid: 94, comm: kswapd0 Tainted: P 2.6.38-8-server #42-Ubuntu To Be Filled By O.E.M. To Be Filled By O.E.M./EVGA Classified SR-2
Jun 8 15:37:15 atlantis kernel: [76793.638223] RIP: 0010:[] [] balance_pgdat+0x2c6/0x6d0
Jun 8 15:37:15 atlantis kernel: [76793.638233] RSP: 0000:ffff880324711da0 EFLAGS: 00010282
Jun 8 15:37:15 atlantis kernel: [76793.638238] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 0000000000000000
Jun 8 15:37:15 atlantis kernel: [76793.638243] RDX: 0000000000000069 RSI: 00000000000000d0 RDI: 0000000000000d7d
Jun 8 15:37:15 atlantis kernel: [76793.638249] RBP: ffff880324711e90 R08: 0000000000000000 R09: 0000000000000100
Jun 8 15:37:15 atlantis kernel: [76793.638255] R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000
Jun 8 15:37:15 atlantis kernel: [76793.638261] R13: ffff88033fffb000 R14: 0000000000000002 R15: ffff88033fffb000
Jun 8 15:37:15 atlantis kernel: [76793.638267] FS: 0000000000000000(0000) GS:ffff8800bf400000(0000) knlGS:0000000000000000
Jun 8 15:37:15 atlantis kernel: [76793.638273] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 8 15:37:15 atlantis kernel: [76793.638278] CR2: 0000000000000000 CR3: 0000000001a03000 CR4: 00000000000006f0
Jun 8 15:37:15 atlantis kernel: [76793.638284] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 8 15:37:15 atlantis kernel: [76793.638290] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 8 15:37:15 atlantis kernel: [76793.638296] Process kswapd0 (pid: 94, threadinfo ffff880324710000, task ffff880324708000)
Jun 8 15:37:15 atlantis kernel: [76793.638304] Stack:
Jun 8 15:37:15 atlantis kernel: [76793.638307] ffff88033fffbe00 ffff880324708000 0000000000000274 ffff880324711e44
Jun 8 15:37:15 atlantis kernel: [76793.638317] ffff880324711eac 0000000000000000 0000000100000000 0000000000000069
Jun 8 15:37:15 atlantis kernel: [76793.638327] 0000000000000000 ffff880324711e50 0000000000000000 000000003fffb000
Jun 8 15:37:15 atlantis kernel: [76793.638337] Call Trace:
Jun 8 15:37:15 atlantis kernel: [76793.638342] [] kswapd+0x143/0x1b0
Jun 8 15:37:15 atlantis kernel: [76793.638347] [] ? kswapd+0x0/0x1b0
Jun 8 15:37:15 atlantis kernel: [76793.638354] [] kthread+0x96/0xa0
Jun 8 15:37:15 atlantis kernel: [76793.638361] [] kernel_thread_helper+0x4/0x10
Jun 8 15:37:15 atlantis kernel: [76793.638366] [] ? kthread+0x0/0xa0
Jun 8 15:37:15 atlantis kernel: [76793.638371] [] ? kernel_thread_helper+0x0/0x10
Jun 8 15:37:15 atlantis kernel: [76793.638377] Code: c0 44 89 f1 4c 89 ff 48 c1 e2 03 e8 55 1b ff ff 84 c0 0f 84 d5 02 00 00 48 8b 8d 50 ff ff ff 48 8b 95 48 ff ff ff be d0 00 00 00 <48> c7 01 00 00 00 00 48 8b 7d 80 e8 fa cf ff ff 48 8b 8d 50 ff
Jun 8 15:37:15 atlantis kernel: [76793.638463] RIP [] balance_pgdat+0x2c6/0x6d0
Jun 8 15:37:15 atlantis kernel: [76793.638470] RSP
Jun 8 15:37:15 atlantis kernel: [76793.638474] CR2: 0000000000000000
Jun 8 15:37:15 atlantis kernel: [76793.648478] [drm] nouveau 0000:08:00.0: Setting dpms mode 0 on vga encoder (output 0)
Jun 8 15:37:15 atlantis kernel: [76793.653405] ---[ end trace 63ad9205b781d77b ]---
Jun 8 15:37:15 atlantis kernel: [76793.711542] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9201033690501981231
Jun 8 15:37:15 atlantis kernel: [76794.976611] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9218479331087539336
Jun 8 15:37:15 atlantis kernel: [76795.713710] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9168056624794443474
Jun 8 15:37:15 atlantis kernel: [76796.922008] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9127193401103567945
Jun 8 15:37:15 atlantis kernel: [76797.094878] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9124781767612430244
Jun 8 15:37:15 atlantis kernel: [76797.858908] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9045585419729552005
[>100 similar lines]

Then there are messages that seem to relate to other processes that were dying...

Jun 8 15:42:29 atlantis kernel: [77066.595577] oom_kill_process: 27 callbacks suppressed
Jun 8 15:42:29 atlantis kernel: [77066.595773] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jun 8 15:42:29 atlantis kernel: [77066.595972] java cpuset=/ mems_allowed=0-1
Jun 8 15:42:29 atlantis kernel: [77066.596593] Pid: 9041, comm: java Tainted: P D 2.6.38-8-server #42-Ubuntu
Jun 8 15:42:29 atlantis kernel: [77066.597478] Call Trace:
Jun 8 15:42:29 atlantis kernel: [77066.598412] [] ? cpuset_print_task_mems_allowed+0x9d/0xb0
Jun 8 15:42:29 atlantis kernel: [77066.600224] [] ? dump_header+0x91/0x1e0
Jun 8 15:42:29 atlantis kernel: [77066.601161] [] ? oom_kill_process+0x8d/0x190
Jun 8 15:42:29 atlantis kernel: [77066.602136] [] ? out_of_memory+0x102/0x240
Jun 8 15:42:29 atlantis kernel: [77066.603105] [] ? __alloc_pages_nodemask+0x7b9/0x830
Jun 8 15:42:29 atlantis kernel: [77066.604090] [] ? alloc_pages_current+0xa5/0x110
Jun 8 15:42:29 atlantis kernel: [77066.605068] [] ? __page_cache_alloc+0x8f/0xa0
Jun 8 15:42:29 atlantis kernel: [77066.606047] [] ? __do_page_cache_readahead+0xd5/0x220
Jun 8 15:42:29 atlantis kernel: [77066.607036] [] ? ra_submit+0x21/0x30
Jun 8 15:42:29 atlantis kernel: [77066.608027] [] ? filemap_fault+0x46e/0x490
Jun 8 15:42:29 atlantis kernel: [77066.609023] [] ? __do_fault+0x54/0x520
Jun 8 15:42:29 atlantis kernel: [77066.610018] [] ? handle_pte_fault+0xfa/0x210
Jun 8 15:42:29 atlantis kernel: [77066.610020] [] ? __free_pages+0x2d/0x40
Jun 8 15:42:29 atlantis kernel: [77066.610022] [] ? __pte_alloc+0xdf/0x100
Jun 8 15:42:29 atlantis kernel: [77066.610024] [] ? handle_mm_fault+0x16d/0x250
Jun 8 15:42:29 atlantis kernel: [77066.610029] [] ? do_page_fault+0x1a2/0x540
Jun 8 15:42:29 atlantis kernel: [77066.610034] [] ? __switch_to+0x157/0x2f0
Jun 8 15:42:29 atlantis kernel: [77066.610039] [] ? finish_task_switch+0x41/0xe0
Jun 8 15:42:29 atlantis kernel: [77066.610042] [] ? schedule+0x3ec/0x760
Jun 8 15:42:29 atlantis kernel: [77066.610044] [] ? page_fault+0x25/0x30
Jun 8 15:42:29 atlantis kernel: [77066.610046] Mem-Info:
Jun 8 15:42:29 atlantis kernel: [77066.610047] Node 0 DMA per-cpu:
Jun 8 15:42:29 atlantis kernel: [77066.610049] CPU 0: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610050] CPU 1: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610052] CPU 2: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610053] CPU 3: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610054] CPU 4: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610055] CPU 5: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610057] CPU 6: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610058] CPU 7: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610059] CPU 8: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610060] CPU 9: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610062] CPU 10: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610063] CPU 11: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610064] CPU 12: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610066] CPU 13: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.610067] CPU 14: hi: 0, btch: 1 usd: 0
[etc...]

Jun 8 15:42:29 atlantis kernel: [77066.610183] active_anon:3702813 inactive_anon:640 isolated_anon:0
Jun 8 15:42:29 atlantis kernel: [77066.610184] active_file:7 inactive_file:68 isolated_file:0
Jun 8 15:42:29 atlantis kernel: [77066.610185] unevictable:27 dirty:0 writeback:0 unstable:0
Jun 8 15:42:29 atlantis kernel: [77066.610185] free:18030 slab_reclaimable:3451 slab_unreclaimable:92128
Jun 8 15:42:29 atlantis kernel: [77066.610187] mapped:532 shmem:1005 pagetables:10899 bounce:0
Jun 8 15:42:29 atlantis kernel: [77066.610188] Node 0 DMA free:15840kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15664kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 8 15:42:29 atlantis kernel: [77066.610195] lowmem_reserve[]: 0 2991 12081 12081
Jun 8 15:42:29 atlantis kernel: [77066.610197] Node 0 DMA32 free:38828kB min:2460kB low:3072kB high:3688kB active_anon:2996280kB inactive_anon:0kB active_file:0kB inactive_file:16kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3063584kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:32kB slab_unreclaimable:320kB kernel_stack:216kB pagetables:6068kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:40 all_unreclaimable? no
Jun 8 15:42:29 atlantis kernel: [77066.610204] lowmem_reserve[]: 0 0 9090 9090
Jun 8 15:42:29 atlantis kernel: [77066.610206] Node 0 Normal free:7480kB min:7472kB low:9340kB high:11208kB active_anon:4263592kB inactive_anon:2096kB active_file:8kB inactive_file:128kB unevictable:56kB isolated(anon):0kB isolated(file):0kB present:9308160kB mlocked:0kB dirty:0kB writeback:0kB mapped:1972kB shmem:2792kB slab_reclaimable:7228kB slab_unreclaimable:143004kB kernel_stack:6608kB pagetables:16348kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:704 all_unreclaimable? no
Jun 8 15:42:29 atlantis kernel: [77066.610213] lowmem_reserve[]: 0 0 0 0
Jun 8 15:42:29 atlantis kernel: [77066.610215] Node 1 Normal free:9972kB min:9964kB low:12452kB high:14944kB active_anon:7551380kB inactive_anon:464kB active_file:20kB inactive_file:128kB unevictable:52kB isolated(anon):0kB isolated(file):0kB present:12410880kB mlocked:0kB dirty:0kB writeback:0kB mapped:156kB shmem:1228kB slab_reclaimable:6544kB slab_unreclaimable:225172kB kernel_stack:1528kB pagetables:21180kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:384 all_unreclaimable? no
Jun 8 15:42:29 atlantis kernel: [77066.610222] lowmem_reserve[]: 0 0 0 0
Jun 8 15:42:29 atlantis kernel: [77066.610224] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15840kB
Jun 8 15:42:29 atlantis kernel: [77066.610229] Node 0 DMA32: 49*4kB 59*8kB 43*16kB 39*32kB 18*64kB 12*128kB 11*256kB 6*512kB 5*1024kB 5*2048kB 3*4096kB = 38828kB
Jun 8 15:42:29 atlantis kernel: [77066.610235] Node 0 Normal: 864*4kB 5*8kB 3*16kB 3*32kB 2*64kB 3*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 7480kB
Jun 8 15:42:29 atlantis kernel: [77066.610240] Node 1 Normal: 1615*4kB 9*8kB 3*16kB 2*32kB 2*64kB 3*128kB 3*256kB 2*512kB 1*1024kB 0*2048kB 0*4096kB = 9972kB
Jun 8 15:42:29 atlantis kernel: [77066.610245] 1151 total pagecache pages
Jun 8 15:42:29 atlantis kernel: [77066.610246] 0 pages in swap cache
Jun 8 15:42:29 atlantis kernel: [77066.610248] Swap cache stats: add 0, delete 0, find 0/0
Jun 8 15:42:29 atlantis kernel: [77066.610249] Free swap = 0kB
Jun 8 15:42:29 atlantis kernel: [77066.610249] Total swap = 0kB
Jun 8 15:42:29 atlantis kernel: [77066.683017] 6291440 pages RAM
Jun 8 15:42:29 atlantis kernel: [77066.683018] 108302 pages reserved
Jun 8 15:42:29 atlantis kernel: [77066.683019] 16638 pages shared
Jun 8 15:42:29 atlantis kernel: [77066.683020] 6091348 pages non-shared
Jun 8 15:42:29 atlantis kernel: [77066.683022] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Jun 8 15:42:29 atlantis kernel: [77066.683044] [ 416] 0 416 4263 44 6 0 0 upstart-udev-br
Jun 8 15:42:29 atlantis kernel: [77066.683047] [ 446] 0 446 5347 147 1 -17 -1000 udevd
Jun 8 15:42:29 atlantis kernel: [77066.683050] [ 449] 0 449 22906 304 12 0 0 smbd
Jun 8 15:42:29 atlantis kernel: [77066.683053] [ 468] 101 468 46386 169 14 0 0 rsyslogd
Jun 8 15:42:29 atlantis kernel: [77066.683057] [ 803] 109 803 6003 124 8 0 0 dbus-daemon
Jun 8 15:42:29 atlantis kernel: [77066.683059] [ 846] 110 846 8063 110 20 0 0 avahi-daemon
Jun 8 15:42:29 atlantis kernel: [77066.683062] [ 863] 110 863 8002 64 0 0 0 avahi-daemon
Jun 8 15:42:29 atlantis kernel: [77066.683064] [ 872] 0 872 3751 53 19 0 0 upstart-socket-
Jun 8 15:42:29 atlantis kernel: [77066.683067] [ 962] 0 962 1549 28 2 0 0 getty
Jun 8 15:42:29 atlantis kernel: [77066.683069] [ 968] 0 968 1549 28 11 0 0 getty
Jun 8 15:42:29 atlantis kernel: [77066.683071] [ 975] 0 975 22906 292 20 0 0 smbd
Jun 8 15:42:29 atlantis kernel: [77066.683074] [ 977] 0 977 1549 27 7 0 0 getty
Jun 8 15:42:29 atlantis kernel: [77066.683077] [ 979] 0 979 1549 27 9 0 0 getty
Jun 8 15:42:29 atlantis kernel: [77066.683079] [ 985] 0 985 1549 28 8 0 0 getty
Jun 8 15:42:29 atlantis kernel: [77066.683083] [ 1008] 102 1008 42482 4342 6 0 0 mysqld
[etc...]
Jun 8 15:42:29 atlantis kernel: [77066.683361] [10937] 0 10937 4732 59 4 0 0 cron
Jun 8 15:42:29 atlantis kernel: [77066.683363] Out of memory: Kill process 9024 (java) score 335 or sacrifice child
Jun 8 15:42:29 atlantis kernel: [77066.683372] Killed process 9024 (java) total-vm:21908124kB, anon-rss:8304248kB, file-rss:0kB
Jun 8 15:42:29 atlantis kernel: [77066.731852] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jun 8 15:42:29 atlantis kernel: [77066.731855] java cpuset=/ mems_allowed=0-1
Jun 8 15:42:29 atlantis kernel: [77066.731857] Pid: 2930, comm: java Tainted: P D 2.6.38-8-server #42-Ubuntu
Jun 8 15:42:29 atlantis kernel: [77066.731858] Call Trace:
Jun 8 15:42:29 atlantis kernel: [77066.731865] [] ? cpuset_print_task_mems_allowed+0x9d/0xb0
Jun 8 15:42:29 atlantis kernel: [77066.731868] [] ? dump_header+0x91/0x1e0
Jun 8 15:42:29 atlantis kernel: [77066.731872] [] ? security_real_capable_noaudit+0x3c/0x70
Jun 8 15:42:29 atlantis kernel: [77066.731875] [] ? oom_kill_process+0x8d/0x190
Jun 8 15:42:29 atlantis kernel: [77066.731876] [] ? out_of_memory+0x102/0x240
Jun 8 15:42:29 atlantis kernel: [77066.731879] [] ? __alloc_pages_nodemask+0x7b9/0x830
Jun 8 15:42:29 atlantis kernel: [77066.731884] [] ? __switch_to+0x157/0x2f0
Jun 8 15:42:29 atlantis kernel: [77066.731888] [] ? alloc_pages_current+0xa5/0x110
Jun 8 15:42:29 atlantis kernel: [77066.731891] [] ? __page_cache_alloc+0x8f/0xa0
Jun 8 15:42:29 atlantis kernel: [77066.731894] [] ? __do_page_cache_readahead+0xd5/0x220
Jun 8 15:42:29 atlantis kernel: [77066.731896] [] ? ra_submit+0x21/0x30
Jun 8 15:42:29 atlantis kernel: [77066.731899] [] ? filemap_fault+0x46e/0x490
Jun 8 15:42:29 atlantis kernel: [77066.731901] [] ? __do_fault+0x54/0x520
Jun 8 15:42:29 atlantis kernel: [77066.731903] [] ? handle_pte_fault+0xfa/0x210
Jun 8 15:42:29 atlantis kernel: [77066.731905] [] ? __free_pages+0x2d/0x40
Jun 8 15:42:29 atlantis kernel: [77066.731907] [] ? __pte_alloc+0xdf/0x100
Jun 8 15:42:29 atlantis kernel: [77066.731909] [] ? handle_mm_fault+0x16d/0x250
Jun 8 15:42:29 atlantis kernel: [77066.731913] [] ? do_page_fault+0x1a2/0x540
Jun 8 15:42:29 atlantis kernel: [77066.731915] [] ? __switch_to+0x157/0x2f0
Jun 8 15:42:29 atlantis kernel: [77066.731919] [] ? finish_task_switch+0x41/0xe0
Jun 8 15:42:29 atlantis kernel: [77066.731921] [] ? schedule+0x3ec/0x760
Jun 8 15:42:29 atlantis kernel: [77066.731923] [] ? page_fault+0x25/0x30
Jun 8 15:42:29 atlantis kernel: [77066.731924] Mem-Info:
Jun 8 15:42:29 atlantis kernel: [77066.731925] Node 0 DMA per-cpu:
Jun 8 15:42:29 atlantis kernel: [77066.731927] CPU 0: hi: 0, btch: 1 usd: 0
Jun 8 15:42:29 atlantis kernel: [77066.731929] CPU 1: hi: 0, btch: 1 usd: 0
[etc...]

Then there are several more large blocks following the same pattern for other processes, and then:

Jun 8 15:42:29 atlantis kernel: [77067.341948] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9162052222399601381
Jun 8 15:42:29 atlantis kernel: [77068.008235] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-7452290219319652215
Jun 8 15:42:29 atlantis kernel: [77068.591869] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9037515879784403491
Jun 8 15:42:29 atlantis kernel: [77069.070157] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9079891129197979364
Jun 8 15:42:29 atlantis kernel: [77069.651205] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9113202630456003311
Jun 8 15:42:29 atlantis kernel: [77070.348335] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-6290848699416128387
Jun 8 15:42:29 atlantis kernel: [77071.325323] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9093663112402461882
Jun 8 15:42:29 atlantis kernel: [77071.791675] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-1591817540979419011
Jun 8 15:42:29 atlantis kernel: [77072.462885] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-8993861594725407407
Jun 8 15:42:29 atlantis kernel: [77073.021273] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-8998373508622712020
Jun 8 15:42:29 atlantis kernel: [77073.530527] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9050062279031931269
Jun 8 15:42:29 atlantis kernel: [77074.139045] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9193301788445700848
Jun 8 15:42:29 atlantis kernel: [77074.634274] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9077472171237667047
Jun 8 15:42:29 atlantis kernel: [77075.252273] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9072678754481356111
Jun 8 15:42:29 atlantis kernel: [77075.821889] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9164829248933970879
Jun 8 15:42:29 atlantis kernel: [77076.326774] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9145887458222169952
Jun 8 15:42:29 atlantis kernel: [77076.866557] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9215976985087278887
Jun 8 15:42:29 atlantis kernel: [77077.350813] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9196285618058291576
Jun 8 15:42:29 atlantis kernel: [77077.895865] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9194038098953776933
Jun 8 15:42:29 atlantis kernel: [77078.418489] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9046795281052673497
Jun 8 15:42:29 atlantis kernel: [77079.009315] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9059008034126327778
Jun 8 15:42:29 atlantis kernel: [77079.529693] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9051327894404411558
Jun 8 15:42:29 atlantis kernel: [77080.127355] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9216434879940409432
Jun 8 15:42:29 atlantis kernel: [77080.717459] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9034018631730062956
Jun 8 15:42:29 atlantis kernel: [77081.337473] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-8772202894922101310
Jun 8 15:42:29 atlantis kernel: [77081.900259] shrink_slab: arc_shrinker_func+0x0/0xc0 [zfs] negative objects to delete nr=-9102259723032889908

That's all before I forced a reboot.

@behlendorf
Contributor

The second issue here, the negative objects, is issue #218, which was fixed after 0.6.0-rc4. If you pull the latest master source you shouldn't be able to recreate that issue. There were also a couple of deadlocks fixed, but none that looked like your first reported issue.

@gunnarbeutner
Contributor

The problem here is that prune_icache() tries to evict/delete both the xattr directory inode as well as at least one xattr inode contained in that directory. Here's what happens:

  1. File is created.
  2. xattr is created for that file (behind the scenes a xattr directory and a file in that xattr directory are created)
  3. File is deleted.
  4. Both the xattr directory inode and at least one xattr inode from that directory are evicted by prune_icache(); prune_icache() acquires a lock on both inodes before it calls ->evict() on the inodes

When the xattr directory inode is evicted, zfs_zinactive() attempts to delete the xattr files contained in that directory. While enumerating these files, zfs_zget() is called to obtain a reference to the xattr file znode, which tries to lock the xattr inode. However, that very same xattr inode was already locked by prune_icache() further up the call stack, thus leading to a deadlock.
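To make the collision concrete, the two halves of the deadlock can be read straight off the kswapd0 stack trace in the first report; the sketch below only restates that trace as a C-style comment, with the I_FREEING detail taken from the discussion further down.

/*
 * Deadlock shape (reconstructed from the kswapd0 trace above):
 *
 *   prune_icache()                marks BOTH the xattr dir inode and the
 *     dispose_list()              xattr file inode I_FREEING, then evicts
 *       evict()                   them one at a time
 *         zpl_evict_inode()       <-- currently evicting the xattr DIRECTORY
 *           zfs_inactive()
 *             zfs_zinactive()
 *               zfs_rmnode()
 *                 zfs_purgedir()             walks the xattr directory
 *                   zfs_zget()               wants the xattr FILE's inode
 *                     ilookup()
 *                       __wait_on_freeing_inode()
 *
 * The wait never completes: the xattr file inode being waited on is queued
 * for eviction later in the same dispose_list() pass, on the same thread.
 */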

This can be reliably reproduced like this:

# touch test
# attr -s a -V b test
# rm test
# echo 3 > /proc/sys/vm/drop_caches

What this means is that we must not call zfs_zget() from zfs_zinactive(). One way of doing that would be to defer the zfs_purgedir() call until after the inodes are evicted.

Also, #273 is most likely a duplicate of this bug.

behlendorf added a commit that referenced this issue Jun 15, 2011
Gunnar Beutner nicely summarized the root cause of this issue
in a comment for Github issue #266.

The problem here is that prune_icache() tries to evict/delete
both the xattr directory inode as well as at least one xattr
inode contained in that directory. Here's what happens:

1. File is created.
2. xattr is created for that file (behind the scenes a xattr
   directory and a file in that xattr directory are created)
3. File is deleted.
4. Both the xattr directory inode and at least one xattr
   inode from that directory are evicted by prune_icache();
   prune_icache() acquires a lock on both inodes before it
   calls ->evict() on the inodes

When the xattr directory inode is evicted zfs_zinactive attempts
to delete the xattr files contained in that directory. While
enumerating these files zfs_zget() is called to obtain a reference
to the xattr file znode - which tries to lock the xattr inode.
However that very same xattr inode was already locked by
prune_icache() further up the call stack, thus leading to a
deadlock.

This can be reliably reproduced like this:

---

This patch proposes to fix the issue by deferring the call to
zfs_rmnode() until after the inodes are evicted.  This is
accomplished by dispatching the zfs_rmnode() call to the existing
zfs_iput_taskq.  There is no requirement that it must complete
before zfs_zinactive() returns.  The inode is already unlinked
from the namespace and inactive so this should be safe.

Issue #266
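For readers unfamiliar with the SPL taskq interface, the deferral described in this commit message boils down to something like the sketch below. taskq_dispatch() and TQ_SLEEP are the real SPL primitives and zfs_iput_taskq is the queue named in the message; the wrapper itself is illustrative, not an excerpt of the patch, and Gunnar's next comment explains why handing over the znode pointer turned out to be unsafe.

/* Illustrative sketch only, not the actual patch. */
static void
zfs_rmnode_cb(void *arg)
{
	znode_t *zp = arg;

	zfs_rmnode(zp);		/* now runs from taskq context, not evict() */
}

	/* ...in zfs_zinactive(), instead of calling zfs_rmnode(zp) directly: */
	(void) taskq_dispatch(zfs_iput_taskq, zfs_rmnode_cb, zp, TQ_SLEEP);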
@behlendorf
Contributor

Nice job! I agree, that's exactly the problem. I was able to easily recreate the issue with your test case.

As for the fix, we're definitely going to have to move the zfs_zget() call out of the zfs_inactive() call path to avoid the deadlock. However, it doesn't look like deferring zfs_purgedir() goes far enough; we might hit a similar deadlock farther down in zfs_rmnode(), where zfs_zget() is called on the xattr_obj directory.

But from my reading of the code, it should be safe to defer the entire zfs_rmnode() call. The only caller of zfs_rmnode() is zfs_zinactive(), and it is only called in the unlink case. At this point everything is already unlinked from the namespace, and the only remaining references on the inodes are held by ZFS. Additionally, all of this code properly adds the unlinked objects to the unlinked set, so they should be handled correctly in the case of a system crash.

Gunnar, what do you think about this proposed fix, 6e34ce7? With this change in place I'm unable to recreate the issue, nor have I seen any unexpected side effects.

@gunnarbeutner
Contributor

After applying the patch I'm seeing the following crash:

[ 104.893196] SPLError: 5275:0:(zfs_znode.c:237:zfs_znode_dmu_fini()) ASSERTION(MUTEX_HELD(ZFS_OBJ_MUTEX(ZTOZSB(zp), zp->z_id)) || zp->z_unlinked || RW_WRITE_HELD(&ZTOZSB(zp)->z_teardown_inactive_lock)) failed
[ 104.896252] SPLError: 5275:0:(zfs_znode.c:237:zfs_znode_dmu_fini()) SPL PANIC
[ 104.896254] SPL: Showing stack for process 5275
[ 104.896256] Pid: 5275, comm: zfs_iput_taskq/ Tainted: P 2.6.38-8-server #42-Ubuntu
[ 104.896258] Call Trace:
[ 104.896267] [] ? spl_debug_dumpstack+0x27/0x40 [spl]
[ 104.896271] [] ? spl_debug_bug+0x81/0xe0 [spl]
[ 104.896298] [] ? zfs_znode_dmu_fini+0xfb/0x130 [zfs]
[ 104.896319] [] ? zfs_rmnode+0x238/0x450 [zfs]
[ 104.896329] [] ? default_spin_lock_flags+0x9/0x10
[ 104.896335] [] ? taskq_thread+0x27e/0x780 [spl]
[ 104.896339] [] ? finish_task_switch+0x41/0xe0
[ 104.896341] [] ? default_wake_function+0x0/0x20
[ 104.896346] [] ? taskq_thread+0x0/0x780 [spl]
[ 104.896349] [] ? kthread+0x96/0xa0
[ 104.896352] [] ? kernel_thread_helper+0x4/0x10
[ 104.896354] [] ? kthread+0x0/0xa0
[ 104.896356] [] ? kernel_thread_helper+0x0/0x10
[ 104.896589] SPL: Dumping log to /tmp/spl-log.1308169149.5275
[ 109.711448] hrtimer: interrupt took 31041314 ns

By the time zfs_rmnode() is actually called, the znode_t* we pass to it might already have been freed by the prune_icache() thread. Instead we would probably have to pass the zsb + obj_num (possibly along with the generation number, just to be absolutely sure we're deleting the right inode, although this should be safe because no other process can obtain a reference to the deleted xattr dir) and call zfs_zget() to get a new reference to the xattr dir.
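The hand-off Gunnar describes might look roughly like the sketch below. zfs_zget() and kmem_free() are real interfaces in this code base, but the argument struct, callback name, and the rest of the plumbing are hypothetical, and reference handling is deliberately elided.

/* Hypothetical argument block: pass the filesystem and object number,
 * never the znode_t* itself, since prune_icache() may free that first. */
typedef struct rmnode_arg {
	zfs_sb_t	*ra_zsb;	/* the "zsb" from the comment above */
	uint64_t	ra_obj;		/* object number of the node to remove */
} rmnode_arg_t;

static void
zfs_rmnode_deferred(void *arg)
{
	rmnode_arg_t *ra = arg;
	znode_t *zp;

	/*
	 * Re-acquire a fresh reference from taskq context.  ENOENT simply
	 * means the object is already gone, which is fine.
	 */
	if (zfs_zget(ra->ra_zsb, ra->ra_obj, &zp) == 0)
		zfs_rmnode(zp);	/* reference drop elided in this sketch */

	kmem_free(ra, sizeof (rmnode_arg_t));
}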

@behlendorf
Contributor

It's never that easy, is it? Deferring this safely looks like it's going to be problematic. What if we try a different approach?

The ilookup() in zfs_zget() above is really just blocked on the I_FREEING flag in the inode. If we could somehow safely detect this, we could just skip this object id and avoid the deadlock entirely. The zfs_purgedir() code is already set up to leave the xattr dir in the unlinked set if entries are skipped (because of a lack of free space). We could leverage that existing machinery.

How exactly this could be done cleanly isn't obvious to me, however. I'm open to all good ideas!

@behlendorf
Contributor

There's one other invariant here we need to be careful of. The passed znode/inode must be freed as part of evict(). This cannot be deferred, because after evict() returns the znode/inode may be immediately reused. Other znodes/inodes, such as the xattr directory or the xattrs themselves, can be safely deferred, at least until prune_icache() explicitly evicts them.

@gunnarbeutner
Contributor

Here's my latest idea: 4322a5782e384d008e8c64bbbfb478c143d4c652

I've moved the zfs_purgedir() call to zfs_unlinked_drain(). Instead, zfs_rmnode() now checks whether the xattr dir is empty and leaves it in the unlinked set if it finds any xattrs.

There's at least one possible locking problem with the new code which I still need to think about: the ZFS superblock I pass to the taskq is not locked/ref-counted. If someone manages to unmount the filesystem before the taskq work is scheduled, zfs_unlinked_drain() might use a stale zsb struct, assuming it gets scheduled before the taskq is destroyed.

Also, it would probably be a good idea to have the zfs_unlinked_drain thread wait a little while before actually removing znodes from the unlinked set, in case someone is simultaneously removing lots of files with xattrs. This might save us some unnecessary context switches.
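A hedged sketch of the emptiness check described above, as it might sit inside zfs_rmnode(): zap_count() is the generic ZAP counting interface, while the surrounding names (is_xattr_dir, os, zp) and the "leave it in the unlinked set" branch are assumptions about the referenced commit rather than quotes from it.

	uint64_t xattrs = 0;

	/*
	 * If the znode being removed is an xattr directory that still
	 * contains entries, do not purge it from here (that would call
	 * zfs_zget() beneath evict()).  Leave it in the unlinked set and
	 * let zfs_unlinked_drain() purge it later, outside the
	 * prune_icache() path.
	 */
	if (is_xattr_dir &&
	    zap_count(os, zp->z_id, &xattrs) == 0 && xattrs != 0) {
		/* keep the object in the unlinked set; details elided */
		return;
	}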

@behlendorf
Contributor

That seems reasonable and correct to me.

I took your patch one step further and handled the superblock reference problem by adding a call to taskq_wait() in zfsvfs_teardown(). This will be called as part of .put_super(), which allows us to block until all of the work items we added to the taskq have been handled. Without this, the feared reference problem was easy to hit as follows:

# touch /tank/test
# attr -s a -V b /tank/test
# rm /tank/test
# zfs umount tank

If this looks good to you I'll merge it into master. See commit c3ff7ef8c0efd66be465ccdee6fa19ae215724d7.
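For context, SPL's taskq_wait() blocks until everything previously dispatched to a queue has run, so the barrier described above amounts to the sketch below; exactly where it sits in zfsvfs_teardown() and the name of the taskq handle are assumptions, not an excerpt of commit c3ff7ef8.

	/*
	 * In zfsvfs_teardown(), reached via .put_super(): drain any deferred
	 * zfs_rmnode() work before the superblock goes away, so no taskq
	 * callback can ever see a stale zsb.
	 */
	taskq_wait(zfs_iput_taskq);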

@behlendorf behlendorf reopened this Jun 20, 2011
@behlendorf
Contributor

Reopening... for now. The commit comment closed it even though it wasn't committed to master. I'll need to keep that in mind in the future.

@behlendorf
Contributor

Actually, see commit b00131d. It's the same as the previous commit; I just updated the commit message.

@behlendorf
Contributor

Gunnar, I'd like to merge commit b00131d into master. So far it appears to be working well, but if you have any remaining concerns please let us know. If not, I'll try to merge it by the end of Tuesday PST.

@gunnarbeutner
Contributor

Should be fine as far as I can tell. All the other possible locking issues I can think of are nicely taken care of by zfs_zget() returning ENOENT when another thread is already holding a reference to (partially) deleted inodes.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Mar 1, 2015
Because spl_slab_size() was always returning -ENOSPC for caches of
type KMC_OFFSLAB, the cache could never be created.  Additionally,
the slab size is rounded up to a page, which is what kv_alloc()
expects.  The kv_alloc() code will minimally allocate a page; in
the KMC_OFFSLAB case this could be reduced.

The basic regression tests kmem:slab_small, kmem:slab_large,
and kmem:slab_align were updated to test KMC_OFFSLAB.

Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Ying Zhu <[email protected]>
Closes openzfs#266
fuhrmannb pushed a commit to fuhrmannb/cstor that referenced this issue Nov 3, 2020
Include component name and resource name in pool error messages (openzfs#266)
zpool logs with ecode, msg and reason

Signed-off-by: moteesh <[email protected]>
fuhrmannb pushed a commit to fuhrmannb/cstor that referenced this issue Nov 3, 2020
mmaybee pushed a commit to mmaybee/openzfs that referenced this issue Apr 6, 2022
LockSet can be simplified by using watch_once, and always dropping the
Sender (without .send()-ing).

Remove some commented-out code in watch_once.rs.
ixhamza added a commit to ixhamza/zfs that referenced this issue Jan 14, 2025
NAS-133441 / None / Sync truenas/zfs-2.3-release with upstream zfs-2.3-rc5 tag