
Fix dnode_hold_impl() soft lockup #8433

Merged
merged 1 commit into openzfs:master on Feb 22, 2019

Conversation

lidongyang
Copy link
Contributor

Soft lockups could happen when multiple threads try to get the zrl
on the same dnode handle in order to allocate and initialize a
dnode marked as DN_SLOT_ALLOCATED.

Don't loop from the beginning when we can't get the zrl; otherwise
we would keep increasing the zrl refcount and nobody could actually
lock it.

Signed-off-by: Li Dongyang [email protected]
Closes #8426

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

@behlendorf behlendorf self-requested a review February 19, 2019 17:24
Contributor

@behlendorf behlendorf left a comment


Nice analysis and root cause. The proposed fix looks good, though it's not ideal that we may spin in the unlikely case where the dnode is highly contended. I think you should additionally add a call to cond_resched() in both of the tryenter while loops to add an explicit place where the scheduler can be called.
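
For readers following along, here is a minimal sketch of the loop shape being suggested, assuming the existing zrlock API (zrl_tryenter()) and the dnode_handle_t field dnh_zrlock; the merged change may differ in detail:

	/*
	 * Illustrative sketch only, not the merged patch: instead of
	 * restarting the whole slot scan when zrl_tryenter() fails, keep
	 * retrying this handle and give the scheduler an explicit
	 * preemption point so the current holder can run and release.
	 */
	while (!zrl_tryenter(&dnh->dnh_zrlock))
		cond_resched();	/* no-op wrapper in user space, see below */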

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Feb 19, 2019
@behlendorf behlendorf requested a review from ofaaland February 19, 2019 18:54
@lidongyang
Contributor Author

@behlendorf agreed, shall we do the same to the while loop in dnode_free_interior_slots()?
JFYI this issue was discovered on a Lustre MDT server with a ZFS backend; here is the full trace:

[ 946.702862] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [mdt00_023:36936]
[ 946.703859] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [mdt00_021:32775]
[ 946.703901] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ptlrpc(OE) obdclass(OE) ko2iblnd(OE) lnet(OE) libcfs(OE) bonding rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx4_en(OE) mlx4_ib(OE) ib_core(OE) mlx4_core(OE) dm_mirror dm_region_hash dm_log zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) dm_round_robin sb_edac edac_core intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm ipmi_si irqbypass crc32_pclmul ghash_clmulni_intel ipmi_devintf iTCO_wdt iTCO_vendor_support nfsd aesni_intel ioatdma lrw dm_multipath gf128mul glue_helper sg hpwdt hpilo ablk_helper auth_rpcgss i2c_i801 lpc_ich pcspkr wmi cryptd nfs_acl acpi_power_meter
[ 946.703920] dm_mod dca shpchp ipmi_msghandler acpi_cpufreq lockd grace sunrpc knem(OE) ip_tables xfs libcrc32c mgag200 i2c_algo_bit sd_mod crc_t10dif drm_kms_helper crct10dif_generic syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx5_core(OE) mlxfw(OE) mlx_compat(OE) crct10dif_pclmul tg3 devlink crct10dif_common ptp crc32c_intel serio_raw i2c_core pps_core hpsa(OE) scsi_transport_sas
[ 946.703923] CPU: 0 PID: 32775 Comm: mdt00_021 Tainted: P OE ------------ 3.10.0-693.21.1.el7.x86_64 #1
[ 946.703924] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 04/25/2017
[ 946.703925] task: ffff885eef3c0fd0 ti: ffff885ec13a0000 task.ti: ffff885ec13a0000
[ 946.703983] RIP: 0010:[] [] zrl_is_locked+0x9/0x20 [zfs]
[ 946.703984] RSP: 0018:ffff885ec13a3818 EFLAGS: 00000297
[ 946.703985] RAX: 000000000000000a RBX: ffff885ec13a37a0 RCX: 0000000000000011
[ 946.703985] RDX: 0000000000000009 RSI: 0000000000000001 RDI: ffff88bc4fbd8160
[ 946.703986] RBP: ffff885ec13a3848 R08: ffff885ec13a0000 R09: ffff885eff417a90
[ 946.703987] R10: 7fffffffffffffff R11: 0000000000000000 R12: ffff885eabd69d40
[ 946.703987] R13: ffff88befcbff1c0 R14: ffff885eabd69d98 R15: ffff885eabd69de8
[ 946.703989] FS: 0000000000000000(0000) GS:ffff885eff400000(0000) knlGS:0000000000000000
[ 946.703989] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 946.703990] CR2: 0000562c1ac93098 CR3: 000000bec728c000 CR4: 00000000003607f0
[ 946.703991] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 946.703992] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 946.703992] Call Trace:
[ 946.704013] [] ? dnode_slots_rele+0x48/0x70 [zfs]
[ 946.704032] [] dnode_hold_impl+0x195/0xc30 [zfs]
[ 946.704046] [] ? dmu_buf_rele+0xe/0x10 [zfs]
[ 946.704071] [] ? zap_unlockdir+0x3f/0x60 [zfs]
[ 946.704097] [] ? zap_lookup_norm_by_dnode+0x97/0xc0 [zfs]
[ 946.704115] [] dnode_hold+0x1b/0x20 [zfs]
[ 946.704130] [] dmu_bonus_hold+0x32/0x1d0 [zfs]
[ 946.704140] [] __osd_obj2dnode+0x34/0xc0 [osd_zfs]
[ 946.704144] [] osd_object_init+0x11e/0x850 [osd_zfs]
[ 946.704187] [] ? lu_object_add+0x2d/0x40 [obdclass]
[ 946.704201] [] ? lod_object_init+0x20c/0x3c0 [lod]
[ 946.704218] [] lu_object_alloc+0xe5/0x320 [obdclass]
[ 946.704233] [] lu_object_find_at+0x180/0x2b0 [obdclass]
[ 946.704247] [] lu_object_find+0x16/0x20 [obdclass]
[ 946.704267] [] mdt_object_find+0x4b/0x170 [mdt]
[ 946.704274] [] mdt_getattr_name_lock+0x746/0x1910 [mdt]
[ 946.704281] [] mdt_intent_getattr+0x2b0/0x480 [mdt]
[ 946.704288] [] mdt_intent_policy+0x441/0xc70 [mdt]
[ 946.704341] [] ldlm_lock_enqueue+0x38a/0x980 [ptlrpc]
[ 946.704369] [] ldlm_handle_enqueue0+0x9d3/0x16a0 [ptlrpc]
[ 946.704398] [] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
[ 946.704437] [] tgt_enqueue+0x62/0x210 [ptlrpc]
[ 946.704468] [] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[ 946.704494] [] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[ 946.704520] [] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 946.704525] [] ? wake_up_state+0x20/0x20
[ 946.704551] [] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 946.704578] [] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
[ 946.704581] [] kthread+0xd1/0xe0
[ 946.704583] [] ? insert_kthread_work+0x40/0x40
[ 946.704586] [] ret_from_fork+0x77/0xb0
[ 946.704587] [] ? insert_kthread_work+0x40/0x40

@behlendorf
Contributor

shall we do the same to the while loop in dnode_free_interior_slots()?

Yes, good idea. Please do so.

@lidongyang
Contributor Author

Looks like dnode.c is used to build libzpool.so, and unlike zfs_znode.c we don't have #ifdef _KERNEL to distinguish the functions needed by userland and kernel. That makes it hard to add cond_resched() and include linux/sched.h.
Maybe that's why we don't have it in dnode_free_interior_slots()?

@PaulZ-98
Contributor

I think it is ok to selectively #ifdef _KERNEL specific code in dnode.c.

@behlendorf
Contributor

We're going to want to add a user space wrapper for this which is a no-op, to fix the build. I'd suggest adding the following to include/sys/zfs_context.h, which already provides similar wrappers for user space.

diff --git a/include/sys/zfs_context.h b/include/sys/zfs_context.h
index d186c40..52b8f5b 100644
--- a/include/sys/zfs_context.h
+++ b/include/sys/zfs_context.h
@@ -236,6 +236,7 @@ extern kthread_t *zk_thread_create(void (*func)(void *), void *arg,
 
 #define        kpreempt_disable()      ((void)0)
 #define        kpreempt_enable()       ((void)0)
+#define        cond_resched()          ((void)0)
 
 /*
  * Mutexes

@lidongyang
Contributor Author

Cool, thanks for pointing out zfs_context.h; we don't need open-coded #ifdef _KERNEL blocks in dnode.c.
Before I update the commit, shall we use sched_yield() here for userland?

@behlendorf
Contributor

That makes sense; yes, let's define cond_resched() as sched_yield() in user space.
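
Putting the two together, the user space define would then be a variant of the zfs_context.h hunk above, with sched_yield() in place of the no-op (a sketch, not the exact merged change; sched.h may need including if zfs_context.h doesn't already pull it in):

--- a/include/sys/zfs_context.h
+++ b/include/sys/zfs_context.h
@@ -236,6 +236,7 @@ extern kthread_t *zk_thread_create(void (*func)(void *), void *arg,
 
 #define        kpreempt_disable()      ((void)0)
 #define        kpreempt_enable()       ((void)0)
+#define        cond_resched()          sched_yield()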

Soft lockups could happen when multiple threads try to get the zrl
on the same dnode handle in order to allocate and initialize a
dnode marked as DN_SLOT_ALLOCATED.

Don't loop from the beginning when we can't get the zrl; otherwise
we would keep increasing the zrl refcount and nobody could actually
lock it.

Signed-off-by: Li Dongyang <[email protected]>
Closes openzfs#8426
@codecov

codecov bot commented Feb 21, 2019

Codecov Report

Merging #8433 into master will decrease coverage by 0.05%.
The diff coverage is 80.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8433      +/-   ##
==========================================
- Coverage   78.55%   78.49%   -0.06%     
==========================================
  Files         380      380              
  Lines      116004   115900     -104     
==========================================
- Hits        91125    90977     -148     
- Misses      24879    24923      +44
Flag Coverage Δ
#kernel 79.02% <83.87%> (+0.06%) ⬆️
#user 67.03% <77.77%> (-0.3%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9abbee4...02a0da1.

Contributor

@tcaputi tcaputi left a comment


This looks good to me. I never really liked the way we looped there anyway so this is a double win for me.

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Feb 21, 2019
@behlendorf behlendorf merged commit 8d9e51c into openzfs:master Feb 22, 2019
@jasimmons1973

We have deployed ZFS 0.7.13 with this patch applied and it reduced the number of times this bug happens, but we did hit it again on our production system. Should this ticket be reopened? A similar backtrace:

[1465132.552779] [] ? dnode_slots_rele+0x57/0x70 [zfs]
[1465132.552779] Call Trace:
[1465132.552781] [] ? __schedule+0x3ff/0x890
[1465132.552800] [] ? dnode_hold_impl+0x185/0xc40 [zfs]
[1465132.552802] [] ? __mutex_lock_slowpath+0x1b8/0x1d0
[1465132.552820] [] ? dnode_hold+0x1b/0x20 [zfs]
[1465132.552822] [] ? mutex_lock+0x1f/0x2f
[1465132.552840] [] ? dmu_bonus_hold+0x32/0x1d0 [zfs]
[1465132.552844] [] ? cv_wait_common+0xf2/0x150 [spl]
[1465132.552850] [] ? __osd_obj2dnode+0x34/0xc0 [osd_zfs]
[1465132.552852] [] ? wake_up_atomic_t+0x30/0x30
[1465132.552859] [] ? osd_object_init+0x1de/0xc90 [osd_zfs]
[1465132.552882] [] ? zrl_tryenter+0x9/0x30 [zfs]
[1465132.552900] [] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[1465132.552920] [] ? dnode_slots_tryenter+0x48/0xb0 [zfs]
[1465132.552938] [] ? lu_object_add+0x2d/0x40 [obdclass]
[1465132.552957] [] ? dnode_hold_impl+0x195/0xc40 [zfs]
[1465132.552975] [] ? lu_object_alloc+0xe5/0x320 [obdclass]
[1465132.552999] [] ? zap_unlockdir+0x3f/0x60 [zfs]
[1465132.553016] [] ? lu_object_find_at+0x180/0x2b0 [obdclass]
[1465132.553039] [] ? zap_lookup_norm+0xa0/0xd0 [zfs]
[1465132.553057] [] ? lu_object_find+0x16/0x20 [obdclass]
[1465132.553076] [] ? dnode_hold+0x1b/0x20 [zfs]
[1465132.553082] [] ? ofd_object_find+0x35/0x100 [ofd]

@lidongyang
Contributor Author

@jasimmons1973 can we have a complete trace? Is it still a soft lockup?

@jasimmons1973

Do you need just the complete dmesg output or do you need kdump info as well?

@jasimmons1973

Here is the crash dump bt:

#25 [ffff8dcd0805b6f8] apic_timer_interrupt at ffffffffa1775df2
[exception RIP: zrl_add_impl+82]
RIP: ffffffffc0a7dd82 RSP: ffff8dcd0805b7a0 RFLAGS: 00000202
RAX: 000000000000000b RBX: 00000000000003dd RCX: 0000000000000010
RDX: 000000000000000f RSI: ffffffffc0aee3f0 RDI: ffff8dcbc311edd8
RBP: ffff8dcd0805b7c8 R8: 0000000000000101 R9: 0000000000000001
R10: 00000000000003dd R11: 0000000000000000 R12: ffff8dcd0805b720
R13: ffff8dcd0805b7c8 R14: ffff8dc373925d98 R15: 0000000000000000
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#26 [ffff8dcd0805b7d0] dnode_hold_impl at ffffffffc09d2124 [zfs]
#27 [ffff8dcd0805b878] dnode_hold at ffffffffc09d2c2b [zfs]
#28 [ffff8dcd0805b888] dmu_bonus_hold at ffffffffc09bb682 [zfs]
#29 [ffff8dcd0805b8b8] __osd_obj2dnode at ffffffffc10ec874 [osd_zfs]
#30 [ffff8dcd0805b8f0] osd_object_init at ffffffffc10ecade [osd_zfs]
#31 [ffff8dcd0805b990] lu_object_alloc at ffffffffc11c6dd5 [obdclass]
#32 [ffff8dcd0805b9e8] lu_object_find_at at ffffffffc11c71f0 [obdclass]
#33 [ffff8dcd0805ba50] lu_object_find at ffffffffc11c7336 [obdclass]
#34 [ffff8dcd0805ba60] ofd_object_find at ffffffffc169c445 [ofd]
#35 [ffff8dcd0805ba88] ofd_preprw at ffffffffc16a5ea0 [ofd]
#36 [ffff8dcd0805bb28] tgt_brw_read at ffffffffc1417799 [ptlrpc]
#37 [ffff8dcd0805bd00] tgt_request_handle at ffffffffc141605a [ptlrpc]
#38 [ffff8dcd0805bd48] ptlrpc_server_handle_request at ffffffffc13bb843 [ptlrpc]
#39 [ffff8dcd0805bde8] ptlrpc_main at ffffffffc13beff2 [ptlrpc]
#40 [ffff8dcd0805bec8] kthread at ffffffffa10c1c71
#41 [ffff8dcd0805bf50] ret_from_fork_nospec_begin at ffffffffa1774c1d

@lidongyang
Contributor Author

A complete dmesg would also help.
Can you have a look at dnode_hold_alloc_lock_retry and zrl->zr_refcount from the kdump?

If you can get the kdump uploaded somewhere it would be even better.
Since the patch was cherry-picked on top of 0.7.13, we need your packages (including debuginfo packages) as well.

@jasimmons1973

jasimmons1973 commented Apr 24, 2019

Sadly we are not allowed to release vmcores. You can get the dmesg from:

http://www.infradead.org/~jsimmons/vmcore-dmesg.txt

It's too big to post here.

@jasimmons1973

Sorry, but how does one examine data like zrl->zr_refcount with kdump? I'm attempting to use bt -ff but I don't see a clear way to find the zr_refcount value.

@lidongyang
Contributor Author

James,
for dnode_hold_alloc_lock_retry, you can get it in crash by looking at dnode_stats.

For zrl->zr_refcount, you need to figure out the address of db from the stack, which is a pointer to a dmu_buf_impl_t. Then look at db->db_user, which is a pointer to dnc->dnc_dbu; that lets you locate dnc, a struct dnode_children. From there, dnc->dnc_children is a dynamically allocated array of dnode_handle_t, and for each dnode_handle_t you can look at dnh_zrlock, which is a zrlock_t, and finally at zr_refcount.

I'll see if I can come up with a step-by-step guide without the dump, though it could be difficult.
Dongyang
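
As a reading aid, a rough C sketch of that pointer chain (struct and field names as described above; dump_slot_refcounts is a hypothetical helper, and this is meant as a map for reading the dump rather than code to compile against the kernel):

	/*
	 * Walk from the dmu_buf_impl_t recovered from the stack frame to
	 * each slot's zr_refcount.  The cast works because dnc_dbu is the
	 * first member of struct dnode_children.
	 */
	static void
	dump_slot_refcounts(dmu_buf_impl_t *db)
	{
		dnode_children_t *dnc = (dnode_children_t *)db->db_user;

		for (size_t i = 0; i < dnc->dnc_count; i++) {
			dnode_handle_t *dnh = &dnc->dnc_children[i];
			printf("slot %zu zr_refcount = %d\n",
			    i, (int)dnh->dnh_zrlock.zr_refcount);
		}
	}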

@bzzz77
Contributor

bzzz77 commented Jul 9, 2019

We've got another report ( https://jira.whamcloud.com/browse/EDU-119 ) which seems to be at least related. And frankly I don't quite understand how the original patch is supposed to work, given that multiple threads might be spinning trying to grab the slots starting from different idx.

@lidongyang
Contributor Author

Alex, if the threads can't grab all the slots down the way, they will release the zrls they have and start from the beginning, so the one with the largest idx will win.
However, whether we should use a zrl and spin on write is another question...

@bzzz77
Contributor

bzzz77 commented Jul 10, 2019

let's consider a case:
thread1 wants to grab 4 slots starting from idx=0
thread2 wants to grab 2 slots starting from idx=2
T1 got slot0 and slot1 and goes to take slot2, but then T2 got slot2 -- do both restart, again and again?

@lidongyang
Contributor Author

Correct me if I'm wrong, but I think after T2 gets slot2, T1 will release slot0 and slot1 and then restart, while T2 continues with slot3. Once T2 is done with the locks it will release slot2 and slot3,
and after that T1 will be able to continue.
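
A minimal sketch of that back-off pattern, assuming the zrlock API (zrl_tryenter()/zrl_exit()); the real logic lives in dnode_slots_tryenter()/dnode_slots_rele() and dnode_hold_impl(), so the function name and shape here are only illustrative:

	/*
	 * Try to lock slots [idx, idx + slots).  On any failure, release
	 * what is already held and report failure so the caller can back
	 * off and retry.
	 */
	static boolean_t
	slots_tryenter_all(dnode_handle_t *children, int idx, int slots)
	{
		for (int i = idx; i < idx + slots; i++) {
			if (!zrl_tryenter(&children[i].dnh_zrlock)) {
				while (--i >= idx)
					zrl_exit(&children[i].dnh_zrlock);
				return (B_FALSE);
			}
		}
		return (B_TRUE);
	}

	/*
	 * Caller: in the T1/T2 example, T1 fails on slot2, drops slot0 and
	 * slot1, and retries here (yielding via cond_resched()), while T2
	 * takes slot2/slot3, finishes, and releases them -- then T1's
	 * retry succeeds.
	 */
	while (!slots_tryenter_all(dnc->dnc_children, idx, slots))
		cond_resched();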

Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

livelock in dnode_hold_impl()
6 participants