
zil_commit_waiter can stall forever #3

Closed
lundman opened this issue May 25, 2020 · 17 comments

lundman commented May 25, 2020

System information

Type Version/Name
Distribution Name macOS
Distribution Version
Linux Kernel
Architecture
ZFS Version
SPL Version

git HEAD

Describe the problem you're observing

A call to fsync() can stall forever in zil_commit_waiter().

Describe how to reproduce the problem

Easily repeatable; first encountered while running zfs-tests.

Include any warning/errors/backtraces from the system logs

The stalled command:

 9:04AM sh               sh -x ./scripts/zfs-tests.sh -r tiny
 9:05AM zfs create testpool/testfs

The main stack:
Duration: 9.98s
Steps: 999 (10ms sampling interval)

zfs stack

  Thread 0xea80             DispatchQueue 140735589485544                       999 samples (1-999)       priority 29-31 (base 31)  cpu time 3.113s
  999  start + 1 (libdyld.dylib + 91093) [0x7fff587253d5]
    999  main + 760 (zfs_main.c:8549,9 in zfs + 10984) [0x103731ae8]
      999  zfs_do_create + 3209 (zfs_main.c:1135,8 in zfs + 18201) [0x103733719]
        999  zfs_mount_and_share + 251 (zfs_main.c:766,14 in zfs + 56107) [0x10373cb2b]
          999  zfs_mount + 141 (libzfs_mount.c:375,10 in libzfs.2.dylib + 116349) [0x1037bc67d]
            999  zfs_mount_at + 1391 (libzfs_mount.c:503,7 in libzfs.2.dylib + 117807) [0x1037bcc2f]
              999  mount + 10 (libsystem_kernel.dylib + 38470) [0x7fff58862646]
               *999  hndl_unix_scall64 + 22 (kernel.development + 180998) [0xffffff800022c306]
                 *999  unix_syscall64 + 757 (kernel.development + 7721589) [0xffffff800095d275]
                   *999  mount + 54 (kernel.development + 3440038) [0xffffff8000547da6]
                     *999  __mac_mount + 1004 (kernel.development + 3441052) [0xffffff800054819c]
                       *999  mount_common + 1303 (kernel.development + 3433207) [0xffffff80005462f7]
                         *999  prepare_coveredvp + 117 (kernel.development + 3438229) [0xffffff8000547695]
                           *999  zfs_vnop_fsync + 168 (zfs_vnops_osx.c:1643,8 in zfs + 1637736) [0xffffff7f82bb8d68]
                             *999  zfs_fsync + 380 (zfs_vnops.c:2485,3 in zfs + 1596204) [0xffffff7f82baeb2c]
                               *999  zil_commit + 156 (zil.c:2946,2 in zfs + 1738620) [0xffffff7f82bd177c]
                                 *999  zil_commit_impl + 101 (zil.c:2984,2 in zfs + 1738741) [0xffffff7f82bd17f5]
                                   *998  zil_commit_waiter + 246 (zil.c:2690,23 in zfs + 1739622) [0xffffff7f82bd1b66]
                                     *983  cv_timedwait_hires + 379 (spl-condvar.c:223,14 in zfs + 2467019) [0xffffff7f82c834cb]
                                       *982  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
                                         *980  _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
                                           *910  lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
                                             *842  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                                               *842  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                                                 *741  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135] (runnable)
                                                 *101  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]
                                             *68   thread_block_reason + 345 (kernel.development + 726889) [0xffffff80002b1769]
                                               *67   ml_set_interrupts_enabled + 48 (kernel.development + 2145584) [0xffffff800040bd30] (running)
                                               *1    return_to_iret + 276 (kernel.development + 179716) [0xffffff800022be04]
                                                 *1    ast_taken_kernel + 113 (kernel.development + 548657) [0xffffff8000285f31]
                                                   *1    thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                                                     *1    thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                                                       *1    machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135] (runnable)
                                           *69   lck_mtx_sleep_deadline + 105 (kernel.development + 666729) [0xffffff80002a2c69]
                                             *68   assert_wait_deadline + 152 (kernel.development + 725112) [0xffffff80002b1078]
                                               *61   ml_set_interrupts_enabled + 48 (kernel.development + 2145584) [0xffffff800040bd30] (running)
                                               *7    return_to_iret + 276 (kernel.development + 179716) [0xffffff800022be04]
                                                 *7    ast_taken_kernel + 113 (kernel.development + 548657) [0xffffff8000285f31]
                                                   *7    thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                                                     *7    thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                                                       *7    machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135] (runnable)
                                             *1    _global_eventq + 1 (kernel.development + 960657) [0xffffff80002ea891] (running)
                                           *1    lck_mtx_sleep_deadline + 176 (kernel.development + 666800) [0xffffff80002a2cb0]
                                             *1    lck_mtx_lock + 129 (kernel.development + 2141521) [0xffffff800040ad51] (running)
                                         *1    _sleep + 695 (kernel.development + 6503015) [0xffffff8000833a67] (running)
                                         *1    _sleep + 98 (kernel.development + 6502418) [0xffffff8000833812] (running)
                                       *1    msleep + 34 (kernel.development + 6504210) [0xffffff8000833f12] (running)
                                     *4    cv_timedwait_hires + 411 (spl-condvar.c:227,31 in zfs + 2467051) [0xffffff7f82c834eb]
                                       *4    gethrestime_sec + 17 (spl-time.c:108,5 in zfs + 2564369) [0xffffff7f82c9b111]
                                         *4    microtime + 25 (kernel.development + 6585785) [0xffffff8000847db9]
                                           *4    ml_set_interrupts_enabled + 48 (kernel.development + 2145584) [0xffffff800040bd30] (running)
                                     *3    spl_wdlist_settime + 1 (spl-mutex.c:73 in zfs + 2536849) [0xffffff7f82c94591] (running)
                                     *3    cv_timedwait_hires + 120 (spl-condvar.c:184,21 in zfs + 2466760) [0xffffff7f82c833c8]
                                       *1    _rtc_nanotime_read + 16 (kernel.development + 176208) [0xffffff800022b050] (running)
                                       *1    _rtc_nanotime_read + 12 (kernel.development + 176204) [0xffffff800022b04c] (running)
                                       *1    _rtc_nanotime_read + 7 (kernel.development + 176199) [0xffffff800022b047] (running)
                                     *2    current_thread + 1 (kernel.development + 2148737) [0xffffff800040c981] (running)
                                     *1    cv_timedwait_hires + 391 (spl-condvar.c:225,17 in zfs + 2467031) [0xffffff7f82c834d7] (running)
                                     *1    cv_timedwait_hires + 382 (spl-condvar.c:225,19 in zfs + 2467022) [0xffffff7f82c834ce] (running)
                                     *1    cv_timedwait_hires + 139 (spl-condvar.c:189,7 in zfs + 2466779) [0xffffff7f82c833db] (running)
                                   *1    zil_commit_waiter + 250 (zil.c:2694,17 in zfs + 1739626) [0xffffff7f82bd1b6a] (running)

There does not appear to be any other threads in play, generally they are all idle. For example;

other stacks

  Thread 0xea40             999 samples (1-999)       priority 81 (base 81)
 *999  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *999  taskq_thread + 711 (spl-taskq.c:1768,11 in zfs + 2561511) [0xffffff7f82c9a5e7]
     *999  taskq_thread_wait + 99 (spl-taskq.c:1677,3 in zfs + 2562403) [0xffffff7f82c9a963]
       *999  spl_cv_wait + 130 (spl-condvar.c:75,14 in zfs + 2466034) [0xffffff7f82c830f2]
         *999  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
           *999  _sleep + 474 (kernel.development + 6502794) [0xffffff800083398a]
             *999  lck_mtx_sleep + 136 (kernel.development + 666168) [0xffffff80002a2a38]
               *999  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                 *999  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                   *999  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]



  Thread 0xea41             999 samples (1-999)       priority 81 (base 81)     cpu time <0.001s
 *999  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *999  txg_quiesce_thread + 167 (txg.c:618,4 in zfs + 1030583) [0xffffff7f82b249b7]
     *999  txg_thread_wait + 132 (txg.c:254,3 in zfs + 1036404) [0xffffff7f82b26074]
       *999  spl_cv_wait + 130 (spl-condvar.c:75,14 in zfs + 2466034) [0xffffff7f82c830f2]
         *999  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
           *999  _sleep + 474 (kernel.development + 6502794) [0xffffff800083398a]
             *999  lck_mtx_sleep + 136 (kernel.development + 666168) [0xffffff80002a2a38]
               *999  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                 *999  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                   *999  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]

  Thread 0xea42             999 samples (1-999)       priority 81 (base 81)     cpu time <0.001s
 *999  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *998  txg_sync_thread + 373 (txg.c:542,4 in zfs + 1031157) [0xffffff7f82b24bf5]
     *998  txg_thread_wait + 92 (txg.c:251,10 in zfs + 1036364) [0xffffff7f82b2604c]
       *998  spl_cv_timedwait + 293 (spl-condvar.c:135,14 in zfs + 2466405) [0xffffff7f82c83265]
         *998  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
           *998  _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
             *998  lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
               *998  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                 *998  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                   *998  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]
   *1    txg_sync_thread + 760 (txg.c:579,3 in zfs + 1031544) [0xffffff7f82b24d78]
     *1    spa_sync + 1322 (spa.c:9023,2 in zfs + 918362) [0xffffff7f82b0935a]
       *1    spa_sync_iterate_to_convergence + 338 (spa.c:8826,3 in zfs + 920194) [0xffffff7f82b09a82]
         *1    ddt_sync + 211 (ddt.c:1154,9 in zfs + 207219) [0xffffff7f82a5b973]
           *1    zio_wait + 598 (zio.c:2203,11 in zfs + 1775062) [0xffffff7f82bda5d6]
             *1    spl_cv_timedwait + 293 (spl-condvar.c:135,14 in zfs + 2466405) [0xffffff7f82c83265]
               *1    msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
                 *1    _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
                   *1    lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
                     *1    thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                       *1    thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                         *1    machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]

  Thread 0xea43             999 samples (1-999)       priority 81 (base 81)     cpu time <0.001s
 *999  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *999  mmp_thread + 1904 (mmp.c:674,10 in zfs + 800992) [0xffffff7f82aec8e0]
     *999  cv_timedwait_hires + 379 (spl-condvar.c:223,14 in zfs + 2467019) [0xffffff7f82c834cb]
       *999  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
         *999  _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
           *999  lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
             *999  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
               *999  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                 *999  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]

  Thread 0xec17             67 samples (933-999)      priority 81 (base 81)     cpu time 0.003s
 *67  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *67  tx_flush_thread + 121 (apfs + 797086) [0xffffff7f818bb99e]
     *67  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
       *67  _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
         *67  lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
           *67  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
             *67  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
               *67  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]

The code that is stuck is here:

https://github.com/openzfsonosx/openzfs/blob/master/module/zfs/zil.c#L2690


		if (lwb != NULL && lwb->lwb_state == LWB_STATE_OPENED) {
			ASSERT3B(timedout, ==, B_FALSE);

			/*
			 * If the lwb hasn't been issued yet, then we
			 * need to wait with a timeout, in case this
			 * function needs to issue the lwb after the
			 * timeout is reached; responsibility (2) from
			 * the comment above this function.
			 */
===>		clock_t timeleft = cv_timedwait_hires(&zcw->zcw_cv,
			    &zcw->zcw_lock, wakeup, USEC2NSEC(1),
			    CALLOUT_FLAG_ABSOLUTE);

			if (timeleft >= 0 || zcw->zcw_done)
				continue;

			timedout = B_TRUE;
			zil_commit_waiter_timeout(zilog, zcw);
It spins rather quickly, easily consuming a core on its own. Increasing the timeout does not fix the problem; it just lowers the load average. The waiter is simply never signalled, or if it is, the signal is missed.


lundman commented May 25, 2020

Adding some classy debugging printfs:

+       printf("%s: broadcast to %p\n", __func__, &zcw->zcw_cv);
        cv_broadcast(&zcw->zcw_cv);

+               printf("%s: broadcast2 to %p\n", __func__, &zcw->zcw_cv);
                cv_broadcast(&zcw->zcw_cv);


            clock_t timeleft = cv_timedwait_hires(&zcw->zcw_cv,
                            &zcw->zcw_lock, wakeup, USEC2NSEC(1),
                            CALLOUT_FLAG_ABSOLUTE);

+                       if (timeleft == -1) {
+                               printf("%s: %p spinning waiting for wakeup\n", __func__,
+                                       &zcw->zcw_cv);

We get the following output:

<zfs`zil_commit_waiter (zil.c:2699)> zil_commit_waiter: 0xffffff90ac9b3ee8 spinning waiting for wakeup
<zfs`zil_commit_waiter (zil.c:2699)> zil_commit_waiter: 0xffffff90ac9b3ee8 spinning waiting for wakeup

There is no signal delivered to 0xffffff90ac9b3ee8.


lundman commented May 25, 2020

Removing the 'static' in front of those two functions, we can confirm that neither is called:

# dtrace -n 'zil_commit_waiter_skip:entry' -n 'zil_commit_waiter_skip:return { printf("%d", arg1);}' -n 'zil_lwb_flush_vdevs_done:entry' -n 'zil_lwb_flush_vdevs_done:return {printf("%d", arg1);}'

dtrace: description 'zil_commit_waiter_skip:entry' matched 1 probe
dtrace: description 'zil_commit_waiter_skip:return ' matched 1 probe
dtrace: description 'zil_lwb_flush_vdevs_done:entry' matched 1 probe
dtrace: description 'zil_lwb_flush_vdevs_done:return ' matched 1 probe

# nada


lundman commented May 25, 2020

Removing all static and using:

# dtrace -n 'zil_*:entry' -n 'zil_*:return { printf("%d", arg1);}'

We end up with this log in its entirety, from start to hang:

dtrace zil*

CPU     ID                    FUNCTION:NAME
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168599168
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168600256
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169175152
  0  79243                  zil_sync:return -549527168128
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169175152
  0  79243                  zil_sync:return -549527168128
  1  79258                   zil_free:entry
  1  79259                  zil_free:return -549527162368
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return -549527168128
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 2
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 3
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168601344
  1  79258                   zil_free:entry
  1  79259                  zil_free:return -549527162368
  1  79256                  zil_alloc:entry
  1  79257                 zil_alloc:return -478168602432
  1  79260                   zil_open:entry
  1  79261                  zil_open:return -478168602432
  1  79274                 zil_replay:entry
  1  79164                zil_destroy:entry
  1  79165               zil_destroy:return -478169176368
  1  79275                zil_replay:return -478169176368
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168603520
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478163571440
  1  79243                  zil_sync:return 2
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478163571440
  1  79243                  zil_sync:return 2
  0  79258                   zil_free:entry
  0  79259                  zil_free:return -549527168128
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 2
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 3
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 4
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168603520
  0  79258                   zil_free:entry
  0  79259                  zil_free:return -549527168128
  1  79256                  zil_alloc:entry
  1  79257                 zil_alloc:return -478168600256
  0  79258                   zil_free:entry
  0  79259                  zil_free:return -549527168128
  1  79276              zil_replaying:entry
  1  79277             zil_replaying:return 0
  1  79204             zil_itx_create:entry
  1  79205            zil_itx_create:return -478162252288
  1  79212             zil_itx_assign:entry
  1  79213            zil_itx_assign:return -549527162368
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169176368
  0  79243                  zil_sync:return -549527168128
  1  79238                 zil_commit:entry
  1  79240            zil_commit_impl:entry
  1  79214          zil_async_to_sync:entry
  1  79215         zil_async_to_sync:return 9
  1  79232    zil_alloc_commit_waiter:entry
  1  79233   zil_alloc_commit_waiter:return -478162256152
  1  79236      zil_commit_itx_assign:entry
  1  79204             zil_itx_create:entry
  1  79205            zil_itx_create:return -478162252800
  1  79212             zil_itx_assign:entry
  1  79213            zil_itx_assign:return -549527162368
  1  79237     zil_commit_itx_assign:return -549527162368
  1  79226          zil_commit_writer:entry
  1  79218        zil_get_commit_list:entry
  1  79219       zil_get_commit_list:return 9
  1  79220      zil_prune_commit_list:entry
  1  79221     zil_prune_commit_list:return -478162252224
  1  79224    zil_process_commit_list:entry
  1  79162                 zil_create:entry
  1  79130         zil_init_log_chain:entry
  1  79131        zil_init_log_chain:return -477873195880
  1  79152              zil_alloc_lwb:entry
  1  79244               zil_lwb_cons:entry
  1  79245              zil_lwb_cons:return 0
  1  79153             zil_alloc_lwb:return -478162260664
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169176368
  0  79243                  zil_sync:return -549527168128
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return -549527162368
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return 2
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return 3
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return 4
  0  79216                  zil_clean:entry
  0  79217                 zil_clean:return -478166638264
  1  79206             zil_itxg_clean:entry
  1  79207            zil_itxg_clean:return -549527162368
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169176368
  0  79243                  zil_sync:return -549527168128
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return -549527162368
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 2
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 3
  0  79216                  zil_clean:entry
  0  79217                 zil_clean:return -478166638264
  0  79206             zil_itxg_clean:entry
  0  79207            zil_itxg_clean:return -549527168128
  0  79163                zil_create:return -478162260664
  0  79186            zil_itx_destroy:entry
  0  79187           zil_itx_destroy:return -549527168128
  0  79202             zil_lwb_commit:entry
  0  79192         zil_lwb_write_open:entry
  0  79190 zil_lwb_set_zio_dependency:entry
  0  79191 zil_lwb_set_zio_dependency:return -478162260664
  0  79193        zil_lwb_write_open:return -549527168128
  0  79174 zil_commit_waiter_link_lwb:entry
  0  79175 zil_commit_waiter_link_lwb:return -549527168128
  0  79203            zil_lwb_commit:return -478162260664
  0  79225   zil_process_commit_list:return 0
  0  79227         zil_commit_writer:return -549527168128
  0  79230          zil_commit_waiter:entry
^C
(never returns)


lundman commented May 25, 2020

The closest we come to cv_signal() appears to be entering zil_lwb_write_open(), but we do not end up in zio_root(..., zil_lwb_flush_vdevs_done, ...). Is that because we never issue it, or because it simply isn't called?


lundman commented May 25, 2020

Alternatively, zil_process_commit_list() can also call zil_commit_waiter_skip(), so which path are we expecting here?


lundman commented May 25, 2020

They turn out not to skip, due to:

zil_lwb_write_open(zilog_t *zilog, lwb_t *lwb)
{
 ....
    printf("%s: lwb_root_zio is %p\n", __func__, lwb->lwb_root_zio);
    if (lwb->lwb_root_zio == NULL) {

<zfs`zil_lwb_write_open (zil.c:1380)> zil_lwb_write_open: lwb_root_zio is 0x0

and

zil_process_commit_list(zilog_t *zilog)
{
....
    printf("%s: lwb is %p\n", __func__, lwb);
    if (lwb == NULL) {

<zfs`zil_process_commit_list (zil.c:2354)> zil_process_commit_list: lwb is 0xffffff90a608cd48

Might have to call upon @behlendorf to see if he has some insight.


lundman commented May 25, 2020

Ok, it would seem related to:

            if (timeleft >= 0 || zcw->zcw_done)
                continue;

if replaced with:

            if (timeleft != -1 || zcw->zcw_done)
                continue;

Perhaps we have a signed vs. unsigned issue with clock_t.

Should I #ifdef APPLE around it, or would upstream be accepting of the alternate style?


lundman commented May 25, 2020

adeb452

lundman closed this as completed Jun 11, 2020
@behlendorf

It looks like upstream may suffer from this same issue, openzfs/zfs#10440. It may be just a little harder to hit. Would you mind upstreaming your fix?


lundman commented Jun 11, 2020

It's in the queue, but will move it forward and push out today


lundman commented Jun 12, 2020

I don't see how it could affect Linux, but doesn't hurt to try

@behlendorf

Me either. It's a signed type on Linux so it should be fine, but we'll need this change anyway for macOS.


adamdmoss commented Jun 15, 2020

Added a review comment, but in case it gets lost I have the following concern:

AFAICT, the magic -1 number only means 'timeout' in the user-space version of cv_timedwait_hires(), and not (at least for the FreeBSD SPL) in the kernel versions of cv_timedwait_hires().

So I'm a bit worried that this is incorrect on some (or all ;) ) platforms, and is equivalent to just removing the 'timeleft' comparison completely...


lundman commented Jun 15, 2020

Likewise:

Since SPL is emulating the illumos API, we can see that cv_timedwait_hires() returns -1 for timeout:
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/condvar.c#L298

Linux and macOS both also do the same:
https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-condvar.c#L391
https://github.com/openzfsonosx/openzfs/blob/sorted2/module/os/macos/spl/spl-condvar.c#L243

Linux's clock_t is signed, so Linux was fine; macOS's clock_t is unsigned.
The test only cared whether the thread had been signalled (>= 0) or not. Unfortunately, with an unsigned clock_t, "-1" also satisfies that test, yet returning -1 was the only way out of the loop.

I just negated the test to check directly for -1 (or its unsigned equivalent); the caller doesn't care about the "time left". This fixed a hang in kernel code, since -1, 0, and any positive timeleft all satisfied ">= 0" forever.

FreeBSD should probably correct its cv_timedwait_hires(). Currently, there is no way to distinguish being signalled with zero time left from the timeout expiring with no signal. Overall, a positive return value means "signal received" and a negative one means "no signal received".

@mattmacy

Did this PR pass the FreeBSD bots? After rebasing I get test hangs.

@behlendorf

The CI did in fact catch this, but due to the other recent issues with FreeBSD head I didn't notice at the time and it slipped in. Sorry!


adamdmoss commented Jun 15, 2020

@lundman - thanks for the explanation; I think 'git grep' failed me when looking up the normal behavior.

Glad I sort of accidentally found a bug in the freebsd SPL though. 🤣

lundman pushed a commit that referenced this issue Feb 22, 2021
Mixing ZIL and normal allocations has several problems:

1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed.  This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.

2. When under moderate load, ZIL allocations are of 128KB.  If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more.  The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC.  All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading.  This can cause a significant performance
impact.

3. If the pool is very fragmented, there may be zero free chunks of
128KB or more.  In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.

These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.

This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class).  From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.

Log (ZIL) blocks can be allocated from the following locations.  Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)

The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.

On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes.  On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).

Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11389