
zil_commit_waiter can stall forever #3

Closed
lundman opened this issue May 25, 2020 · 17 comments

lundman commented May 25, 2020

System information

Type Version/Name
Distribution Name macOS
Distribution Version
Linux Kernel
Architecture
ZFS Version
SPL Version

git HEAD

Describe the problem you're observing

A call to fsync() can stall forever in zil_commit_waiter().

Describe how to reproduce the problem

Easily repeatable; first encountered while running zfs-tests.

Include any warning/errors/backtraces from the system logs

The stalled command:

 9:04AM sh               sh -x ./scripts/zfs-tests.sh -r tiny
 9:05AM zfs create testpool/testfs

The main stack:
Duration: 9.98s
Steps: 999 (10ms sampling interval)

zfs stack

  Thread 0xea80             DispatchQueue 140735589485544                       999 samples (1-999)       priority 29-31 (base 31)  cpu time 3.113s
  999  start + 1 (libdyld.dylib + 91093) [0x7fff587253d5]
    999  main + 760 (zfs_main.c:8549,9 in zfs + 10984) [0x103731ae8]
      999  zfs_do_create + 3209 (zfs_main.c:1135,8 in zfs + 18201) [0x103733719]
        999  zfs_mount_and_share + 251 (zfs_main.c:766,14 in zfs + 56107) [0x10373cb2b]
          999  zfs_mount + 141 (libzfs_mount.c:375,10 in libzfs.2.dylib + 116349) [0x1037bc67d]
            999  zfs_mount_at + 1391 (libzfs_mount.c:503,7 in libzfs.2.dylib + 117807) [0x1037bcc2f]
              999  mount + 10 (libsystem_kernel.dylib + 38470) [0x7fff58862646]
               *999  hndl_unix_scall64 + 22 (kernel.development + 180998) [0xffffff800022c306]
                 *999  unix_syscall64 + 757 (kernel.development + 7721589) [0xffffff800095d275]
                   *999  mount + 54 (kernel.development + 3440038) [0xffffff8000547da6]
                     *999  __mac_mount + 1004 (kernel.development + 3441052) [0xffffff800054819c]
                       *999  mount_common + 1303 (kernel.development + 3433207) [0xffffff80005462f7]
                         *999  prepare_coveredvp + 117 (kernel.development + 3438229) [0xffffff8000547695]
                           *999  zfs_vnop_fsync + 168 (zfs_vnops_osx.c:1643,8 in zfs + 1637736) [0xffffff7f82bb8d68]
                             *999  zfs_fsync + 380 (zfs_vnops.c:2485,3 in zfs + 1596204) [0xffffff7f82baeb2c]
                               *999  zil_commit + 156 (zil.c:2946,2 in zfs + 1738620) [0xffffff7f82bd177c]
                                 *999  zil_commit_impl + 101 (zil.c:2984,2 in zfs + 1738741) [0xffffff7f82bd17f5]
                                   *998  zil_commit_waiter + 246 (zil.c:2690,23 in zfs + 1739622) [0xffffff7f82bd1b66]
                                     *983  cv_timedwait_hires + 379 (spl-condvar.c:223,14 in zfs + 2467019) [0xffffff7f82c834cb]
                                       *982  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
                                         *980  _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
                                           *910  lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
                                             *842  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                                               *842  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                                                 *741  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135] (runnable)
                                                 *101  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]
                                             *68   thread_block_reason + 345 (kernel.development + 726889) [0xffffff80002b1769]
                                               *67   ml_set_interrupts_enabled + 48 (kernel.development + 2145584) [0xffffff800040bd30] (running)
                                               *1    return_to_iret + 276 (kernel.development + 179716) [0xffffff800022be04]
                                                 *1    ast_taken_kernel + 113 (kernel.development + 548657) [0xffffff8000285f31]
                                                   *1    thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                                                     *1    thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                                                       *1    machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135] (runnable)
                                           *69   lck_mtx_sleep_deadline + 105 (kernel.development + 666729) [0xffffff80002a2c69]
                                             *68   assert_wait_deadline + 152 (kernel.development + 725112) [0xffffff80002b1078]
                                               *61   ml_set_interrupts_enabled + 48 (kernel.development + 2145584) [0xffffff800040bd30] (running)
                                               *7    return_to_iret + 276 (kernel.development + 179716) [0xffffff800022be04]
                                                 *7    ast_taken_kernel + 113 (kernel.development + 548657) [0xffffff8000285f31]
                                                   *7    thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                                                     *7    thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                                                       *7    machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135] (runnable)
                                             *1    _global_eventq + 1 (kernel.development + 960657) [0xffffff80002ea891] (running)
                                           *1    lck_mtx_sleep_deadline + 176 (kernel.development + 666800) [0xffffff80002a2cb0]
                                             *1    lck_mtx_lock + 129 (kernel.development + 2141521) [0xffffff800040ad51] (running)
                                         *1    _sleep + 695 (kernel.development + 6503015) [0xffffff8000833a67] (running)
                                         *1    _sleep + 98 (kernel.development + 6502418) [0xffffff8000833812] (running)
                                       *1    msleep + 34 (kernel.development + 6504210) [0xffffff8000833f12] (running)
                                     *4    cv_timedwait_hires + 411 (spl-condvar.c:227,31 in zfs + 2467051) [0xffffff7f82c834eb]
                                       *4    gethrestime_sec + 17 (spl-time.c:108,5 in zfs + 2564369) [0xffffff7f82c9b111]
                                         *4    microtime + 25 (kernel.development + 6585785) [0xffffff8000847db9]
                                           *4    ml_set_interrupts_enabled + 48 (kernel.development + 2145584) [0xffffff800040bd30] (running)
                                     *3    spl_wdlist_settime + 1 (spl-mutex.c:73 in zfs + 2536849) [0xffffff7f82c94591] (running)
                                     *3    cv_timedwait_hires + 120 (spl-condvar.c:184,21 in zfs + 2466760) [0xffffff7f82c833c8]
                                       *1    _rtc_nanotime_read + 16 (kernel.development + 176208) [0xffffff800022b050] (running)
                                       *1    _rtc_nanotime_read + 12 (kernel.development + 176204) [0xffffff800022b04c] (running)
                                       *1    _rtc_nanotime_read + 7 (kernel.development + 176199) [0xffffff800022b047] (running)
                                     *2    current_thread + 1 (kernel.development + 2148737) [0xffffff800040c981] (running)
                                     *1    cv_timedwait_hires + 391 (spl-condvar.c:225,17 in zfs + 2467031) [0xffffff7f82c834d7] (running)
                                     *1    cv_timedwait_hires + 382 (spl-condvar.c:225,19 in zfs + 2467022) [0xffffff7f82c834ce] (running)
                                     *1    cv_timedwait_hires + 139 (spl-condvar.c:189,7 in zfs + 2466779) [0xffffff7f82c833db] (running)
                                   *1    zil_commit_waiter + 250 (zil.c:2694,17 in zfs + 1739626) [0xffffff7f82bd1b6a] (running)

There does not appear to be any other threads in play, generally they are all idle. For example;

other stacks

  Thread 0xea40             999 samples (1-999)       priority 81 (base 81)
 *999  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *999  taskq_thread + 711 (spl-taskq.c:1768,11 in zfs + 2561511) [0xffffff7f82c9a5e7]
     *999  taskq_thread_wait + 99 (spl-taskq.c:1677,3 in zfs + 2562403) [0xffffff7f82c9a963]
       *999  spl_cv_wait + 130 (spl-condvar.c:75,14 in zfs + 2466034) [0xffffff7f82c830f2]
         *999  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
           *999  _sleep + 474 (kernel.development + 6502794) [0xffffff800083398a]
             *999  lck_mtx_sleep + 136 (kernel.development + 666168) [0xffffff80002a2a38]
               *999  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                 *999  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                   *999  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]



  Thread 0xea41             999 samples (1-999)       priority 81 (base 81)     cpu time <0.001s
 *999  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *999  txg_quiesce_thread + 167 (txg.c:618,4 in zfs + 1030583) [0xffffff7f82b249b7]
     *999  txg_thread_wait + 132 (txg.c:254,3 in zfs + 1036404) [0xffffff7f82b26074]
       *999  spl_cv_wait + 130 (spl-condvar.c:75,14 in zfs + 2466034) [0xffffff7f82c830f2]
         *999  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
           *999  _sleep + 474 (kernel.development + 6502794) [0xffffff800083398a]
             *999  lck_mtx_sleep + 136 (kernel.development + 666168) [0xffffff80002a2a38]
               *999  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                 *999  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                   *999  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]

  Thread 0xea42             999 samples (1-999)       priority 81 (base 81)     cpu time <0.001s
 *999  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *998  txg_sync_thread + 373 (txg.c:542,4 in zfs + 1031157) [0xffffff7f82b24bf5]
     *998  txg_thread_wait + 92 (txg.c:251,10 in zfs + 1036364) [0xffffff7f82b2604c]
       *998  spl_cv_timedwait + 293 (spl-condvar.c:135,14 in zfs + 2466405) [0xffffff7f82c83265]
         *998  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
           *998  _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
             *998  lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
               *998  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                 *998  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                   *998  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]
   *1    txg_sync_thread + 760 (txg.c:579,3 in zfs + 1031544) [0xffffff7f82b24d78]
     *1    spa_sync + 1322 (spa.c:9023,2 in zfs + 918362) [0xffffff7f82b0935a]
       *1    spa_sync_iterate_to_convergence + 338 (spa.c:8826,3 in zfs + 920194) [0xffffff7f82b09a82]
         *1    ddt_sync + 211 (ddt.c:1154,9 in zfs + 207219) [0xffffff7f82a5b973]
           *1    zio_wait + 598 (zio.c:2203,11 in zfs + 1775062) [0xffffff7f82bda5d6]
             *1    spl_cv_timedwait + 293 (spl-condvar.c:135,14 in zfs + 2466405) [0xffffff7f82c83265]
               *1    msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
                 *1    _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
                   *1    lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
                     *1    thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
                       *1    thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                         *1    machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]

  Thread 0xea43             999 samples (1-999)       priority 81 (base 81)     cpu time <0.001s
 *999  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *999  mmp_thread + 1904 (mmp.c:674,10 in zfs + 800992) [0xffffff7f82aec8e0]
     *999  cv_timedwait_hires + 379 (spl-condvar.c:223,14 in zfs + 2467019) [0xffffff7f82c834cb]
       *999  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
         *999  _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
           *999  lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
             *999  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
               *999  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
                 *999  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]

  Thread 0xec17             67 samples (933-999)      priority 81 (base 81)     cpu time 0.003s
 *67  call_continuation + 46 (kernel.development + 176334) [0xffffff800022b0ce]
   *67  tx_flush_thread + 121 (apfs + 797086) [0xffffff7f818bb99e]
     *67  msleep + 98 (kernel.development + 6504274) [0xffffff8000833f52]
       *67  _sleep + 234 (kernel.development + 6502554) [0xffffff800083389a]
         *67  lck_mtx_sleep_deadline + 155 (kernel.development + 666779) [0xffffff80002a2c9b]
           *67  thread_block_reason + 332 (kernel.development + 726876) [0xffffff80002b175c]
             *67  thread_invoke + 1668 (kernel.development + 734164) [0xffffff80002b33d4]
               *67  machine_switch_context + 245 (kernel.development + 2154805) [0xffffff800040e135]

The code that is stuck is here:

https://github.com/openzfsonosx/openzfs/blob/master/module/zfs/zil.c#L2690


		if (lwb != NULL && lwb->lwb_state == LWB_STATE_OPENED) {
			ASSERT3B(timedout, ==, B_FALSE);

			/*
			 * If the lwb hasn't been issued yet, then we
			 * need to wait with a timeout, in case this
			 * function needs to issue the lwb after the
			 * timeout is reached; responsibility (2) from
			 * the comment above this function.
			 */
===>		clock_t timeleft = cv_timedwait_hires(&zcw->zcw_cv,
			    &zcw->zcw_lock, wakeup, USEC2NSEC(1),
			    CALLOUT_FLAG_ABSOLUTE);

			if (timeleft >= 0 || zcw->zcw_done)
				continue;

			timedout = B_TRUE;
			zil_commit_waiter_timeout(zilog, zcw);
It spins rather quickly, easily consuming a core on its own. Increasing the timeout does not fix the problem; it just lowers the load average. The waiter is simply never signalled, or if it is, the signal is missed.


lundman commented May 25, 2020

Adding some classy debugging printfs:

+       printf("%s: broadcast to %p\n", __func__, &zcw->zcw_cv);
        cv_broadcast(&zcw->zcw_cv);

+               printf("%s: broadcast2 to %p\n", __func__, &zcw->zcw_cv);
                cv_broadcast(&zcw->zcw_cv);


            clock_t timeleft = cv_timedwait_hires(&zcw->zcw_cv,
                            &zcw->zcw_lock, wakeup, USEC2NSEC(1),
                            CALLOUT_FLAG_ABSOLUTE);

+                       if (timeleft == -1) {
+                               printf("%s: %p spinning waiting for wakeup\n", __func__,
+                                       &zcw->zcw_cv);

We get the following output:

<zfs`zil_commit_waiter (zil.c:2699)> zil_commit_waiter: 0xffffff90ac9b3ee8 spinning waiting for wakeup
<zfs`zil_commit_waiter (zil.c:2699)> zil_commit_waiter: 0xffffff90ac9b3ee8 spinning waiting for wakeup

There is no signal delivered to 0xffffff90ac9b3ee8.


lundman commented May 25, 2020

Removing the 'static' in front of those two functions, we can confirm that neither is called:

# dtrace -n 'zil_commit_waiter_skip:entry' -n 'zil_commit_waiter_skip:return { printf("%d", arg1);}' -n 'zil_lwb_flush_vdevs_done:entry' -n 'zil_lwb_flush_vdevs_done:return {printf("%d", arg1);}'

dtrace: description 'zil_commit_waiter_skip:entry' matched 1 probe
dtrace: description 'zil_commit_waiter_skip:return ' matched 1 probe
dtrace: description 'zil_lwb_flush_vdevs_done:entry' matched 1 probe
dtrace: description 'zil_lwb_flush_vdevs_done:return ' matched 1 probe

# nada


lundman commented May 25, 2020

Removing all static and using:

# dtrace -n 'zil_*:entry' -n 'zil_*:return { printf("%d", arg1);}'

We end up with this log in its entirety, from start to hang:

dtrace zil*

CPU     ID                    FUNCTION:NAME
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168599168
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168600256
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169175152
  0  79243                  zil_sync:return -549527168128
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169175152
  0  79243                  zil_sync:return -549527168128
  1  79258                   zil_free:entry
  1  79259                  zil_free:return -549527162368
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return -549527168128
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 2
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 3
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168601344
  1  79258                   zil_free:entry
  1  79259                  zil_free:return -549527162368
  1  79256                  zil_alloc:entry
  1  79257                 zil_alloc:return -478168602432
  1  79260                   zil_open:entry
  1  79261                  zil_open:return -478168602432
  1  79274                 zil_replay:entry
  1  79164                zil_destroy:entry
  1  79165               zil_destroy:return -478169176368
  1  79275                zil_replay:return -478169176368
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168603520
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478163571440
  1  79243                  zil_sync:return 2
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478163571440
  1  79243                  zil_sync:return 2
  0  79258                   zil_free:entry
  0  79259                  zil_free:return -549527168128
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 2
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 3
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 4
  0  79256                  zil_alloc:entry
  0  79257                 zil_alloc:return -478168603520
  0  79258                   zil_free:entry
  0  79259                  zil_free:return -549527168128
  1  79256                  zil_alloc:entry
  1  79257                 zil_alloc:return -478168600256
  0  79258                   zil_free:entry
  0  79259                  zil_free:return -549527168128
  1  79276              zil_replaying:entry
  1  79277             zil_replaying:return 0
  1  79204             zil_itx_create:entry
  1  79205            zil_itx_create:return -478162252288
  1  79212             zil_itx_assign:entry
  1  79213            zil_itx_assign:return -549527162368
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169176368
  0  79243                  zil_sync:return -549527168128
  1  79238                 zil_commit:entry
  1  79240            zil_commit_impl:entry
  1  79214          zil_async_to_sync:entry
  1  79215         zil_async_to_sync:return 9
  1  79232    zil_alloc_commit_waiter:entry
  1  79233   zil_alloc_commit_waiter:return -478162256152
  1  79236      zil_commit_itx_assign:entry
  1  79204             zil_itx_create:entry
  1  79205            zil_itx_create:return -478162252800
  1  79212             zil_itx_assign:entry
  1  79213            zil_itx_assign:return -549527162368
  1  79237     zil_commit_itx_assign:return -549527162368
  1  79226          zil_commit_writer:entry
  1  79218        zil_get_commit_list:entry
  1  79219       zil_get_commit_list:return 9
  1  79220      zil_prune_commit_list:entry
  1  79221     zil_prune_commit_list:return -478162252224
  1  79224    zil_process_commit_list:entry
  1  79162                 zil_create:entry
  1  79130         zil_init_log_chain:entry
  1  79131        zil_init_log_chain:return -477873195880
  1  79152              zil_alloc_lwb:entry
  1  79244               zil_lwb_cons:entry
  1  79245              zil_lwb_cons:return 0
  1  79153             zil_alloc_lwb:return -478162260664
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169176368
  0  79243                  zil_sync:return -549527168128
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return -549527162368
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return 2
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return 3
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return 4
  0  79216                  zil_clean:entry
  0  79217                 zil_clean:return -478166638264
  1  79206             zil_itxg_clean:entry
  1  79207            zil_itxg_clean:return -549527162368
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169176368
  0  79243                  zil_sync:return -549527168128
  1  79242                   zil_sync:entry
  1  79128 zil_header_in_syncing_context:entry
  1  79129 zil_header_in_syncing_context:return -478169167856
  1  79243                  zil_sync:return -549527162368
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 2
  0  79242                   zil_sync:entry
  0  79128 zil_header_in_syncing_context:entry
  0  79129 zil_header_in_syncing_context:return -478169167856
  0  79243                  zil_sync:return 3
  0  79216                  zil_clean:entry
  0  79217                 zil_clean:return -478166638264
  0  79206             zil_itxg_clean:entry
  0  79207            zil_itxg_clean:return -549527168128
  0  79163                zil_create:return -478162260664
  0  79186            zil_itx_destroy:entry
  0  79187           zil_itx_destroy:return -549527168128
  0  79202             zil_lwb_commit:entry
  0  79192         zil_lwb_write_open:entry
  0  79190 zil_lwb_set_zio_dependency:entry
  0  79191 zil_lwb_set_zio_dependency:return -478162260664
  0  79193        zil_lwb_write_open:return -549527168128
  0  79174 zil_commit_waiter_link_lwb:entry
  0  79175 zil_commit_waiter_link_lwb:return -549527168128
  0  79203            zil_lwb_commit:return -478162260664
  0  79225   zil_process_commit_list:return 0
  0  79227         zil_commit_writer:return -549527168128
  0  79230          zil_commit_waiter:entry
^C
(never returns)


lundman commented May 25, 2020

The closest we come to cv_signal() appears to be entering zil_lwb_write_open(), but we do not end up in zio_root(..., zil_lwb_flush_vdevs_done, ...). Is that because we never issue it, or because it simply isn't called?


lundman commented May 25, 2020

Alternatively, zil_process_commit_list() can also call zil_commit_waiter_skip(), so which path are we expecting here?


lundman commented May 25, 2020

They turn out not to skip, due to:

zil_lwb_write_open(zilog_t *zilog, lwb_t *lwb)
{
 ....
    printf("%s: lwb_root_zio is %p\n", __func__, lwb->lwb_root_zio);
    if (lwb->lwb_root_zio == NULL) {

<zfs`zil_lwb_write_open (zil.c:1380)> zil_lwb_write_open: lwb_root_zio is 0x0

and

zil_process_commit_list(zilog_t *zilog)
{
....
    printf("%s: lwb is %p\n", __func__, lwb);
    if (lwb == NULL) {

<zfs`zil_process_commit_list (zil.c:2354)> zil_process_commit_list: lwb is 0xffffff90a608cd48

Might have to call upon @behlendorf to see if he has some insight.


lundman commented May 25, 2020

Ok, it would seem related to:

            if (timeleft >= 0 || zcw->zcw_done)
                continue;

if replaced with:

            if (timeleft != -1 || zcw->zcw_done)
                continue;

Perhaps we have a signed vs. unsigned issue with clock_t.

Should I #ifdef APPLE around it, or would upstream be accepting of the alternate style?


lundman commented May 25, 2020

adeb452

lundman closed this as completed Jun 11, 2020
@behlendorf

It looks like upstream may suffer from this same issue, openzfs/zfs#10440. It may be just a little harder to hit. Would you mind upstreaming your fix?


lundman commented Jun 11, 2020

It's in the queue, but will move it forward and push out today


lundman commented Jun 12, 2020

I don't see how it could affect Linux, but doesn't hurt to try

@behlendorf

Me either. It's a signed type on Linux so it should be fine, but we'll need this change anyway for macOS.


adamdmoss commented Jun 15, 2020

Added a review comment, but in case it gets lost I have the following concern:

AFAICT, the magic -1 number only means 'timeout' in the user-space version of cv_timedwait_hires(), and not (at least for the FreeBSD SPL) in the kernel versions of cv_timedwait_hires().

So I'm a bit worried that this is incorrect on some (or all ;) ) platforms, and is equivalent to just removing the 'timeleft' comparison completely...


lundman commented Jun 15, 2020

Likewise:

Since SPL is emulating the illumos API, we can see that cv_timedwait_hires() returns -1 for timeout:
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/condvar.c#L298

Linux and macOS both also do the same:
https://github.com/openzfs/zfs/blob/master/module/os/linux/spl/spl-condvar.c#L391
https://github.com/openzfsonosx/openzfs/blob/sorted2/module/os/macos/spl/spl-condvar.c#L243

Linux's clock_t is signed, so Linux was fine; macOS's clock_t is unsigned.
The test only cared whether the thread had been signalled (>= 0) or not. Unfortunately, with an unsigned clock_t, "-1" also satisfies that test, yet returning -1 was the only way out of the loop.

I just negated the test to check directly for -1 (or its unsigned equivalent); the caller doesn't care about the "time left". This fixed a hang in kernel code, since -1, 0, and any positive timeleft all satisfied ">= 0" forever.

FreeBSD should probably correct its cv_timedwait_hires(). Currently, there is no way to distinguish being signalled with zero time left from the timeout expiring with no signal. Overall, a positive return value means "signal received" and a negative one means "no signal received".

@mattmacy

Did this PR pass the FreeBSD bots? After rebasing I get test hangs.

@behlendorf

The CI did in fact catch this, but due to the other recent issues with FreeBSD head I didn't notice at the time and it slipped in. Sorry!


adamdmoss commented Jun 15, 2020

@lundman - thanks for the explanation; I think 'git grep' failed me when looking up the normal behavior.

Glad I sort of accidentally found a bug in the freebsd SPL though. 🤣

lundman pushed a commit that referenced this issue Feb 22, 2021
Mixing ZIL and normal allocations has several problems:

1. The ZIL allocations are allocated, written to disk, and then a few
seconds later freed.  This leaves behind holes (free segments) where the
ZIL blocks used to be, which increases fragmentation, which negatively
impacts performance.

2. When under moderate load, ZIL allocations are of 128KB.  If the pool
is fairly fragmented, there may not be many free chunks of that size.
This causes ZFS to load more metaslabs to locate free segments of 128KB
or more.  The loading happens synchronously (from zil_commit()), and can
take around a second even if the metaslab's spacemap is cached in the
ARC.  All concurrent synchronous operations on this filesystem must wait
while the metaslab is loading.  This can cause a significant performance
impact.

3. If the pool is very fragmented, there may be zero free chunks of
128KB or more.  In this case, the ZIL falls back to txg_wait_synced(),
which has an enormous performance impact.

These problems can be eliminated by using a dedicated log device
("slog"), even one with the same performance characteristics as the
normal devices.

This change sets aside one metaslab from each top-level vdev that is
preferentially used for ZIL allocations (vdev_log_mg,
spa_embedded_log_class).  From an allocation perspective, this is
similar to having a dedicated log device, and it eliminates the
above-mentioned performance problems.

Log (ZIL) blocks can be allocated from the following locations.  Each
one is tried in order until the allocation succeeds:
1. dedicated log vdevs, aka "slog" (spa_log_class)
2. embedded slog metaslabs (spa_embedded_log_class)
3. other metaslabs in normal vdevs (spa_normal_class)

The space required for the embedded slog metaslabs is usually between
0.5% and 1.0% of the pool, and comes out of the existing 3.2% of "slop"
space that is not available for user data.

On an all-ssd system with 4TB storage, 87% fragmentation, 60% capacity,
and recordsize=8k, testing shows a ~50% performance increase on random
8k sync writes.  On even more fragmented systems (which hit problem #3
above and call txg_wait_synced()), the performance improvement can be
arbitrarily large (>100x).

Reviewed-by: Serapheim Dimitropoulos <[email protected]>
Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Don Brady <[email protected]>
Reviewed-by: Mark Maybee <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #11389