txg_sync hung in zio_wait() #1928
Comments
Record the type, priority, and reexecute fields of a ZIO in the zevent log. This may provide helpful debugging information if a ZIO is hung. Signed-off-by: Ned Bass <[email protected]> Issue openzfs#1928
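A hedged sketch of what recording these fields might look like, assuming they are appended to the zevent ereport via fm_payload_set(); the payload name strings and surrounding code are illustrative rather than the exact diff:

```c
/*
 * Illustrative only: add the zio type, priority, and reexecute fields
 * to the zevent payload so a hung zio leaves more context behind.
 * The payload name strings are assumptions, not the merged macros.
 */
if (zio != NULL) {
	fm_payload_set(ereport,
	    "zio_type", DATA_TYPE_INT32, (int32_t)zio->io_type,
	    "zio_priority", DATA_TYPE_INT32, (int32_t)zio->io_priority,
	    "zio_reexecute", DATA_TYPE_UINT8, zio->io_reexecute,
	    NULL);
}
```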
With the above patch applied I reproduced the issue, and in addition to zio_flags, zio_stage, and zio_pipeline we have:
If a zio hangs, log its address in the zevent so we can chase it with a kernel debugger. This is a debug-only patch since we wouldn't normally want user tools to expose kernel memory addresses. Signed-off-by: Ned Bass <[email protected]> Issue openzfs#1928
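A debug-only sketch along the same lines, assuming the pointer is stuffed into the payload as a uint64; the field name is an assumption:

```c
/*
 * Debug-only, illustrative: record the kernel address of the zio in
 * the zevent so a hung zio can be inspected with crash or kgdb.
 */
#ifdef ZFS_DEBUG
	fm_payload_set(ereport,
	    "zio_object", DATA_TYPE_UINT64, (uint64_t)(uintptr_t)zio,
	    NULL);
#endif
```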
I reproduced this and got a dump of the missing zio using crash.
@nedbass Now that is interesting. The IO which is hung is a child IO for the mirror, and it appears never to have been dispatched to the Linux block layer. That means that either … What should happen is … If you can reproduce this, it would be interesting to instrument the error paths in …
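One hypothetical way to instrument such an error path; the variable names are assumptions and this is not code from the openzfs tree:

```c
/*
 * Hypothetical instrumentation: if handing the child zio to the block
 * layer fails, log enough context to dmesg that a silently dropped
 * zio can be spotted and matched against the zevent data.
 */
if (error != 0) {
	printk(KERN_WARNING "ZFS: zio %p not dispatched, error %d, "
	    "stage 0x%x, pipeline 0x%x\n", zio, error,
	    (unsigned int)zio->io_stage, (unsigned int)zio->io_pipeline);
}
```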
Also, this strikes me as suspicious:
especially since the pool is not a mirror but a single file vdev. Should that not be …
It seems like the taskq threads are deadlocking trying to do I/O during memory reclaim. I have a system in this state, and two …
So if … Should …
The vdev_file_io_start() function may be processing a zio that the txg_sync thread is waiting on. In this case it is not safe to perform memory allocations that may generate new I/O since this could cause a deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE instead of TQ_SLEEP. Issue openzfs#1928
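A minimal sketch of the change the commit describes; the taskq and callback names here are assumptions rather than a verbatim copy of the diff:

```c
/*
 * Sketch only: dispatch the file vdev I/O with TQ_PUSHPAGE so the
 * allocation inside taskq_dispatch() cannot enter direct reclaim and
 * recurse back into the filesystem while txg_sync waits on this zio.
 */
(void) taskq_dispatch(vdev_file_taskq, vdev_file_io_strategy,
    zio, TQ_PUSHPAGE /* was TQ_SLEEP */);
```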
When the txg_sync thread issues an I/O and blocks in zio_wait(), it is critical that all processes involved in handling that I/O use KM_PUSHPAGE when performing allocations. If they use KM_SLEEP, then during a low memory condition direct reclaim may be invoked and may attempt to flush dirty data to the file system. The thread will then attempt to assign a TX and will block until the txg_sync thread completes. The end result is a deadlock, with the txg_sync thread blocked on the original I/O and the taskq thread blocked on the txg_sync thread. To prevent developers from accidentally introducing this type of deadlock, this patch passes the TQ_PUSHPAGE_THREAD flag when dispatching an I/O to a taskq which the txg_sync thread is waiting on. This causes the taskq thread to set the PF_NOFS flag in its task struct while the zio_execute() function is being executed, thereby ensuring that any accidental misuse of KM_SLEEP is quickly caught and fixed. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#1928
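On the ZFS side, the dispatch described in the patch might look roughly like the following. This is a hedged sketch; TQ_PUSHPAGE_THREAD is the flag proposed in this issue, not a merged SPL interface:

```c
/*
 * Sketch: hand zio_execute() to the issue taskq with the proposed
 * TQ_PUSHPAGE_THREAD flag so the servicing thread runs it with
 * PF_NOFS set and any stray KM_SLEEP allocation is caught.
 */
taskq_dispatch_ent(tq, (task_func_t *)zio_execute, zio,
    TQ_PUSHPAGE_THREAD, &zio->io_tqent);
```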
When the TQ_PUSHPAGE_THREAD flag is passed to taskq_dispatch(), the PF_NOFS bit will be set while the passed function is executing. This makes it possible to detect when KM_SLEEP is being used inappropriately, even for delegated work. Signed-off-by: Brian Behlendorf <[email protected]> openzfs/zfs#1928
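On the SPL side, the taskq thread would presumably bracket the delegated function with the PF_NOFS debug flag. A rough sketch, with the helper names being hypothetical:

```c
/*
 * Hypothetical sketch of the SPL taskq servicing loop: mark the
 * thread PF_NOFS for the duration of the function when the proposed
 * TQ_PUSHPAGE_THREAD flag was used at dispatch time.
 */
if (t->tqent_flags & TQ_PUSHPAGE_THREAD)
	sanitize_flags_push(PF_NOFS);	/* hypothetical helper */

t->tqent_func(t->tqent_arg);

if (t->tqent_flags & TQ_PUSHPAGE_THREAD)
	sanitize_flags_pop();		/* hypothetical helper */
```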
@nedbass Yes, that's exactly what's going on. I've pushed two additional patches for review which extend the existing infrastructure so we can automatically detect cases like this.
After fixing the above …
@nedbass Both the zfs_iput_taskq and kswapd0 threads look like additional collateral damage to me. Neither should be holding any resources which would prevent the I/O from completing.
When the txg_sync thread issues an I/O and blocks in zio_wait(), it is critical that all processes involved in handling that I/O use KM_PUSHPAGE when performing allocations. If they use KM_SLEEP, then during a low memory condition direct reclaim may be invoked and may attempt to flush dirty data to the file system. The thread will then attempt to assign a TX and will block until the txg_sync thread completes. The end result is a deadlock, with the txg_sync thread blocked on the original I/O and the taskq thread blocked on the txg_sync thread. To prevent developers from accidentally introducing this type of deadlock, this patch passes the TQ_PUSHPAGE_THREAD flag when dispatching an I/O to a taskq which the txg_sync thread is waiting on. This causes the taskq thread to set the PF_NOFS flag in its task struct while the zio_execute() function is being executed, thereby ensuring that any accidental misuse of KM_SLEEP is quickly caught and fixed. Finally, this patch addresses the vdev_file_io_start() function, which may be processing a zio that the txg_sync thread is waiting on. In this case it is not safe to perform memory allocations that may generate new I/O, since this could cause a deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE instead of TQ_SLEEP. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#1928
The vdev_file_io_start() function may be processing a zio that the txg_sync thread is waiting on. In this case it is not safe to perform memory allocations that may generate new I/O since this could cause a deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE instead of TQ_SLEEP. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue #1928
@nedbass I've merged your initial KM_PUSHPAGE fix into master, which is clearly correct and addresses the first deadlock described above. I'm going to hold off merging any of the more involved debugging patches until we have a better handle on the other, similar issue you encountered. 04aa2de vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE)
The vdev_file_io_start() function may be processing a zio that the txg_sync thread is waiting on. In this case it is not safe to perform memory allocations that may generate new I/O since this could cause a deadlock. To avoid this, call taskq_dispatch() with TQ_PUSHPAGE instead of TQ_SLEEP. Signed-off-by: Ned Bass <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#1928
@ryao I suspect you're right, but I'll need to spend more time looking at this to say for certain. However, we have stopped seeing this failure, which does suggest it was fixed.
Closing. This was almost certainly addressed by openzfs/spl@a3c1eb7.
I've been working on a system with @eolson78 which shows virtually identical delay events and very similar stacks. It can be reproduced rather easily on a fairly large EC2 instance against which a pair of NFS clients are running the "fileserver" workload from filebench, and on which snapshot and zfs send operations are also being performed. The blocked process stacks from syslog and the event log are available at https://gist.github.com/dweeezil/c17b9b759935bec0045a. The installation of ZFS is mostly 0.6.3, but with the mutex_exit serialization and "pipeline invoke next stage" patches added as well as a few other post-0.6.3 patches. It should very closely match the code at https://github.com/dweeezil/zfs/tree/softnas and https://github.com/dweeezil/spl/tree/softnas. After the IO is blocked for a sufficiently long time, the NFS clients are disconnected, and once that happens the system returns to normal operation fairly quickly. This may well be a completely unrelated problem, but since it looked so similar to this issue I figured a followup would be warranted even though the issue is closed.
Record the type, priority, and reexecute fields of a ZIO in the zevent log. This may provide helpful debugging information if a ZIO is hung. Signed-off-by: Ned Bass <[email protected]> Issue openzfs#1928
If a zio hangs, log its address in the zevent so we can chase it with a kernel debugger. This is a debug-only patch since we wouldn't normally want user tools to expose kernel memory addresses. Signed-off-by: Ned Bass <[email protected]> Issue openzfs#1928
This is probably unrelated, but I thought it worth mentioning that EC2 does do things like IO limiting. Perhaps they're doing something else in their IO stack which is causing the issue. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
@behlendorf I'm not sure it's unrelated at all. I've become concerned about either the Xen blkfront driver itself, or a related problem in its mating driver on the back end (rate limiting?). During one of the apparent hang periods, I noticed the inflight counters for the (single vdev) xvd device seemingly "stuck":
and a suspiciously matching counter in the io kstat for the pool:
I posted the followup to this closed issue because it appeared to almost precisely match @nedbass' original data points: stacks, zio pipeline & flags, etc. It just dawned on me that I might try some non-ZFS I/O to the same device during these periods where the I/O seems to be hung up.
In testing #1696, a virtual machine running Fedora Core 19 hung while running the filebench test. The txg_sync thread is stuck in zio_wait(). Here are zpool events -v and task state output.
Decoding some of the zio fields from the zevent data, we have