txg_sync hung in zio_wait() #1928

Closed
nedbass opened this issue Dec 4, 2013 · 15 comments

nedbass commented Dec 4, 2013

In testing #1696, a virtual machine running Fedora Core 19 hung while running the filebench test. The txg_sync thread is stuck in zio_wait():

# cat /proc/`pidof txg_sync`/stack
[<ffffffffa02cf17a>] cv_wait_common+0xda/0x3b0 [spl]
[<ffffffffa02cf4a8>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa0457c7b>] zio_wait+0x1fb/0x490 [zfs]
[<ffffffffa03bbca6>] dsl_pool_sync+0x136/0x730 [zfs]
[<ffffffffa03dc599>] spa_sync+0x429/0xd90 [zfs]
[<ffffffffa03f44b4>] txg_sync_thread+0x374/0x6b0 [zfs]
[<ffffffffa02c5433>] thread_generic_wrapper+0x83/0xe0 [spl]
[<ffffffff81088660>] kthread+0xc0/0xd0
[<ffffffff816567ac>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

Here are zpool events -v and task state output.

Decoding some of the zio fields from the zevent data, we have

zio_flags = 0x8020640 /* ZIO_FLAG_CANFAIL |
                         ZIO_FLAG_DONT_RETRY |
                         ZIO_FLAG_DONT_CACHE |
                         ZIO_FLAG_DONT_QUEUE |
                         ZIO_FLAG_DELEGATED */

zio_stage = 0x20000 /* ZIO_STAGE_VDEV_IO_START */

zio_pipeline = 0x2e0000 /* ZIO_STAGE_VDEV_IO_START |
                           ZIO_STAGE_VDEV_IO_DONE |
                           ZIO_STAGE_VDEV_IO_ASSESS |
                           ZIO_STAGE_DONE */
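
For reference, here is a minimal stand-alone sketch (not from the issue) of how these words can be decoded: each field is a bitmask, so printing the set bit positions and matching them against the ZIO_FLAG_* and ZIO_STAGE_* definitions in zio.h and zio_impl.h of the source tree in use yields the symbolic names shown in the comments above.

/*
 * Sketch only: list the set bits of the zio_flags, zio_stage, and
 * zio_pipeline words quoted from the zevent above.  The bit-to-name
 * mapping depends on the ZFS version and must be read from zio.h and
 * zio_impl.h of the running module.
 */
#include <stdio.h>
#include <stdint.h>

static void
print_set_bits(const char *label, uint64_t word)
{
	printf("%s = 0x%llx:", label, (unsigned long long)word);
	for (int bit = 0; bit < 64; bit++)
		if (word & (1ULL << bit))
			printf(" bit%d", bit);
	printf("\n");
}

int
main(void)
{
	/* Values taken from the zevent quoted above. */
	print_set_bits("zio_flags",    0x8020640ULL);	/* five flags set */
	print_set_bits("zio_stage",    0x20000ULL);	/* current stage */
	print_set_bits("zio_pipeline", 0x2e0000ULL);	/* remaining stages */
	return (0);
}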
nedbass added a commit to nedbass/zfs that referenced this issue Dec 5, 2013
Record the type, priority, and reexecute fields of a ZIO in the
zevent log.  This may provide helpful debugging information if
a ZIO is hung.

Signed-off-by: Ned Bass <[email protected]>

Issue openzfs#1928
@behlendorf (Contributor)

With the above patch applied I reproduced the issue, and in addition to zio_flags, zio_stage, and zio_pipeline we have:

zio_type = 0x2 /*  ZIO_TYPE_WRITE */
zio_priority = 0x3
zio_reexecute = 0x0

nedbass added a commit to nedbass/zfs that referenced this issue Dec 6, 2013
If a zio hangs, log its address in the zevent so we can chase it with a
kernel debugger.  This is a debug-only patch since we wouldn't normally
want user tools to expose kernel memory addresses.

Signed-off-by: Ned Bass <[email protected]>

Issue openzfs#1928

nedbass commented Jan 11, 2014

I reproduced this and got a dump of the missing zio using crash.

@behlendorf (Contributor)

@nedbass Now that is interesting. The IO which is hung is a child IO for the mirror, and it appears to never have been dispatched to the Linux block layer. That means that either vdev_disk_io_start() was never called or there's some flaw which can cause it to strand the IO.

What should happen is that vdev_disk_io_start() will be called to start the IO in the ZIO_STAGE_VDEV_IO_START stage (which we're in). vdev_disk_io_start() in turn calls __vdev_disk_physio(), which maps the individual pages in the zio into a bio and submits it to the kernel. The odd thing is that if this process had succeeded (as it virtually always does) I'd have expected to see zio->io_delay set to the time in jiffies at which the IO was submitted, not 0. And if it had failed, zio->io_error should contain the error code from __vdev_disk_physio(), not 0. It's as if we returned success and never submitted the IO, but I don't see how that's possible.

If you can reproduce this it would be interesting to instrument the error paths in __vdev_disk_physio(). Perhaps we are hitting one of the unlikely failure cases and then somehow the error code is being lost or overwritten?
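
One hedged way to do that instrumentation (this macro is hypothetical, not something in the ZFS tree): wrap the error returns in __vdev_disk_physio() so any non-zero error code is logged with its location before it can be lost or overwritten further up the stack.

#include <linux/kernel.h>

/*
 * Hypothetical debug-only helper: replace `return (err);` at the error
 * paths of __vdev_disk_physio() with VDEV_DISK_RET(err) so every failure
 * leaves a breadcrumb in dmesg, even if the caller later drops the code.
 */
#define VDEV_DISK_RET(err)						\
do {									\
	int _e = (err);							\
	if (_e != 0)							\
		printk(KERN_WARNING "ZFS: %s:%d returning %d\n",	\
		    __func__, __LINE__, _e);				\
	return (_e);							\
} while (0)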


nedbass commented Jan 13, 2014

Also, this strikes me as suspicious:

io_physdone = 0xffffffffa03b8000 <arc_write_physdone>,
io_done = 0xffffffffa045b7e0 <vdev_mirror_child_done>, 

especially since the pool is not a mirror but a single file vdev. Should that not be arc_write_done and arc_write_physdone?


nedbass commented Jan 15, 2014

It seems like the taskq threads are deadlocking while trying to do I/O during memory reclaim. I have a system in this state, with two z_wr_int_X threads stuck in this stack:

[<ffffffffa02d225d>] cv_wait_common+0x13d/0x3b0 [spl]
[<ffffffffa02d24e5>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa04437b3>] txg_wait_open+0xc3/0x230 [zfs]
[<ffffffffa03e70e3>] dmu_tx_wait+0x443/0x4e0 [zfs]
[<ffffffffa03e7259>] dmu_tx_assign+0xd9/0x810 [zfs]
[<ffffffffa0470529>] zfs_rmnode+0x159/0x4e0 [zfs]
[<ffffffffa0499f78>] zfs_zinactive+0xa8/0x1a0 [zfs]
[<ffffffffa0491761>] zfs_inactive+0x71/0x470 [zfs]
[<ffffffffa04b3d74>] zpl_evict_inode+0x24/0x30 [zfs]
[<ffffffff811c839f>] evict+0x9f/0x190
[<ffffffff811c84ce>] dispose_list+0x3e/0x50
[<ffffffff811c9327>] prune_icache_sb+0x47/0x60
[<ffffffff811b1f49>] super_cache_scan+0xf9/0x160
[<ffffffff81153a6f>] shrink_slab+0x1cf/0x380
[<ffffffff81156cd1>] do_try_to_free_pages+0x491/0x670
[<ffffffff81156f9f>] try_to_free_pages+0xef/0x170
[<ffffffff8114b659>] __alloc_pages_nodemask+0x739/0xaa0
[<ffffffff8118a2d9>] alloc_pages_current+0xa9/0x170
[<ffffffff81192bcd>] new_slab+0x2ad/0x350
[<ffffffff81660d09>] __slab_alloc+0x2a4/0x4a2
[<ffffffff8119508c>] __kmalloc+0x1cc/0x240
[<ffffffffa02c6a58>] kmem_alloc_debug+0x258/0x420 [spl]
[<ffffffffa02c999f>] task_alloc+0x1af/0x510 [spl]
[<ffffffffa02c9ecb>] taskq_dispatch+0x1cb/0x470 [spl]
[<ffffffffa0450b86>] vdev_file_io_start+0x46/0x160 [zfs]
[<ffffffffa04aad03>] zio_vdev_io_start+0x113/0x570 [zfs]
[<ffffffffa04a2d80>] zio_execute+0xf0/0x310 [zfs]
[<ffffffffa045607d>] vdev_queue_io_done+0x15d/0x3d0 [zfs]
[<ffffffffa04a1fa8>] zio_vdev_io_done+0x88/0x1e0 [zfs]
[<ffffffffa04a2d80>] zio_execute+0xf0/0x310 [zfs]
[<ffffffffa02ca73d>] taskq_thread+0x24d/0x630 [spl]
[<ffffffff8108b0d0>] kthread+0xc0/0xd0
[<ffffffff816715fc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

So if txg_sync is waiting on these zios, obviously txg_wait_open() is never going to return.

Should vdev_file_io_start use TQ_PUSHPAGE instead of TQ_SLEEP?
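
For context, a rough sketch of the change being proposed (the function body is elided and the taskq and helper names are approximations of the 0.6.x sources, so treat them as assumptions): the dispatch inside vdev_file_io_start() switches from TQ_SLEEP to TQ_PUSHPAGE so the allocation inside taskq_dispatch() cannot enter direct reclaim and recurse back into the filesystem.

/*
 * Sketch only -- identifiers approximated from the 0.6.x sources,
 * not an exact patch.
 */
static int
vdev_file_io_start(zio_t *zio)
{
	/* ... existing checks and setup unchanged ... */

	taskq_dispatch(vdev_file_taskq, vdev_file_io_strategy, zio,
	    TQ_PUSHPAGE);	/* was TQ_SLEEP, which may enter reclaim */

	return (ZIO_PIPELINE_STOP);
}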

nedbass added a commit to nedbass/zfs that referenced this issue Jan 15, 2014
The vdev_file_io_start() function may be processing a zio that the
txg_sync thread is waiting on.  In this case it is not safe to perform
memory allocations that may generate new I/O since this could cause a
deadlock.  To avoid this, call taskq_dispatch() with TQ_PUSHPAGE
instead of TQ_SLEEP.

Issue openzfs#1928
behlendorf added a commit to behlendorf/zfs that referenced this issue Jan 16, 2014
When the txg_sync thread issues an I/O and blocks in zio_wait(),
it is critical that all processes involved in handling that
I/O use KM_PUSHPAGE when performing allocations.

If they use KM_SLEEP then it is possible that during a low memory
condition direct reclaim will be invoked and it may attempt to
flush dirty data to the file system.  This will result in the
thread attempting to assign a TX blocking until the txg_sync
thread completes.

The end result is a deadlock with the txg_sync thread blocked on
the original I/O, and the taskq thread blocked on the txg_sync
thread.

To prevent developers from accidentally introducing this type
of deadlock, this patch passes the TQ_PUSHPAGE_THREAD flag when
dispatching an I/O to a taskq which the txg_sync thread is waiting
on.  This causes the taskq thread to set the PF_NOFS flag in its
task struct while the zio_execute() function is being executed,
thereby ensuring that any accidental misuse of KM_SLEEP is
quickly caught and fixed.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1928
behlendorf added a commit to behlendorf/spl that referenced this issue Jan 16, 2014
When the TQ_PUSHPAGE_THREAD flag is passed to taskq_dispatch() then
the PF_NOFS bit will be set while the passed function is
executing.  This makes it possible to detect when KM_SLEEP is
being used inappropriately even for delegated work.

Signed-off-by: Brian Behlendorf <[email protected]>
openzfs/zfs#1928
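
A conceptual sketch of the mechanism described in the two commit messages above (this is not the actual SPL implementation; TQ_PUSHPAGE_THREAD and PF_NOFS are the flags from the patches under review): while running a task dispatched with the flag, the taskq worker marks itself PF_NOFS so any KM_SLEEP allocation performed on its behalf is flagged immediately instead of silently risking a reclaim deadlock.

/*
 * Conceptual sketch only -- not the real SPL taskq code.  The worker
 * sets PF_NOFS around the delegated function (e.g. zio_execute()) when
 * the dispatcher asked for TQ_PUSHPAGE_THREAD semantics.
 */
static void
taskq_run_delegated(void (*func)(void *), void *arg, uint_t flags)
{
	boolean_t nofs = !!(flags & TQ_PUSHPAGE_THREAD);

	if (nofs)
		current->flags |= PF_NOFS;	/* SPL debug-only task flag */

	func(arg);

	if (nofs)
		current->flags &= ~PF_NOFS;
}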
@behlendorf (Contributor)

@nedbass Yes, that's exactly what's going on. I've pushed two additional patches for review which extend the existing infrastructure so we can automatically detect cases like this.


nedbass commented Jan 16, 2014

After fixing the above TQ_PUSHPAGE bug, I ran into another similar deadlock which I haven't fully analyzed yet. Here are the interesting stacks:

txg_sync
[<ffffffffa02d61fa>] cv_wait_common+0xda/0x3b0 [spl]
[<ffffffffa02d6528>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa04abdfb>] zio_wait+0x1fb/0x490 [zfs]
[<ffffffffa040ff16>] dsl_pool_sync+0x136/0x730 [zfs]
[<ffffffffa04307e5>] spa_sync+0x425/0xd80 [zfs]
[<ffffffffa04486e4>] txg_sync_thread+0x374/0x6b0 [zfs]
[<ffffffffa02cc493>] thread_generic_wrapper+0x83/0xe0 [spl]
[<ffffffff8108b0d0>] kthread+0xc0/0xd0
[<ffffffff816715fc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

zfs_iput_taskq_
[<ffffffff811c7821>] __wait_on_freeing_inode+0x91/0xc0
[<ffffffff811c7a31>] find_inode_fast+0x91/0xb0
[<ffffffff811c854a>] ilookup+0x6a/0xc0
[<ffffffffa049d343>] zfs_zget+0xe3/0x3d0 [zfs]
[<ffffffffa047584e>] zfs_unlinked_drain+0x8e/0x180 [zfs]
[<ffffffffa02ce73d>] taskq_thread+0x24d/0x630 [spl]
[<ffffffff8108b0d0>] kthread+0xc0/0xd0
[<ffffffff816715fc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

kswapd0
[<ffffffffa02d625d>] cv_wait_common+0x13d/0x3b0 [spl]
[<ffffffffa02d64e5>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa04477b3>] txg_wait_open+0xc3/0x230 [zfs]
[<ffffffffa03eb0e3>] dmu_tx_wait+0x443/0x4e0 [zfs]
[<ffffffffa03eb259>] dmu_tx_assign+0xd9/0x810 [zfs]
[<ffffffffa0474529>] zfs_rmnode+0x159/0x4e0 [zfs]
[<ffffffffa049df78>] zfs_zinactive+0xa8/0x1a0 [zfs]
[<ffffffffa0495761>] zfs_inactive+0x71/0x470 [zfs]
[<ffffffffa04b7d74>] zpl_evict_inode+0x24/0x30 [zfs]
[<ffffffff811c839f>] evict+0x9f/0x190
[<ffffffff811c8b85>] iput+0xf5/0x180
[<ffffffffa04999d8>] zfs_inode_destroy+0xf8/0x1d0 [zfs]
[<ffffffffa04b7db3>] zpl_inode_destroy+0x23/0x70 [zfs]
[<ffffffff811c82d8>] destroy_inode+0x38/0x60
[<ffffffff811c8407>] evict+0x107/0x190
[<ffffffff811c84ce>] dispose_list+0x3e/0x50
[<ffffffff811c9327>] prune_icache_sb+0x47/0x60
[<ffffffff811b1f49>] super_cache_scan+0xf9/0x160
[<ffffffff81153a6f>] shrink_slab+0x1cf/0x380
[<ffffffff81156497>] kswapd_shrink_zone+0x117/0x1b0
[<ffffffff811577dc>] kswapd+0x45c/0x840
[<ffffffff8108b0d0>] kthread+0xc0/0xd0
[<ffffffff816715fc>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

@behlendorf (Contributor)

@nedbass Both the zfs_iput_taskq and kswapd0 threads look like additional collateral damage to me. Neither should be holding any resources which would prevent the I/O from completing.

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jan 17, 2014
When the txg_sync thread issues an I/O and blocks in zio_wait(),
it is critical that all processes involved in handling that
I/O use KM_PUSHPAGE when performing allocations.

If they use KM_SLEEP then it is possible that during a low memory
condition direct reclaim will be invoked and it may attempt to
flush dirty data to the file system.  This will result in the
thread attempting to assign a TX blocking until the txg_sync
thread completes.

The end result is a deadlock with the txg_sync thread blocked on
the original I/O, and the taskq thread blocked on the txg_sync
thread.

To prevent developers from accidentally introducing this type
of deadlock, this patch passes the TQ_PUSHPAGE_THREAD flag when
dispatching an I/O to a taskq which the txg_sync thread is waiting
on.  This causes the taskq thread to set the PF_NOFS flag in its
task struct while the zio_execute() function is being executed,
thereby ensuring that any accidental misuse of KM_SLEEP is
quickly caught and fixed.

Finally, this patch addresses the vdev_file_io_start() function
which may be processing a zio that the txg_sync thread is waiting
on.  In this case it is not safe to perform memory allocations
that may generate new I/O since this could cause a deadlock.
To avoid this, call taskq_dispatch() with TQ_PUSHPAGE instead of
TQ_SLEEP.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1928
behlendorf pushed a commit that referenced this issue Jan 23, 2014
The vdev_file_io_start() function may be processing a zio that the
txg_sync thread is waiting on.  In this case it is not safe to perform
memory allocations that may generate new I/O since this could cause a
deadlock.  To avoid this, call taskq_dispatch() with TQ_PUSHPAGE
instead of TQ_SLEEP.

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue #1928
@behlendorf (Contributor)

@nedbass I've merged your initial KM_PUSHPAGE fix into master, which is clearly correct and addresses the first deadlock described above. I'm going to hold off merging any of the more involved debugging patches until we have a better handle on the other similar issue you encountered.

04aa2de vdev_file_io_start() to use taskq_dispatch(TQ_PUSHPAGE)

behlendorf modified the milestones: 0.6.4, 0.6.3 on Apr 4, 2014
ryao pushed a commit to ryao/zfs that referenced this issue Apr 9, 2014
The vdev_file_io_start() function may be processing a zio that the
txg_sync thread is waiting on.  In this case it is not safe to perform
memory allocations that may generate new I/O since this could cause a
deadlock.  To avoid this, call taskq_dispatch() with TQ_PUSHPAGE
instead of TQ_SLEEP.

Signed-off-by: Ned Bass <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1928

ryao commented Jun 19, 2014

@nedbass 6f9548c should have fixed this.

@behlendorf (Contributor)

@ryao I suspect you're right, but I'll need to spend more time looking at this to say for certain. However, we have stopped seeing this failure which does suggest it was fixed.

behlendorf removed this from the 0.6.4 milestone on Oct 29, 2014
@behlendorf (Contributor)

Closing. This was almost certainly addressed by openzfs/spl@a3c1eb7.


dweeezil commented May 7, 2015

I've been working on a system with @eolson78 which shows virtually identical delay events and very similar stacks. It can be reproduced rather easily on a fairly large EC2 instance against which a pair of NFS clients are running the "fileserver" workload from filebench, and on which snapshots and zfs send operations are also being performed.

The blocked process stacks from syslog and the event log are available at https://gist.github.com/dweeezil/c17b9b759935bec0045a.

The installation of ZFS is mostly 0.6.3 but with the mutex_exit serialization and "pipeline invoke next stage" patches added as well as a few other post-0.6.3 patches. It should very closely match the code at https://github.com/dweeezil/zfs/tree/softnas and https://github.com/dweeezil/spl/tree/softnas.

After the IO is blocked for a sufficiently long time, the NFS clients are disconnected and once that happens, the system returns to normal operation fairly quickly.

This may certainly be a completely unrelated problem but since it looked so similar to this issue, I figured a followup to it would be warranted even though the issue is closed.

dweeezil pushed a commit to dweeezil/zfs that referenced this issue May 7, 2015
Record the type, priority, and reexecute fields of a ZIO in the
zevent log.  This may provide helpful debugging information if
a ZIO is hung.

Signed-off-by: Ned Bass <[email protected]>

Issue openzfs#1928
dweeezil pushed a commit to dweeezil/zfs that referenced this issue May 7, 2015
If a zio hangs, log its address in the zevent so we can chase it with a
kernel debugger.  This is a debug-only patch since we wouldn't normally
want user tools to expose kernel memory addresses.

Signed-off-by: Ned Bass <[email protected]>

Issue openzfs#1928
@behlendorf (Contributor)

This is probably unrelated, but I thought it worth mentioning that EC2 does do things like IO limiting. Perhaps they're doing something else in their IO stack which is causing the issue.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html


dweeezil commented May 8, 2015

@behlendorf I'm not sure it's unrelated at all. I've become concerned about either the Xen blkfront driver itself, or a related problem in its mating driver on the back end (rate limiting?).

During one of the apparent hang periods, I noticed the inflight counters for the (single vdev) xvd device seemingly "stuck":

 ==> /sys/block/xvdg/inflight <==
       3       10

and a suspiciously matching counter in the io kstat for the pool:

# cat /proc/spl/kstat/zfs/pserver/io
11 3 0x00 1 80 8497935534207 180183974038120
nread    nwritten reads    writes   wtime    wlentime wupdate  rtime    rlentime rupdate  wcnt     rcnt    
11206622687232 5908279152640 130835592 63261358 148450349353521 2347351960395491282 180183973883389 149605975852428 2347351960395491282 180183973883642 5580     13 

I posted the followup to this closed issue because it appeared to almost precisely match @nedbass' original data points: stacks, zio pipeline & flags, etc.

It just dawned on me that I might try some non-ZFS I/O to the same device during these periods where the I/O seems to be hung up.
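
A rough userland sketch of that idea (the device path is just the one from the report above, and the tool itself is hypothetical): issue a small O_DIRECT read straight at the xvd device while the ZFS I/O appears hung. If the raw read also stalls, the problem is below ZFS (blkfront or its back end); if it completes promptly, ZFS or the vdev queue becomes the more likely suspect.

/*
 * Hypothetical probe: one aligned O_DIRECT read from the raw device,
 * bypassing ZFS and the page cache.  Time it with `time ./a.out`.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	const char *dev = "/dev/xvdg";	/* device from the report above */
	void *buf;
	int fd;

	if (posix_memalign(&buf, 4096, 4096) != 0)
		return (1);

	fd = open(dev, O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	if (pread(fd, buf, 4096, 0) != 4096)
		perror("pread");
	else
		printf("read of %s completed\n", dev);

	close(fd);
	free(buf);
	return (0);
}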
