
txg_sync thread stuck in zio_wait() for more than 12 hours with no pending/running IO #2850

Closed
thegreatgazoo opened this issue Oct 29, 2014 · 16 comments

@thegreatgazoo

This happened on CentOS 6.5 (kernel 2.6.32-431.29.2) with ZFS 0.6.3-1 (srcversion 533BB7E5866E52F63B9ACCB). The txg_sync thread has now been hung for more than 12 hours:

# cat /proc/`pidof txg_sync`/stack
[<ffffffffa011b41c>] cv_wait_common+0x8c/0x100 [spl]
[<ffffffffa011b4a8>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa02610ab>] zio_wait+0xfb/0x1b0 [zfs]
[<ffffffffa020ee95>] spa_sync+0x455/0xa60 [zfs]
[<ffffffffa021b916>] txg_sync_thread+0x2e6/0x510 [zfs]
[<ffffffffa0116c2f>] thread_generic_wrapper+0x5f/0x70 [spl]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

But /proc/spl/kstat/zfs/lustre-ost2/io showed wcnt=0 and rcnt=0, which I took to mean that no ZIO is queued at the vdev (the pool sits on a single /dev/vda). /proc/diskstats also showed 0 I/Os currently in progress. I checked zpool events -v and found nothing about any zio. The last event on the pool was:

Oct 28 2014 21:47:00.346936756 ereport.fs.zfs.config.sync
        class = "ereport.fs.zfs.config.sync"
        ena = 0x1c00468656200001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xdde5fa1af9ab382a
        (end detector)
        pool = "lustre-ost2"
        pool_guid = 0xdde5fa1af9ab382a
        pool_context = 0x0
        pool_failmode = "wait"
        time = 0x54507144 0x14add5b4
        eid = 0x38

But the hang began at about Oct 29 2014 01:40.

The system still has about 1/3 of its 1.9G of memory free; arcsz is only 20M. Swap is not on ZFS. So it's probably not a VM-related deadlock.

I've hit this several times already, but this time I can keep the system live for as long as it takes to troubleshoot. Any hints on debugging?

@behlendorf
Contributor

I've seen this rarely and it's been hard to reproduce. I'd start by dumping all the kernel stacks and then taking a close look at the various z_* processes. The outstanding zio may be hung in the pipeline for some non-obvious reason.

@thegreatgazoo
Author

I checked all the ZFS threads. Only z_null_iss/0 looked suspicious: it's always in the running state, but its stack never changes:

[<ffffffffa0117686>] taskq_thread+0x3c6/0x3f0 [spl]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

  PID USER PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
19582 root 39 19    0   0   0 R  8.0  0.0 85:17.34 z_null_iss/0

@behlendorf
Contributor

This might be another instance of #2523, which has been difficult to pin down. Usually in this case you'll be able to identify a process spinning in mutex_unlock_slowpath.

@thegreatgazoo
Author

Thanks, I'll try /pull/2828. The good news is we hit it quite often (about 6% of the time) during a particular test.

@behlendorf
Contributor

I've been looking for a good reproducer. What's the easiest way to hit this in a VM with Lustre?

@thegreatgazoo
Author

It's most often hit in sanity test_132; more details are in LU-5242. I was also able to hit it a couple of times with VMs configured with only 1 CPU, so it's likely not caused by a race.

@justinkb

justinkb commented Nov 4, 2014

Is this kernel log a symptom of this bug?

Nov 04 23:13:25 nu kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [txg_sync:501]
Nov 04 23:13:25 nu kernel: Modules linked in: rfcomm bnep cfg80211 ecb btusb bluetooth rfkill nvidia(PO) snd_hda_codec_hdmi cdc_acm evdev mousedev mac_hid kvm_amd kvm pcspkr amd64_edac_mod edac_core k10temp edac_mce_amd snd_hda_intel snd_hda_controller sp5100_tco i2c_piix4 snd_hda_codec tg3 snd_hwdep drm snd_pcm ptp pps_core snd_timer libphy snd hwmon soundcore i2c_core shpchp serio tpm_infineon tpm_tis tpm acpi_cpufreq button processor zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) sg ext4 crc16 mbcache jbd2 hid_generic usbhid hid sd_mod crc_t10dif crct10dif_common uas usb_storage ahci ohci_pci libahci ohci_hcd ehci_pci ehci_hcd libata usbcore scsi_mod usb_common
Nov 04 23:13:25 nu kernel: CPU: 1 PID: 501 Comm: txg_sync Tainted: P           O L 3.17.2-1-ARCH #1
Nov 04 23:13:25 nu kernel: Hardware name: HP ProLiant MicroServer, BIOS O41     10/01/2013
Nov 04 23:13:25 nu kernel: task: ffff8803ff5be4a0 ti: ffff8803fba5c000 task.ti: ffff8803fba5c000
Nov 04 23:13:25 nu kernel: RIP: 0010:[<ffffffff8153bc75>]  [<ffffffff8153bc75>] _raw_spin_lock+0x35/0x40
Nov 04 23:13:25 nu kernel: RSP: 0018:ffff8803fba5f940  EFLAGS: 00000206
Nov 04 23:13:25 nu kernel: RAX: 0000000000003812 RBX: 0000000000000046 RCX: 0000000000000000
Nov 04 23:13:25 nu kernel: RDX: 0000000000000002 RSI: 0000000000000202 RDI: ffff8802c0f05fe4
Nov 04 23:13:25 nu kernel: RBP: ffff8803fba5f940 R08: 0000000000000202 R09: 0000000000000000
Nov 04 23:13:25 nu kernel: R10: 00000000001e8480 R11: ffff8802998b6581 R12: ffff8803fba5f8c8
Nov 04 23:13:25 nu kernel: R13: ffff8803ff5be4a0 R14: ffffffff8181b500 R15: 00000000000145c0
Nov 04 23:13:25 nu kernel: FS:  00007fb566ffd700(0000) GS:ffff88041fc80000(0000) knlGS:0000000000000000
Nov 04 23:13:25 nu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 04 23:13:25 nu kernel: CR2: 00007fb57c100000 CR3: 0000000001814000 CR4: 00000000000007e0
Nov 04 23:13:25 nu kernel: Stack:
Nov 04 23:13:25 nu kernel:  ffff8803fba5f998 ffffffff8153a05a ffff88040877be88 0000000000000202
Nov 04 23:13:25 nu kernel:  ffff8803fba5f988 000000007e705eaf ffff8802c0f05fe0 ffff8802c0f05fe0
Nov 04 23:13:25 nu kernel:  ffff8802c0f06010 0000000000000001 0000000000000002 ffff8803fba5f9b0
Nov 04 23:13:25 nu kernel: Call Trace:
Nov 04 23:13:25 nu kernel:  [<ffffffff8153a05a>] __mutex_lock_slowpath+0x9a/0x250
Nov 04 23:13:25 nu kernel:  [<ffffffff8153a227>] mutex_lock+0x17/0x30
Nov 04 23:13:25 nu kernel:  [<ffffffffa020fc44>] cv_wait_common+0xb4/0x1b0 [spl]
Nov 04 23:13:25 nu kernel:  [<ffffffff810af6a0>] ? __wake_up_sync+0x20/0x20
Nov 04 23:13:25 nu kernel:  [<ffffffffa020fd98>] __cv_wait_io+0x18/0x20 [spl]
Nov 04 23:13:25 nu kernel:  [<ffffffffa03331db>] zio_wait+0x11b/0x1f0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa033335b>] zio_free+0xab/0x100 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa03347be>] zio_free_zil+0xe/0x10 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa032e52f>] zil_sync+0x1bf/0x4a0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02d815e>] ? spa_taskq_dispatch_ent+0x8e/0xd0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa032f9b2>] ? zio_taskq_dispatch+0x92/0xa0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa03334cf>] ? zio_nowait+0x11f/0x1c0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa029de5a>] dmu_objset_sync+0x25a/0x340 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa029bc70>] ? redundant_metadata_changed_cb+0x20/0x20 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa029df40>] ? dmu_objset_sync+0x340/0x340 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02aebe1>] dsl_dataset_sync+0x41/0x50 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02bcc68>] dsl_pool_sync+0xa8/0x470 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02d7730>] spa_sync+0x470/0xb90 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02e8aaa>] txg_sync_thread+0x3da/0x680 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02e86d0>] ? txg_delay+0x150/0x150 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa0207cea>] thread_generic_wrapper+0x7a/0x90 [spl]
Nov 04 23:13:25 nu kernel:  [<ffffffffa0207c70>] ? __thread_exit+0xa0/0xa0 [spl]
Nov 04 23:13:25 nu kernel:  [<ffffffff8108e06a>] kthread+0xea/0x100
Nov 04 23:13:25 nu kernel:  [<ffffffff8108df80>] ? kthread_create_on_node+0x1b0/0x1b0
Nov 04 23:13:25 nu kernel:  [<ffffffff8153c6fc>] ret_from_fork+0x7c/0xb0
Nov 04 23:13:25 nu kernel:  [<ffffffff8108df80>] ? kthread_create_on_node+0x1b0/0x1b0
Nov 04 23:13:25 nu kernel: Code: 20 c8 00 00 48 89 e5 b8 00 01 00 00 f0 66 0f c1 07 0f b6 cc 38 c1 75 06 5d c3 0f 1f 40 00 b8 00 80 00 00 eb 08 90 f3 90 83 e8 01 <74> f1 0f b6 17 38 ca 75 f2 5d c3 66 66 66 66 90 55 48 89 e5 9c

Or should I open a separate bug for this?

It happened under low load as far as I can tell, a couple of days in a row. It required hard powering down the machine to reboot every time (X hung after it, which I needed, so...).

It might be worth mentioning that I've had this running smoothly on Arch Linux for ages, until kernel 3.17 came out and the zfs package got updated to a newer git tag (to support 3.17 kernels).

@behlendorf
Contributor

@justinkb yes, your issue looks the same to me. It's also being tracked under #2523.

@behlendorf
Contributor

@thegreatgazoo if you get a chance, could you verify that the fix in openzfs/spl#421 resolves this hang?

behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when used to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically, this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #421
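
To make the failure mode concrete, here is a minimal kernel-style C sketch of the pattern the commit message describes. Everything here is hypothetical (struct obj, completer(), waiter()); it is not the actual zio code, which synchronizes through zio->io_lock and zio->io_cv.

```c
/*
 * Hypothetical sketch of a use-after-free race on an embedded mutex.
 * completer() plays the role of zio_done(), waiter() the role of
 * zio_wait().  Initialization (mutex_init, init_waitqueue_head) is
 * assumed to have happened when the object was allocated.
 */
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/wait.h>

struct obj {
	struct mutex		lock;	/* embedded in the object being freed */
	wait_queue_head_t	wq;
	int			done;
};

static void completer(struct obj *o)
{
	mutex_lock(&o->lock);
	o->done = 1;
	wake_up(&o->wq);
	mutex_unlock(&o->lock);		/* (3) may still be touching o->lock
					 *     after a waiter was granted it */
}

static void waiter(struct obj *o)
{
	wait_event(o->wq, o->done);	/* (1) woken by completer() */
	mutex_lock(&o->lock);		/* (2) can be granted the mutex while
					 *     (3) is still in its slow path */
	mutex_unlock(&o->lock);
	kfree(o);			/* (4) frees the mutex under (3) */
}
```

Once (4) runs, the remainder of (3) operates on freed memory, which matches the spinning, NULL-dereference, and bad-paging-request symptoms described above.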
@behlendorf
Contributor

This issue, which is a duplicate of #2523, was resolved by the following commit. Full details can be found in the commit message and the related LWN article.

openzfs/spl@a3c1eb7 mutex: force serialization on mutex_exit() to fix races
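
For readers who don't want to dig through the commit, the idea of "forcing serialization on mutex_exit() with a spin lock" can be sketched roughly as below. This is a simplified, hypothetical kmutex-style wrapper for illustration only, not the actual code in openzfs/spl@a3c1eb7.

```c
/*
 * Illustrative only: a mutex wrapper whose exit path is serialized by a
 * spin lock, so a new owner cannot free the structure while a previous
 * owner is still inside kmutex_exit().
 */
#include <linux/mutex.h>
#include <linux/spinlock.h>

typedef struct kmutex {
	struct mutex	m_mutex;
	spinlock_t	m_lock;		/* serializes the exit path */
} kmutex_t;

static inline void kmutex_init(kmutex_t *mp)
{
	mutex_init(&mp->m_mutex);
	spin_lock_init(&mp->m_lock);
}

static inline void kmutex_enter(kmutex_t *mp)
{
	mutex_lock(&mp->m_mutex);
	/*
	 * Wait for any thread still inside kmutex_exit() to finish
	 * touching the embedded mutex before this (new) owner proceeds
	 * and possibly frees the enclosing object.
	 */
	spin_lock(&mp->m_lock);
	spin_unlock(&mp->m_lock);
}

static inline void kmutex_exit(kmutex_t *mp)
{
	spin_lock(&mp->m_lock);
	mutex_unlock(&mp->m_mutex);
	spin_unlock(&mp->m_lock);
}
```

The extra spin lock is the "bit of performance and memory overhead" the commit message mentions; in exchange, a thread returning from kmutex_enter() knows no other thread is still inside kmutex_exit() on the same mutex, so freeing the enclosing object is safe.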

dajhorn added a commit to zfsonlinux/pkg-spl that referenced this issue Dec 20, 2014
Commit: openzfs/spl@a3c1eb7
From: Chunwei Chen <[email protected]>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races

Backported-by: Darik Horn <[email protected]>
Closes #421

Conflicts:
        include/sys/mutex.h
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 23, 2014
@thegreatgazoo
Author

@behlendorf We tested /pull/2828, which didn't eliminate the timeouts. We're now testing openzfs/spl@a3c1eb7 and will report back.

@behlendorf
Contributor

That's what I'd expect. It was unlikely that you were hitting the issue which the first patch addressed. However, I'm confident the second patch will resolve the issue.

I'd also suggest you audit the Lustre code for similar issues. It wouldn't surprise me at all if Lustre is using a mutex in a similarly unsafe fashion. In fact, we've seen very similar symptoms on some Lustre clients, and those symptoms went away when mutex debugging was enabled, which is consistent with this kind of issue.
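
For anyone doing that audit, the pattern to look for is code where the thread that frees a structure can be woken by a thread that is still inside mutex_unlock() on a lock embedded in that structure. One generic way to make such a spot safe, independent of the SPL fix, is to hold a reference across the wake-up so the structure outlives the unlock. The sketch below is generic kernel C, not Lustre code; struct obj and the function names are made up, and the object is assumed to be created with kref_init() with the initial reference owned by the waiting side.

```c
/*
 * Generic reference-counting pattern: each side holds its own reference,
 * and kref_put() is the last access on each side, so the embedded mutex
 * is never touched after the object may have been freed.
 */
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/wait.h>

struct obj {
	struct kref		ref;
	struct mutex		lock;
	wait_queue_head_t	wq;
	int			done;
};

static void obj_release(struct kref *ref)
{
	kfree(container_of(ref, struct obj, ref));
}

static void completer(struct obj *o)
{
	kref_get(&o->ref);		/* keep o alive across the unlock;
					 * safe because the waiter cannot
					 * free o before done is set */
	mutex_lock(&o->lock);
	o->done = 1;
	wake_up(&o->wq);
	mutex_unlock(&o->lock);
	kref_put(&o->ref, obj_release);	/* last touch of o on this side */
}

static void waiter(struct obj *o)
{
	wait_event(o->wq, o->done);
	kref_put(&o->ref, obj_release);	/* drop the initial reference */
}
```

Whichever side drops the last reference performs the free, and neither side touches the object after its own kref_put(), so the ordering problem goes away.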

@thegreatgazoo
Author

Yes, I just screamed at everyone to scrutinize their code.

@ryao
Contributor

ryao commented Feb 10, 2015

@behlendorf @justinkb's issue is not a duplicate of #2523. Instead, it is a duplicate of #3091, provided that it occurred on a multi-socket system.

@lintonv

lintonv commented Feb 11, 2015

@behlendorf @ryao
Has this bug (#2850) been resolved? It is marked as closed, but I don't see a fix here. If the fix is the patch in #2523, I can confirm that patch does not fix at least a variant of this issue.

Let me explain.

Our setup:
25 ZFS filesystems
I/O running simultaneously to all 25 filesystems
zfs unmount run to cleanly bring everything down (we are testing a fail-over scenario)

With this setup, during the unmount we see txg_sync hung for a long time, using a lot of CPU and memory.

Do you think the two are related? Should I open a new bug for it?

@behlendorf
Contributor

Could you file a new issue for the remaining bug? That would make it easier to track.

ryao pushed a commit to ryao/spl that referenced this issue Feb 19, 2015