
txg_sync thread stuck in zio_wait() for more than 12 hours with no pending/running IO #2850

Closed
thegreatgazoo opened this issue Oct 29, 2014 · 16 comments

@thegreatgazoo

This happened on CentOS 6.5 (kernel 2.6.32-431.29.2) with ZFS 0.6.3-1 (srcversion 533BB7E5866E52F63B9ACCB). The txg_sync thread has now been hung for more than 12 hours:

# cat /proc/`pidof txg_sync`/stack
[<ffffffffa011b41c>] cv_wait_common+0x8c/0x100 [spl]
[<ffffffffa011b4a8>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa02610ab>] zio_wait+0xfb/0x1b0 [zfs]
[<ffffffffa020ee95>] spa_sync+0x455/0xa60 [zfs]
[<ffffffffa021b916>] txg_sync_thread+0x2e6/0x510 [zfs]
[<ffffffffa0116c2f>] thread_generic_wrapper+0x5f/0x70 [spl]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

But /proc/spl/kstat/zfs/lustre-ost2/io showed wcnt=0 and rcnt=0, which I took to mean that no ZIO is queued at the vdev (the pool sits on a single /dev/vda). /proc/diskstats also showed 0 I/Os currently in progress. I checked zpool events -v and found nothing about any zio. The last event on the pool was:

Oct 28 2014 21:47:00.346936756 ereport.fs.zfs.config.sync
        class = "ereport.fs.zfs.config.sync"
        ena = 0x1c00468656200001
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0xdde5fa1af9ab382a
        (end detector)
        pool = "lustre-ost2"
        pool_guid = 0xdde5fa1af9ab382a
        pool_context = 0x0
        pool_failmode = "wait"
        time = 0x54507144 0x14add5b4
        eid = 0x38

But the hang began at about Oct 29 2014 01:40.

The system still has about 1/3 of its 1.9G of memory free; arcsz is only 20M. Swap is not on ZFS. So it's probably not a VM-related deadlock.

I've hit this several times already, but this time I can keep the system live for as long as it takes to troubleshoot. Any hints on debugging?

@behlendorf
Contributor

I've seen this rarely and it's been hard to reproduce. I'd start by dumping all the kernel stacks and then taking a close look at the various z_* processes. The outstanding zio may be hung in the pipeline for some non-obvious reason.

@thegreatgazoo
Author

I checked all the ZFS threads. Only z_null_iss/0 looked suspicious: it's always in the running state, but its stack never changes:

[<ffffffffa0117686>] taskq_thread+0x3c6/0x3f0 [spl]
[<ffffffff8109abf6>] kthread+0x96/0xa0
[<ffffffff8100c20a>] child_rip+0xa/0x20
[<ffffffffffffffff>] 0xffffffffffffffff

  PID USER PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
19582 root 39 19    0   0   0 R  8.0  0.0 85:17.34 z_null_iss/0

@behlendorf
Contributor

This might be another instance of #2523, which has been difficult to pin down. Usually in this case you'll be able to identify a process spinning in mutex_unlock_slowpath.

@thegreatgazoo
Author

Thanks, I'll try /pull/2828. The good news is we hit it quite often (about 6% of the time) during a particular test.

@behlendorf
Contributor

I've been looking for a good reproducer. What's the easiest way to hit this in a VM with Lustre?

@thegreatgazoo
Author

It's most often hit in sanity test_132; more details are in LU-5242. I was also able to hit it a couple of times with VMs configured with only 1 CPU, so it's likely not caused by a race.

@justinkb

justinkb commented Nov 4, 2014

Is this kernel log a symptom of this bug?

Nov 04 23:13:25 nu kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [txg_sync:501]
Nov 04 23:13:25 nu kernel: Modules linked in: rfcomm bnep cfg80211 ecb btusb bluetooth rfkill nvidia(PO) snd_hda_codec_hdmi cdc_acm evdev mousedev mac_hid kvm_amd kvm pcspkr amd64_edac_mod edac_core k10temp edac_mce_amd snd_hda_intel snd_hda_controller sp5100_tco i2c_piix4 snd_hda_codec tg3 snd_hwdep drm snd_pcm ptp pps_core snd_timer libphy snd hwmon soundcore i2c_core shpchp serio tpm_infineon tpm_tis tpm acpi_cpufreq button processor zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) sg ext4 crc16 mbcache jbd2 hid_generic usbhid hid sd_mod crc_t10dif crct10dif_common uas usb_storage ahci ohci_pci libahci ohci_hcd ehci_pci ehci_hcd libata usbcore scsi_mod usb_common
Nov 04 23:13:25 nu kernel: CPU: 1 PID: 501 Comm: txg_sync Tainted: P           O L 3.17.2-1-ARCH #1
Nov 04 23:13:25 nu kernel: Hardware name: HP ProLiant MicroServer, BIOS O41     10/01/2013
Nov 04 23:13:25 nu kernel: task: ffff8803ff5be4a0 ti: ffff8803fba5c000 task.ti: ffff8803fba5c000
Nov 04 23:13:25 nu kernel: RIP: 0010:[<ffffffff8153bc75>]  [<ffffffff8153bc75>] _raw_spin_lock+0x35/0x40
Nov 04 23:13:25 nu kernel: RSP: 0018:ffff8803fba5f940  EFLAGS: 00000206
Nov 04 23:13:25 nu kernel: RAX: 0000000000003812 RBX: 0000000000000046 RCX: 0000000000000000
Nov 04 23:13:25 nu kernel: RDX: 0000000000000002 RSI: 0000000000000202 RDI: ffff8802c0f05fe4
Nov 04 23:13:25 nu kernel: RBP: ffff8803fba5f940 R08: 0000000000000202 R09: 0000000000000000
Nov 04 23:13:25 nu kernel: R10: 00000000001e8480 R11: ffff8802998b6581 R12: ffff8803fba5f8c8
Nov 04 23:13:25 nu kernel: R13: ffff8803ff5be4a0 R14: ffffffff8181b500 R15: 00000000000145c0
Nov 04 23:13:25 nu kernel: FS:  00007fb566ffd700(0000) GS:ffff88041fc80000(0000) knlGS:0000000000000000
Nov 04 23:13:25 nu kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 04 23:13:25 nu kernel: CR2: 00007fb57c100000 CR3: 0000000001814000 CR4: 00000000000007e0
Nov 04 23:13:25 nu kernel: Stack:
Nov 04 23:13:25 nu kernel:  ffff8803fba5f998 ffffffff8153a05a ffff88040877be88 0000000000000202
Nov 04 23:13:25 nu kernel:  ffff8803fba5f988 000000007e705eaf ffff8802c0f05fe0 ffff8802c0f05fe0
Nov 04 23:13:25 nu kernel:  ffff8802c0f06010 0000000000000001 0000000000000002 ffff8803fba5f9b0
Nov 04 23:13:25 nu kernel: Call Trace:
Nov 04 23:13:25 nu kernel:  [<ffffffff8153a05a>] __mutex_lock_slowpath+0x9a/0x250
Nov 04 23:13:25 nu kernel:  [<ffffffff8153a227>] mutex_lock+0x17/0x30
Nov 04 23:13:25 nu kernel:  [<ffffffffa020fc44>] cv_wait_common+0xb4/0x1b0 [spl]
Nov 04 23:13:25 nu kernel:  [<ffffffff810af6a0>] ? __wake_up_sync+0x20/0x20
Nov 04 23:13:25 nu kernel:  [<ffffffffa020fd98>] __cv_wait_io+0x18/0x20 [spl]
Nov 04 23:13:25 nu kernel:  [<ffffffffa03331db>] zio_wait+0x11b/0x1f0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa033335b>] zio_free+0xab/0x100 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa03347be>] zio_free_zil+0xe/0x10 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa032e52f>] zil_sync+0x1bf/0x4a0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02d815e>] ? spa_taskq_dispatch_ent+0x8e/0xd0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa032f9b2>] ? zio_taskq_dispatch+0x92/0xa0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa03334cf>] ? zio_nowait+0x11f/0x1c0 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa029de5a>] dmu_objset_sync+0x25a/0x340 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa029bc70>] ? redundant_metadata_changed_cb+0x20/0x20 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa029df40>] ? dmu_objset_sync+0x340/0x340 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02aebe1>] dsl_dataset_sync+0x41/0x50 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02bcc68>] dsl_pool_sync+0xa8/0x470 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02d7730>] spa_sync+0x470/0xb90 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02e8aaa>] txg_sync_thread+0x3da/0x680 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa02e86d0>] ? txg_delay+0x150/0x150 [zfs]
Nov 04 23:13:25 nu kernel:  [<ffffffffa0207cea>] thread_generic_wrapper+0x7a/0x90 [spl]
Nov 04 23:13:25 nu kernel:  [<ffffffffa0207c70>] ? __thread_exit+0xa0/0xa0 [spl]
Nov 04 23:13:25 nu kernel:  [<ffffffff8108e06a>] kthread+0xea/0x100
Nov 04 23:13:25 nu kernel:  [<ffffffff8108df80>] ? kthread_create_on_node+0x1b0/0x1b0
Nov 04 23:13:25 nu kernel:  [<ffffffff8153c6fc>] ret_from_fork+0x7c/0xb0
Nov 04 23:13:25 nu kernel:  [<ffffffff8108df80>] ? kthread_create_on_node+0x1b0/0x1b0
Nov 04 23:13:25 nu kernel: Code: 20 c8 00 00 48 89 e5 b8 00 01 00 00 f0 66 0f c1 07 0f b6 cc 38 c1 75 06 5d c3 0f 1f 40 00 b8 00 80 00 00 eb 08 90 f3 90 83 e8 01 <74> f1 0f b6 17 38 ca 75 f2 5d c3 66 66 66 66 90 55 48 89 e5 9c

Or should I open a separate bug for this?

It happened under low load as far as I can tell, a couple of days in a row. It required hard powering down the machine to reboot every time (X hung after it, which I needed, so...).

It might be worth mentioning that I've had this running smoothly on Arch Linux for ages, until kernel 3.17 came out and the zfs package got updated to a newer git tag (to support 3.17 kernels).

@behlendorf
Contributor

@justinkb yes, your issue looks the same to me. It's also being tracked under #2523.

@behlendorf
Contributor

@thegreatgazoo if you get a chance, could you verify that the fix in openzfs/spl#421 resolves this hang?

behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 19, 2014
It is known that mutexes in Linux are not safe when used to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the openzfs/zfs#2523
issue.  Specifically, this fix resolves at least the following
outstanding issues:

openzfs/zfs#401
openzfs/zfs#2523
openzfs/zfs#2679
openzfs/zfs#2684
openzfs/zfs#2704
openzfs/zfs#2708
openzfs/zfs#2517
openzfs/zfs#2827
openzfs/zfs#2850
openzfs/zfs#2891
openzfs/zfs#2897
openzfs/zfs#2247
openzfs/zfs#2939

Signed-off-by: Chunwei Chen <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Closes #421
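
To make the failure mode concrete, here is a minimal kernel-style C sketch of the pattern the commit message describes. Everything here is hypothetical (struct obj, completer(), waiter()); it is not the actual zio code, which synchronizes through zio->io_lock and zio->io_cv.

```c
/*
 * Hypothetical sketch of a use-after-free race on an embedded mutex.
 * completer() plays the role of zio_done(), waiter() the role of
 * zio_wait().  Initialization (mutex_init, init_waitqueue_head) is
 * assumed to have happened when the object was allocated.
 */
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/wait.h>

struct obj {
	struct mutex		lock;	/* embedded in the object being freed */
	wait_queue_head_t	wq;
	int			done;
};

static void completer(struct obj *o)
{
	mutex_lock(&o->lock);
	o->done = 1;
	wake_up(&o->wq);
	mutex_unlock(&o->lock);		/* (3) may still be touching o->lock
					 *     after a waiter was granted it */
}

static void waiter(struct obj *o)
{
	wait_event(o->wq, o->done);	/* (1) woken by completer() */
	mutex_lock(&o->lock);		/* (2) can be granted the mutex while
					 *     (3) is still in its slow path */
	mutex_unlock(&o->lock);
	kfree(o);			/* (4) frees the mutex under (3) */
}
```

Once (4) runs, the remainder of (3) operates on freed memory, which matches the spinning, NULL-dereference, and bad-paging-request symptoms described above.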
@behlendorf
Contributor

This issue, which is a duplicate of #2523, was resolved by the following commit. Full details can be found in the commit message and the related LWN article.

openzfs/spl@a3c1eb7 mutex: force serialization on mutex_exit() to fix races
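
For readers who don't want to dig through the commit, the idea of "forcing serialization on mutex_exit() with a spin lock" can be sketched roughly as below. This is a simplified, hypothetical kmutex-style wrapper for illustration only, not the actual code in openzfs/spl@a3c1eb7.

```c
/*
 * Illustrative only: a mutex wrapper whose exit path is serialized by a
 * spin lock, so a new owner cannot free the structure while a previous
 * owner is still inside kmutex_exit().
 */
#include <linux/mutex.h>
#include <linux/spinlock.h>

typedef struct kmutex {
	struct mutex	m_mutex;
	spinlock_t	m_lock;		/* serializes the exit path */
} kmutex_t;

static inline void kmutex_init(kmutex_t *mp)
{
	mutex_init(&mp->m_mutex);
	spin_lock_init(&mp->m_lock);
}

static inline void kmutex_enter(kmutex_t *mp)
{
	mutex_lock(&mp->m_mutex);
	/*
	 * Wait for any thread still inside kmutex_exit() to finish
	 * touching the embedded mutex before this (new) owner proceeds
	 * and possibly frees the enclosing object.
	 */
	spin_lock(&mp->m_lock);
	spin_unlock(&mp->m_lock);
}

static inline void kmutex_exit(kmutex_t *mp)
{
	spin_lock(&mp->m_lock);
	mutex_unlock(&mp->m_mutex);
	spin_unlock(&mp->m_lock);
}
```

The extra spin lock is the "bit of performance and memory overhead" the commit message mentions; in exchange, a thread returning from kmutex_enter() knows no other thread is still inside kmutex_exit() on the same mutex, so freeing the enclosing object is safe.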

dajhorn added a commit to zfsonlinux/pkg-spl that referenced this issue Dec 20, 2014
Commit: openzfs/spl@a3c1eb7
From: Chunwei Chen <[email protected]>
Date: Fri, 19 Dec 2014 11:31:59 +0800
Subject: mutex: force serialization on mutex_exit() to fix races

Backported-by: Darik Horn <[email protected]>
Closes #421

Conflicts:
        include/sys/mutex.h
behlendorf pushed a commit to openzfs/spl that referenced this issue Dec 23, 2014
@thegreatgazoo
Author

@behlendorf We tested /pull/2828, which didn't eliminate the timeouts. We're now testing openzfs/spl@a3c1eb7 and will report back.

@behlendorf
Contributor

That's what I'd expect. It was unlikely that you were hitting the issue which the first patch addressed. However, I'm confident the second patch will resolve the issue.

I'd also suggest you audit the Lustre code for similar issues. It wouldn't surprise me at all if Lustre is using a mutex in a similarly unsafe fashion. In fact, we've seen very similar symptoms on some Lustre clients, and those symptoms went away when mutex debugging was enabled, which is consistent with this kind of issue.
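
For anyone doing that audit, the pattern to look for is code where the thread that frees a structure can be woken by a thread that is still inside mutex_unlock() on a lock embedded in that structure. One generic way to make such a spot safe, independent of the SPL fix, is to hold a reference across the wake-up so the structure outlives the unlock. The sketch below is generic kernel C, not Lustre code; struct obj and the function names are made up, and the object is assumed to be created with kref_init() with the initial reference owned by the waiting side.

```c
/*
 * Generic reference-counting pattern: each side holds its own reference,
 * and kref_put() is the last access on each side, so the embedded mutex
 * is never touched after the object may have been freed.
 */
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/wait.h>

struct obj {
	struct kref		ref;
	struct mutex		lock;
	wait_queue_head_t	wq;
	int			done;
};

static void obj_release(struct kref *ref)
{
	kfree(container_of(ref, struct obj, ref));
}

static void completer(struct obj *o)
{
	kref_get(&o->ref);		/* keep o alive across the unlock;
					 * safe because the waiter cannot
					 * free o before done is set */
	mutex_lock(&o->lock);
	o->done = 1;
	wake_up(&o->wq);
	mutex_unlock(&o->lock);
	kref_put(&o->ref, obj_release);	/* last touch of o on this side */
}

static void waiter(struct obj *o)
{
	wait_event(o->wq, o->done);
	kref_put(&o->ref, obj_release);	/* drop the initial reference */
}
```

Whichever side drops the last reference performs the free, and neither side touches the object after its own kref_put(), so the ordering problem goes away.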

@thegreatgazoo
Author

Yes, I just screamed at everyone to scrutinize their code.

@ryao
Contributor

ryao commented Feb 10, 2015

@behlendorf @justinkb's issue is not a duplicate of #2523. Instead, it is a duplicate of #3091, provided that it occurred on a multi-socket system.

@lintonv

lintonv commented Feb 11, 2015

@behlendorf @ryao
Has this bug (#2850) been resolved? It is marked as closed, but I don't see a fix here. If the fix is the patch in #2523, I can confirm that patch does not fix at least a variant of this issue.

Let me explain.

Our setup:
25 ZFS filesystems
I/O running simultaneously to all 25 filesystems
zfs unmount run to cleanly bring everything down (we are testing a fail-over scenario)

With this setup, during the unmount we see txg_sync hung for a long time, using a lot of CPU and memory.

Do you think the two are related? Should I open a new bug for it?

@behlendorf
Contributor

Could you file a new issue for the remaining bug? That would make it easier to track.

ryao pushed a commit to ryao/spl that referenced this issue Feb 19, 2015