txg_sync indefinite hang, possibly on dp_config_rwlock (task txg_sync blocked for more than 120 seconds) #7598
Comments
I think we might have hit something similar with
@theonewolf, are you able to describe your configuration (versions, architecture, pool config, etc.) and what was done immediately before the hang, in case this is specific to any of that? I've been trying to reproduce the issue on another system with the same pool arrangement and kernel/ZFS versions, though completely different hardware, by repeatedly restarting containers (which is what triggered my crash) while otherwise stressing ZFS, but have had no luck.

Your issue does look consistent with what I've encountered; indeed, when I rebooted that system, it hung for long periods at quite a few places (shutting down containers and unmounting filesystems both failed and timed out), appeared completely frozen for a while, and ultimately took around an hour to shut down. It then came back up with no issues.

The pool that encountered this issue was a raidz2 of 7 disks, one of which (second down in zpool status) was offline; lxd was using its own dataset within this pool. The create command used was

The system was a Ryzen 3 CPU, 16 GB ECC RAM, X370 chipset.

Also, I'm not sure if this is relevant at all, but as stated in the initial comment, I had left the system running for a long time in the partly-hung state. I noticed that while the datasets used for lxd were mostly hung and untouchable, the dataset I use for NAS was still accessible, at least initially. At some point, a scheduled backup of another system I'd forgotten about kicked off using rsync and transferred several GB of data to a single file before rsync on the client also hung. After the reboot of the server, no trace of that file was left; I'm not sure whether this bug caused it to be lost after being written in the hung state, or whether the rsync receive side happened to hang before it even opened the file and then discarded the first few GB of data (which is something I've seen rsync do quite a few times).
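The actual create command didn't survive in the text above. Purely as a hypothetical sketch, a 7-disk raidz2 pool of the kind described would typically be created along these lines; the pool name and device paths are invented for illustration:

    # Hypothetical example only -- the reporter's actual command is not preserved.
    # Creates a 7-disk raidz2 pool named "tank" using by-id device paths.
    zpool create tank raidz2 \
        /dev/disk/by-id/ata-DISK1 /dev/disk/by-id/ata-DISK2 /dev/disk/by-id/ata-DISK3 \
        /dev/disk/by-id/ata-DISK4 /dev/disk/by-id/ata-DISK5 /dev/disk/by-id/ata-DISK6 \
        /dev/disk/by-id/ata-DISK7

    # A dedicated dataset for lxd, as described in the comment.
    zfs create tank/lxd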
@CoJaBo this was a CentOS 7.5 system. Actually, the whole system is locked up right now; I need to go reset it physically (it's a distance away). It is running on Intel Xeons (brand new, latest socket) on a SuperMicro board with 256 GiB of ECC RAM. We had very high CPU usage while spawning containers.

Our pool is even simpler: we have LVM underneath, so we aren't doing RAID or anything fancy at the ZFS layer other than compression (LZ4 as well, I think).
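For reference, a pool of the simpler kind described here (a single LVM logical volume as the only vdev, with LZ4 compression) would look roughly like the following; the volume group, logical volume, and pool names are made up, since none are given in the issue:

    # Hypothetical sketch: single-vdev pool on an LVM logical volume with lz4.
    zpool create -O compression=lz4 tank /dev/vg0/zfs_data

    # Verify the vdev layout and the compression property.
    zpool status tank
    zfs get compression tank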
I have a similar issue on Ubuntu 18.04 with LXD when trying to restart a container.
Commit 93b43af inadvertently introduced the following scenario which can result in a deadlock. This issue was most easily reproduced by LXD containers using a ZFS storage backend but should be reproducible under any workload which is frequently mounting and unmounting.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
      sget()
        sget_userns()
          grab_super()
            down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A

Reviewed by: Alek Pinchuk <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#7598
Closes openzfs#7659
Closes openzfs#7691
Closes openzfs#7693
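If you suspect a live system has hit this particular deadlock, one rough way to check (assuming blocked-task stacks have already been dumped to the kernel log, for example via sysrq as shown further down in the issue) is to look for the frames named in the three traces above; this is an informal triage step, not an official procedure:

    # Search the kernel log for frames matching the three deadlocked threads.
    dmesg | grep -E 'txg_wait_synced|spa_sync|grab_super|deactivate_super|zpl_mount'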
System information
Describe the problem you're observing
txg_sync hung indefinitely, which has blocked most zfs commands from running as well.
When discussing the problem on IRC, it was suggested that something holding dp_config_rwlock is the likely cause.
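One way to gather evidence for that theory (not part of the original report; requires root) is to look at where the txg_sync kernel thread itself is stuck:

    # Show the kernel stack of each txg_sync thread (one per imported pool).
    for pid in $(pgrep -x txg_sync); do
        echo "== txg_sync pid $pid =="
        cat /proc/$pid/stack
    done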
Describe how to reproduce the problem
I have not yet attempted to reproduce it, and there is no clear sequence of events that led to it.
The problem appeared suddenly when attempting to restart an lxc/lxd container (which had been running mostly unused for a long time) stored on ZFS, an operation that likely triggered a burst of read/write and zfs unmount/mount activity. The ZFS hang led to most of lxd hanging as well.
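A reproduction attempt, based only on the trigger described above and the mount/unmount churn named in the fix's commit message, might look like the following. This is not a confirmed reproducer, and the container, pool, and path names are placeholders:

    # Unconfirmed reproduction sketch: restart a ZFS-backed LXD container in a
    # loop (each restart unmounts/remounts its dataset) while keeping the pool
    # busy with synchronous writes.
    while true; do
        lxc restart mycontainer
    done &

    while true; do
        dd if=/dev/zero of=/tank/stress/junk bs=1M count=256 conv=fsync
    done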
At present, I've left the system running in this state in case there's something else I can do to get useful info from it before rebooting. As much log info as I've been able to get is attached.
dmesg
("zen", for some reason, is the name of the lxc container that was being restarted at the time; the first line (audit) gives a timestamp reference of closer to when the hang actually happened.)
Stack/info from all hung (D) processes
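For anyone wanting to collect the same kind of information on their own system, the stacks of all uninterruptible (D-state) processes can be gathered roughly like this; it requires root, and the sysrq line assumes sysrq is enabled:

    # Dump a stack for every process currently in uninterruptible sleep (D state).
    for pid in $(ps -eo pid,stat --no-headers | awk '$2 ~ /^D/ {print $1}'); do
        echo "=== pid $pid ($(cat /proc/$pid/comm 2>/dev/null)) ==="
        cat /proc/$pid/stack 2>/dev/null
    done

    # Alternatively, ask the kernel to log all blocked tasks to dmesg.
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200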
cat /proc/spl/kstat/zfs/dbgmsg
(/sys/module/zfs/parameters/zfs_dbgmsg_enable was not enabled until well after the hang)
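For future incidents, the internal debug log referenced here only captures anything if it is switched on before the event; a minimal way to enable and read it is:

    # Enable the in-memory ZFS debug log; it must be on *before* the hang occurs.
    echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable

    # Read (and optionally clear) the accumulated messages.
    cat /proc/spl/kstat/zfs/dbgmsg
    echo 0 > /proc/spl/kstat/zfs/dbgmsg   # clears the buffer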