Processes hang in 'D' after issuing fsync #7484
Comments
An addendum: over the weekend I had my first incident where the kernel detected that userspace processes had blocked for more than 120 seconds, as in #7038:
No I/O errors that made it to the kernel ring buffer have occurred since last reboot, but the checksum error counter for disk A16 has increased by one. |
I think I've been hit, twice, by a really similar issue: yesterday, 05/06/2018, and some months ago, 11/24/2017, under two different versions of ZFS (gentoo zfs-0.7.1-r1 and v0.7.7-r0-gentoo). The problem arose in both cases while bacula-dir was committing a large "INSERT" to its postgresql database, at the end of backups with a large number of files. In both cases (I have records) the system was reading and writing, apparently correctly, to my ZFS pool, while it was impossible to "sync" the system: any "sync" command was hanging, and even shutting down or rebooting the system without a hard reset was impossible. The only "hanging" process, impossible to kill, was the postgres process trying to commit the INSERT, probably blocked in a library fsync() call. Since it was impossible to kill the postgres process, it was presumably stuck inside a system call, in kernel mode. Of course the postgres database files were managed by ZFS, and this process was the only one hanging in the system before I tried to manually "sync" it. Unfortunately in both cases I didn't have the opportunity to perform a more detailed analysis, but the problem really looks like a ZFS problem.

I have another bacula backup system doing the same job, with the same versions of ZFS, over the same period of time. It hasn't shown any problem so far. The only difference between the two systems is the kernel configuration (the same kernel versions were kept in the same periods). The one that hung has a PREEMPT kernel; the other one is not PREEMPT at all. Could that be related? In the meanwhile I'm switching to a non-PREEMPT kernel to see if it makes any difference, but the probability of the event, in my case, seems really small. |
Just had the same problem on another system, without any other evident error, while doing a relatively small "zfs send" (64Gb from a 3Tb raidz1 pool). Pasting a few lines into a vi buffer hung; the file being edited was in my /home, a zfs filesystem from the same pool. Suddenly it became impossible to "sync": the kernel process "kworker/u16:1" went to 100% CPU and any attempt to sync hung. The only way to get back control of the system was a hard reset, again. The kernel is a gentoo 4.9.95-gentoo stable kernel, NO PREEMPT this time, with the latest stable gentoo sys-fs/zfs-0.7.8. It really looks like some sort of deadlock inside the kernel, if not ZFS. |
Let me add my 2 cents to this issue. I have another "bacula" server using ZFS, with the same configuration, same kernel, same ZFS versions and same workload, which has never experienced this issue. The only difference? Since that server uses whole-disk devices and not partition devices, ZFS switched the linux kernel I/O scheduler from the [cfq] default to the [noop] scheduler. Now, could this be a linux kernel cfq-scheduler bug which pops up with greater probability under a ZFS I/O workload? On the other hand, I scrubbed (and took a complete "zfs send" backup of) my pools: no errors, and SMART is clean. And while "sync" hangs I can't see any other problem besides a kernel worker using 100% CPU, which may actually indicate a kernel bug, mightn't it? |
I am experiencing this with the noop scheduler. |
Well, I'm testing the "noop" scheduler now and I haven't experienced a new "sync hang" so far. I've also set vm.dirty_bytes = 268435456. Let's see what happens; we don't have enough info so far, but I bet it really is a kernel issue, not a ZFS issue. |
Hello, here's another one with the same problem.
system:
workload:
problem:
Does somebody know how to get system traces? |
My systems (two) were using cfq. The hang seems more probable with cfq than with noop, from what I've verified. My systems haven't shown the hang again since switching to noop and applying the "dirty" change, so far. |
I have ruled out I/O errors as being the cause of this problem (and thus it being related to #7469), as it has happened several times without I/O errors. I have downgraded my affected system to 0.7.3 to see if the problem recurs. |
Happened again, same pattern: postgres doing a big INSERT on a zfs filesystem, gentoo stable kernel 4.9.95, ZFS v0.7.8-r0-gentoo, a kworker thread at 100%: root 12990 2 35 20:30 ? 00:16:54 [kworker/u32:5] and the kworker thread's stack (cat /proc/12990/stack) looks really weird to my eyes:
This time the I/O scheduler was noop (it looks like cfq, noop or deadline doesn't actually make any difference). I don't know if this may be related, but on this particular system the "zfs" pool also shows this problem: zdb -u zfs By the way, of course, the only way to escape from this "trap" was a "reboot -nf". |
Seems similar to what I've been seeing in #7425 |
Downgrading to 0.7.3 did not get rid of the problem, although it took twelve days for it to rear its head again. |
Upgrading to 0.7.9 did not get rid of the problem, which manifested faster than I've ever seen before: in under eight hours, the failure occurred. |
Are you using VMware or other virt tech?
…-uxio-
|
No, it's running on the bare metal. |
We're debugging a similar issue @datto on
In
This seems to explain why the TXG syncing is stuck. The tricky part is figuring out what is preventing the zvol thread from successfully grabbing that |
Had a hang after less than five hours today, but we may be seeing a different problem. Contents of
|
Got this today, complete with the |
Well, it seems an extremely rare and subtle bug, if it even is a bug. It hasn't shown up on my systems (gentoo, 4.14.105-r1 and 4.19.23/27-r1, zfs-0.7.12) for about six months now. Now I'm starting to ask myself, and this thread, whether there is a common source of this event among the users who have reported the problem here. For example, I have a postgres db running over zfs, but, on the other hand, I have seen the event simply by starting vi (it automatically syncs its .swp files) on my zfs home filesystem on another system. For this reason I moved the vi temporary directory to /tmp, an ext4 filesystem. To my eyes, the probability of seeing this event increases as the number of syncs increases. Maybe that's a way to reproduce the problem? For example, calling fsync(fd) in a loop in a test program, with a file descriptor open on a ZFS filesystem (see the rough sketch below)? |
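(A minimal sketch of such a loop, purely as an illustration of the idea rather than a confirmed reproducer; the path and the progress output are made-up assumptions.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical path on a ZFS dataset; adjust for your own pool. */
    int fd = open("/tank/fsync-loop.dat", O_CREAT | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (long i = 0; ; i++) {
        char byte = (char)i;
        if (write(fd, &byte, 1) != 1) { perror("write"); break; }
        /* The idea: if the bug triggers, one of these fsync calls never
         * returns and the process drops into the 'D' state. */
        if (fsync(fd) != 0) { perror("fsync"); break; }
        if (i % 10000 == 0) printf("%ld syncs completed\n", i);
    }
    close(fd);
    return 0;
}
```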
New event, #2-2019, in five months: kernel 4.14.105-gentoo-r1 #1 SMP (elevator=deadline), with 3631 root 20 0 0 0 0 R 100.0 0.0 793:51.76 kworker/u32:4, again while bacula was updating its postgres database, a 12Gb db on a zfs filesystem. Any sync command was hanging, and: cat /proc/3631/stack This incident has happened on this system before, 24/11/2017 (#1-2017), under different kernels and different ZFS/SPL versions, from v0.7.7-r0-gentoo to v0.7.12-r0-gentoo. Just hoping this helps to debug this really annoying problem. |
As an experiment, I changed three filesystems from sync=default to sync=disabled. The system has gone 11 days so far without this problem surfacing. Given normal failure rates, this is unlikely to be by chance. The system in question does not have a SLOG. |
Correct. I moved the system that was giving me the most trouble to sync=disabled, and it's been up 274 days without issue instead of 24-48 hours between lockups. |
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions. |
System information
Describe the problem you're observing
About once a week, processes accessing ZFS hang in the 'D' state. zfs-auto-snapshot continues to function, and access to the zpool is still possible. Processes appear to hang when issuing `sync` or `fsync`. Processes can still read & write so long as they don't issue `fsync`. In very limited testing, data written prior to `fsync` is readable by processes that haven't hung. All zpools on the system are affected.

It looks like we started seeing the problem in December, after upgrading from 0.7.3 to 0.7.4. (I've looked as far back as August and don't see evidence of this happening prior to December.) Frequency has increased to one failure a week recently, although that may be due to increased workload.
Hardware is a Supermicro SSG-6048R-E1CR36L with 256 GB ECC RAM, E5-2609v3 CPU and an assortment of Seagate 4TB and 8TB nearline disks (formerly the Enterprise Capacity line). Linux is running on the bare metal, not under a hypervisor.
The problem appears to happen under heavy writing. I have not had it happen under heavy `zfs send` operations or heavy reading.

It is possible that this problem is related to or a duplicate of two other open issues. I had a difficult time deciding whether to add this as a comment to one or to open a new issue.
One possibility is #7469. I have gotten occasional I/O errors from the disks in one of my pools, but the timing of the I/O errors doesn't match the onset of this problem. In the case I'm looking at now, the last I/O error appears to have happened twelve hours before onset of this problem. I am not finding references to `txg_sync` in the logs or stack.

The other possibility is #7038. I'm not doing high-frequency creation, destruction or cloning, nor am I getting kernel messages about user tasks blocking. The only such tasks are `INFO: task z_zvol:17573 blocked for more than 120 seconds` several days before onset of symptoms. Stack dumps tended to differ from the examples there. I do see a kworker process (presently `kworker/u25:4`) eating 100% of one CPU. The system load pushes ever higher.

Describe how to reproduce the problem
I have not determined how to trigger the problem, but once the problem occurs, anything that issues fsync on a file on ZFS will hang unkillably forever. For a trivial example, this hangs unless compiled with `-DNO_HANG_PLZ`, in which case it has no problem writing data.
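(The original test program isn't included here; what follows is a minimal sketch of the kind of program described, assuming the `NO_HANG_PLZ` macro simply skips the `fsync` call and that the target path sits on a ZFS dataset.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* Hypothetical default path; point it at a file on a ZFS filesystem. */
    const char *path = (argc > 1) ? argv[1] : "/tank/fsync-test.dat";
    const char msg[] = "fsync test\n";

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, msg, sizeof(msg) - 1) != (ssize_t)(sizeof(msg) - 1)) {
        perror("write");
        return 1;
    }

#ifndef NO_HANG_PLZ
    /* On an affected system this call never returns and the process
     * sits unkillably in the 'D' state. */
    if (fsync(fd) != 0) perror("fsync");
#endif

    close(fd);
    puts("write completed");
    return 0;
}
```

Built plainly (e.g. `gcc -o fsync-test fsync-test.c`) it hangs once the system is in the bad state; built with `-DNO_HANG_PLZ` it skips the fsync and completes normally.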
Include any warning/errors/backtraces from the system logs
The spare `B1` changed to `sdz` after a reboot.

Kernel log is truncated to time of first error. Onset of problem is at least an hour and thirteen minutes after timestamp 287375, as I have evidence of successful fsync in another program.
Some stack traces of hung processes: