Kernel panic from zio_decompress in lzjb_decompress under heavy I/O #2709
Comments
I'll note that you can set l2arc_nocompress to a nonzero value to disable L2ARC compression. That said, I'm rather curious about the history of this problem. Is it 3.17-specific? Is dedup enabled? The L2ARC compression feature has been around for a while and I'm not aware of other problems being reported with it. From the looks of things, I'm guessing the memset call is actually a "bzero()" in the source code, but the stack trace isn't making it very easy to figure out which one. My best guess is the first one in It sounds like this is something you can reproduce pretty easily. Could you try compiling your spl and zfs modules with
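For reference, a minimal sketch of flipping that tunable at runtime. It assumes the parameter is exposed under /sys/module/zfs/parameters/, which is where ZFS on Linux module parameters normally show up. The helper name and its params_dir argument are made up for illustration, and writing to the real path needs root.

```python
from pathlib import Path

# Assumed location of the ZFS module parameters on Linux; adjust if your build differs.
DEFAULT_PARAMS_DIR = Path("/sys/module/zfs/parameters")


def set_l2arc_nocompress(value: int, params_dir: Path = DEFAULT_PARAMS_DIR) -> str:
    """Write `value` to l2arc_nocompress and return what the file reads back.

    Hypothetical helper: a nonzero value disables L2ARC compression,
    0 turns it back on.
    """
    param = Path(params_dir) / "l2arc_nocompress"
    param.write_text(f"{value}\n")
    return param.read_text().strip()
```

Calling set_l2arc_nocompress(1) as root should be roughly equivalent to echoing 1 into the parameter file by hand. The setting does not persist across reboots.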
I've been having this problem for a while; it's not just on the 3.17 kernel. Dedup is turned on as well. I rebuilt with --enable-debug and it's now crashing while trying to bring the mirror back to a consistent state... I think it finally crashed at just the right point for one of the mirror devices to get corrupted.
I zeroed out the first and last 100MB of the drive and did a "replace" on the pool with the same device (after running the drive through some diagnostics). The resilver just completed and I'm now back to rsync'ing to see whether the debug build provides any additional information if it crashes.
Fix in #4790.
Closing, fixed by #4790.
I'm experiencing a crash on a system with 8GB of RAM when using compression. I've got a basic mirror setup (2 devices) with no log device but I do have a cache device.
I traced the code history back to where the cache started getting compressed regardless of the volume preferences, and tried disabling compression on it by patching the ZFS driver to never compress on the cache device, but it still led to a crash. To do this, I patched include/zfs/sys/dmu_objset.h so that DMU_OS_IS_L2COMPRESSIBLE(os) always returns false (0).
Finally, I simply removed the cache device and it seemed like things were stable for a bit but ultimately the system is still crashing.
For now I'm running a watchdog that checks for the specific "zio_decompress" kernel panic message to keep the system up. This is just a personal server that doesn't need to be reliable or anything; the watchdog is only there so that the system recovers after the kernel panic.
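The matching part of such a watchdog could look roughly like the sketch below. It only shows the string check. Where the log lines come from (serial console, netconsole, a saved dmesg) and what happens on a match (e.g. a reset) are left out, and the function name is hypothetical, not what actually runs on this box.

```python
from typing import Iterable, Optional

# String the watchdog looks for, taken from the panic described in this report.
PANIC_MARKER = "zio_decompress"


def find_panic_line(lines: Iterable[str], marker: str = PANIC_MARKER) -> Optional[str]:
    """Return the first log line containing `marker`, or None if no line matches."""
    for line in lines:
        if marker in line:
            return line.rstrip("\n")
    return None
```

It takes any iterable of strings, so an open log file can be passed in directly and is read line by line.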
The only workaround I have so far is to reduce the amount of I/O. I can get the system to crash within an hour by rsync'ing to it at full speed, but I can delay the inevitable (by anywhere from a few hours to a day) by rsync'ing with a bandwidth limit of 1M.
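A small sketch of the throttled transfer, in case anyone wants to reproduce the slow case. It assumes "1M" means roughly 1 MiB/s and passes that to rsync's --bwlimit option as 1024, since a plain number there is read as KiB per second. The -a flag, the function name and the src/dest arguments are placeholders, not the exact command used here.

```python
from typing import List


def throttled_rsync_cmd(src: str, dest: str, bwlimit_kib: int = 1024) -> List[str]:
    """Build an rsync argv capped at `bwlimit_kib` KiB/s (1024 is about 1 MiB/s).

    Hypothetical helper; the result can be handed to subprocess.run().
    """
    if bwlimit_kib <= 0:
        raise ValueError("bwlimit_kib must be a positive number of KiB/s")
    return ["rsync", "-a", f"--bwlimit={bwlimit_kib}", src, dest]
```

Dropping the --bwlimit entry from the list gives the full-speed variant that crashes within an hour.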
I've run memtest on this system for around 10 hours and not encountered any reports of bad RAM.
I'm running archlinux with a manually compiled kernel (vanilla with spl/zfs patches):
kernel tag v3.17-rc5 on commit 9e82bf014195d6f0054982c463575cdce24292be
SPL at f9bde4f
ZFS at 2d50158
SPL/ZFS are compiled into the kernel statically with no loaded modules. ZFS userland tools are compiled from the same commit.
Unfortunately, due to the nature of the crash, I don't have many logs of it happening, but they all pretty much end up in lzjb_decompress with zio_decompress at the top of the stack (note that this panic is from a vanilla build and does not contain the no-cache-compression patch I made):