SPLError: 1620:0:(dbuf.c:101:dbuf_dest()) SPL PANIC #371
Comments
I just got a similar panic on a different machine. This one does not have ECC RAM, but it also had
The setup on this box is slightly simpler. It has two small one disk pools, but no L2ARC. The full dump:
@OmenWild Could you please post the contents of |
Unfortunately I have rebooted both boxes and disabled Also keep in mind that the box providing the backtrace in the second post did not have any L2ARC at the time. Current arcstats from the box that provided the two backtraces in the first post (it has 16GB RAM):
Since I disabled |
I have a theory which explains this. One which will probably interest @ryao. Normally when an object is freed back to the SPL slab, the registered destructors for that object are not run until the slab itself is freed. However, when using the Linux slab we must run the registered destructors immediately in the context on According to the VERIFY() in the crash the act of calling This has the potential to cause all sorts of otherwise mysterious problems which are exceptionally hard to debug. We definitely need to run down this race and get it resolved. This is also a really good argument for immediately calling the destructor as @ryao has proposed... although it's also now clear we can't do that until we find and fix this race...
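To illustrate the idea (a user-space sketch only, not SPL code; every name below is hypothetical): if the destructor runs at free time and checks the object's state, a second free of the same object is caught immediately instead of staying hidden until the slab is torn down. Here a `false` return stands in for the point where a kernel VERIFY() would panic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical object. `constructed` stands in for whatever state a
 * real destructor would check with VERIFY(). */
typedef struct sketch_obj {
    bool constructed;
} sketch_obj_t;

/* Hypothetical cache that only counts destructor failures. */
typedef struct sketch_cache {
    unsigned dtor_failures;
} sketch_cache_t;

static void sketch_ctor(sketch_obj_t *obj)
{
    obj->constructed = true;
}

/* Returns false where a kernel VERIFY() would panic. */
static bool sketch_dtor(sketch_obj_t *obj)
{
    if (!obj->constructed)
        return false;          /* already destroyed: double free */
    obj->constructed = false;
    return true;
}

/* Run the destructor immediately, in the context of the free. */
static bool sketch_cache_free(sketch_cache_t *cache, sketch_obj_t *obj)
{
    if (!sketch_dtor(obj)) {
        cache->dtor_failures++;
        return false;
    }
    return true;
}
```

Freeing the same object twice through `sketch_cache_free()` makes the second call fail, which is the user-space analogue of the destructor tripping an assertion on the object's state.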
@OmenWild Are you using xattr=sa? @behlendorf A double free makes sense here. We should try leveraging address sanitizer to help us track it down: https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
No
@ryao I suspect we can get quite far with this just through code inspection now that we have a couple stacks.
A cursory review of the code failed to reveal the double free. I am splitting the time I spend on this between code inspection and running address sanitizer on ztest. I have caught a few bugs via the latter method, but nothing that constitutes a serious issue like a double free.
@ryao Sure, address sanitizers are a good thing. I was just suggesting we probably don't need one to find this issue. We could also just use the SLAB_RED_ZONE and SLAB_POISON slab flags and a debug kernel to verify this.
Note that this doesn't need to be a double free; a use-after-free would cause the same issue.
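For reference, the poisoning idea can be sketched in user space as follows. This is an illustration under assumptions, not the kernel's implementation: the helper names are hypothetical, and only the fill value 0x6b is taken from the kernel (it matches POISON_FREE in the Linux slab code). A freed buffer is filled with the poison byte, and any byte that differs later means something wrote to the buffer after it was freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same value as the Linux kernel's POISON_FREE. */
#define SKETCH_POISON_FREE 0x6b

/* Hypothetical helper: fill a just-freed buffer with the poison byte. */
static void sketch_poison(unsigned char *buf, size_t len)
{
    memset(buf, SKETCH_POISON_FREE, len);
}

/* Hypothetical helper: true if nothing has written to the buffer
 * since it was poisoned. */
static bool sketch_poison_intact(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != SKETCH_POISON_FREE)
            return false;      /* use-after-free write detected */
    }
    return true;
}
```

A check like this only catches writes to freed memory; a stale read goes unnoticed unless the reader happens to choke on the poison pattern.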
This showed up in the buildbot on CentOS 7:
Closing, this hasn't been reproduced to my knowledge in years and has very likely been fixed. We can reopen in the ZFS track if someone is able to reproduce this.
Twice since upgrading to 0.6.3 I have seen SPL PANICs. They happen shortly after a cronjob kicks off an rsync of / (excluding the zpool of course) to a backup pool on ZFS. The latest was with kernel 3.14.10; the previous would have been 3.14.9.
The ZFS/SPL versions are the released packages:
Both times the process that triggered the PANIC was arc_adapt. The first time, the system kept working for around 36 hours and then hung. This time I'm going to reboot it before it gets there.
First backtrace from July 5th:
Second backtrace from today (July 8th):
I am running with ECC RAM and the system seems otherwise stable, so it doesn't seem immediately like a hardware error.
My two small (~2TB each) zpools are otherwise healthy and working fine. One pool has ~80GB of L2ARC.
Module options are:
Is there more data I can provide to help narrow this down?