VERIFY(c < (1ULL << 17) >> 9) failed, PANIC at zio.c:263:zio_data_buf_alloc() on import #2932
OK, this is probably all related to the xattr=sa vs. acltype=posixacl problem, as in openzfs/spl#390. Since the issue now occurs on importing the pool, is there any way to recover, perhaps even resorting to a hex editor? I don't have backups because these are the backups. :) I don't mind losing a filesystem or two, but losing the entire pool would be inconvenient. And no, -T doesn't help -- I tried thousands of txgs (counting back from the one printed by zdb -l), but none of them appear to be importable. And sorry about re-filing what now looks like a duplicate of openzfs/spl#390 -- I didn't realize this immediately.
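For reference, the kind of txg walk being described can be scripted; this is only a rough sketch, with a placeholder pool name ("backup") and device path, and the label parsing may need adjusting for the actual zdb -l output:

```sh
# Read the newest txg from one device's label, then try read-only, no-mount
# imports at progressively older txgs until one succeeds.
TXG=$(zdb -l /dev/disk/by-id/SOME-DISK-part1 | awk '/txg:/ {print $2; exit}')
for t in $(seq "$TXG" -1 $((TXG - 1000))); do
    zpool import -o readonly=on -N -T "$t" backup && break
done
```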
After finding the root cause of the dnode corruption (typically triggered by xattr=sa and acltype=posixacl), I gave up on trying to craft a workaround. It would be very difficult to catch all forms of corruption and, unfortunately, the handling of spill blocks spans several layers within the code, making a workaround that much more difficult.
Does that mean I'm screwed and will have to destroy the pool? Can't I somehow find the corrupt dnodes and corrupt their checksum or something? Or maybe destroy specific filesystems without importing the pool? I'm pretty desperate here. :)
@akorn Were either or both of the stack traces above the result of a -T import? Have you tried setting the recovery-related module options?
No, the stack traces are without -T. With -T it doesn't crash, it just says the pool is not importable because some of the devices are missing. My goal would be to remove the corrupted data (entire corrupted filesystems if necessary, that's not a big deal) but restore the pool to a usable state. I have tried neither of those options yet.
Woot, yes, setting those two options allowed me to import the pool! Is there a systematic way of finding out which filesystems are "bad", or should I take some shots in the, ah, insufficient light? Like I said, I can make informed guesses, but being certain would be even better. FWIW, I tried to mount all filesystems individually and could mount all but one (attempting to mount that one caused the box to lock up). Does this mean that it (and only it) contains the specific corruption that was preventing the pool from being imported, even with -T? Also, all of my filesystems have many snapshots. Would it make sense to try to roll them all back to a previous snapshot? Since these are backups, I can easily lose a few.
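A minimal sketch of the mount-one-at-a-time check described above, assuming a placeholder pool name ("backup"); whichever dataset hangs or panics the box is the prime suspect:

```sh
# Try mounting each filesystem individually and note which one misbehaves.
for fs in $(zfs list -H -o name -r backup); do
    echo "trying $fs"
    zfs mount "$fs" && zfs unmount "$fs"
done
```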
Well, attempting to destroy that filesystem doesn't seem to affect the pool itself very much, but it also doesn't succeed:
Is there any way I can get rid of the corrupt dataset?
Now that you've issued the destroy, if you want to work with the pool in this state, I'd suggest renaming or removing any existing cache file (zpool.cache), rebooting, and doing a little investigation with zdb rather than trying to import the pool right away. In this state, you'll need to use zdb's -e option, since there will be no cache file.
Any error you get while zdb produces this list might be useful in developing some sort of workaround. Worst case, I suppose, it might be possible to add a hack to the import code to ignore and clear the deferred free list, which would result in the leakage of all that space but should otherwise result in a usable pool. My recommendation, however, would be to try importing it at the txg prior to the destroy operation. You can use zdb to find a suitable txg.
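A sketch of that cache-file-free investigation, again with a placeholder pool name; which zdb passes are worth running depends on what turns up:

```sh
# Move the cache file aside so nothing auto-imports the pool, then inspect
# it with zdb in exported-pool mode (-e).
mv /etc/zfs/zpool.cache /etc/zfs/zpool.cache.bad
zdb -e -u backup      # uberblock, including the current txg
zdb -e -bb backup     # block statistics, including deferred-free space
```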
I issued the destroy but it didn't complete. The filesystem is still there (the kernel panicked before it could destroy the fs).
Good, then I'd suggest doing something like mounting the filesystem and seeing what you can clean out or salvage from it.
I can't mount it; mounting it leads to the same kind of panic that attempting to destroy it yields. Maybe that means the rootdir is corrupted?
You could try running zdb with enough -d options to dump the dataset's dnodes and start with the root directory.
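Something along these lines, with placeholder dataset and object numbers; the ZPL master node is object 1 and its ROOT entry gives the root directory's object number:

```sh
# Dump the master node to find the root directory object, then dump that
# object verbosely to inspect its bonus buffer / SA contents.
zdb -e -dddd backup/suspect-fs 1      # look for the "ROOT = <obj>" entry
zdb -e -dddddd backup/suspect-fs 34   # replace 34 with the ROOT object number
```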
OK, that does work, but I'm not sure what I'm looking for. The output looks like this:
I suppose this is not the messed-up entry because it doesn't have a SPILL_BLKPTR flag set.
Investigating the children of the rootdir, I found some that had SPILL_BLKPTR set; how do I recognize the corrupt one(s)? And what do I do once I find them? Hmmm, I'm not using your enhanced zdb; I suppose I should try with that, maybe it'll tell me more.
@dweeezil, I'm looking at your zdb now (from https://raw.githubusercontent.com/dweeezil/zfs/6efc9d3256d2aec8cd34533776f4bc0e37fc87b0/cmd/zdb/zdb.c), and it appears to be somewhat behind the latest master in a few places. Should I attempt to merge these versions somehow, or is it fine to just use your latest version as it is?
Your root directory does not appear to be corrupted, but I do find it odd that it only has a default ACL and not a regular ACL. Also, the bonus size of 292 is suspiciously close to that at which a spill would be necessary. Generally speaking, a corrupted dnode will cause zdb to generate some sort of error. It might be more expedient to put some telemetry in the module, such as a debug print that flags the offending object.
OK, assuming I find the corrupted dnode somehow -- what do I do with it then?
Run the enhanced zdb on it with 7 -d's and, depending on the type of corruption, we'll see if there's a way to craft a workaround.
I wrote a quick'n'dirty script to traverse the fs using zdb and GNU parallel. It found one node where your zdb errors out like this (but it's still running, so there may be more):
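A minimal sketch of that kind of traversal (dataset name, object range, and parallelism are placeholder assumptions; the actual script surely differs):

```sh
# Run zdb on each object number in parallel and record the ones that make
# it error out.  Object numbers that don't exist will also land in the list,
# but those are easy to filter afterwards.
seq 1 240000 | parallel -j 8 \
  'zdb -e -dddd backup/suspect-fs {} > /dev/null 2>&1 || echo {}' \
  > bad-objects.txt
```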
I'll need a more verbose zdb dump of that object.
Thanks for all the help, by the way!
@akorn I'm thinking the best solution to allow you to delete your corrupted filesystems is going to be to simply ignore all spill blocks and live with a bit of leaked space (the spill blocks would be orphaned). I've worked up one of the most gross hacks I can possibly imagine in https://github.com/dweeezil/zfs/tree/ignore-spill. It's extremely heavy-handed and is intended only for the purpose of importing a corrupted pool and running the destroy. I did just that on a test filesystem containing a single spill block and afterward, zdb shows:
which is exactly what I'd expect. You'll lose 2 blocks for every spill block (they're duped since they're considered to be metadata). I'd like to know whether you've found any other corrupted dnodes before you try running this patch. If the corruption is all of the same type, then this may just do the trick.
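A sketch of how such a patched build might be used, with placeholder names: import without mounting, destroy the damaged dataset, export, then go back to a stock module before using the pool normally.

```sh
# With the ignore-spill build installed and its module loaded:
zpool import -N backup             # import without mounting anything
zfs destroy -r backup/suspect-fs   # drop the corrupted filesystem
zpool export backup
# ...then reinstall the stock module before importing for real use.
```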
So far there is only the one (corrupted dnode), and the recursive traversal is almost complete: 218137 objects examined of 231259 objects discovered so far -- of course it's just possible it'll stumble on a high-degree directory... How safe would you estimate your gross hack to be? I can certainly live with some leaked space, but hosing the rest of the pool would be bad. :) Also, as of current master, is xattr=sa, acltype=posixacl believed to be safe?
I'd suggest importing the pool with the same recovery options as before. Given that the only SA layout on the filesystem with a small number of SAs is "5 = [20]" and that 20 is DXATTR, the only thing spill blocks are likely used for is xattrs. This means the gross hack should be pretty safe.
@akorn Just checking in. Any progress with this?
Thanks for keeping tabs on it; I got sidetracked but haven't forgotten. I'll try your gross hack in the next 3-4 days and I'll certainly report back. Incidentally, as of current master, is xattr=sa, acltype=posixacl believed to be safe?
@akorn In my opinion, current master code, which contains 4254acb, should be safe. I'll note, however, that the bug fixed by that patch is not directly related to either xattr=sa or acltype=posixacl.
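A quick way to check whether a given checkout already contains that commit, assuming you're inside the zfs source tree:

```sh
# Succeeds (and prints the message) only if 4254acb is an ancestor of HEAD.
git merge-base --is-ancestor 4254acb HEAD && echo "fix is included"
```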
@dweeezil I can't seem to compile your gross hack.
(There are more redefinitions.) Maybe this hacked version requires a specific kernel? Or am I doing something wrong? I configured it with:
(/usr/src/spl is the source I built spl from)
@akorn How did you prepare the source code? Was it a straight checkout of the ignore-spill branch from https://github.com/dweeezil/zfs or did you try applying the dweeezil/zfs@35660b3 commit to some other tree? If the latter, I'd expect it not to work very well (nor to even apply cleanly for that matter). The former should work properly and appears to still be based on master code as of this moment.
@dweeezil Um. Alas, I suck at git. I went to the URL you provided (https://github.com/dweeezil/zfs/tree/ignore-spill) and copied the URL from the field on the right (https://github.com/dweeezil/zfs.git) into a plain git clone. Looking at it now, it seems obvious that I didn't get the branch I should have. I now went and did a proper checkout of the ignore-spill branch.
Next, I re-ran configure; then I ran make again, and I got the same error message as before. I imagine I'm missing something obvious, but I don't know what.
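For what it's worth, a from-scratch build of exactly that branch might look like this; the --with-spl path follows the /usr/src/spl mentioned above, everything else is an assumption about a fairly standard zfsonlinux build:

```sh
git clone -b ignore-spill https://github.com/dweeezil/zfs.git
cd zfs
git log --oneline -1                  # the ignore-spill commit should be on top
sh autogen.sh
./configure --with-spl=/usr/src/spl
make
```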
@akorn I'm not sure at the moment why you can't compile the module but the errors you're seeing don't have anything to do with my patch. I suspect you'd not be able to compile a current master checkout, either. Is there a chance your currently-installed spl isn't new enough?
My spl is master as of 2014-11-25. I suspect the problem arises somehow from my mixing a debianized installation of spl-dkms (built from the pkg-spl git tree) with the checkout directory I built the .deb from and a complete source installation of your zfs branch. I don't even begin to understand the build process well enough to see how and why it breaks. FWIW, I have now (manually) applied your patch to the zfs master as of 25 November and am building a .deb from that. Applying the patch was straightforward, but I can't deny a certain amount of trepidation. :)
Well, destroying the fs worked. I imported the pool with the patched module, and I now have this in dmesg:
Any suggestions?
@akorn That's one of the groups of asserts my gross hack removed. There still may be a way to get this pool patched up. I guess I was a bit optimistic that a single destroy would clean everything up. It would be interesting to see what the zdb output for the affected objects looks like.
@dweeezil Yes, I realize I'm hitting an assertion your patch removed; I had hoped/thought removing the ASSERT would only be needed while I destroy the corrupt fs. zdb output:
Thanks for not giving up on this, btw. :)
I have a new error message on mounting a specific fs:
@dweeezil Quick question regarding this bug. I have had this problem in the past (my post is referenced in the first post in this thread) but have subsequently destroyed any related pools. I am, however, still using xattr=sa and acltype=posixacl. I do need posixacl set, but is this error mitigated by setting xattr to default instead of sa? Or is this problem still possible when acltype=posixacl is set without xattr=sa? Just looking to stop this error from happening again altogether until a fix is out.
@dweeezil Thanks, good to know. I'm not in a position where I will be able to update the ZFS build for a while though. So, in the meantime, will rsyncing to a dataset with acltype=posixacl and xattr=dir be sufficient to avoid this potential problem?
@cointer Yes, but the correct option is xattr=on.
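A sketch of that interim setup, with placeholder names: a destination dataset that keeps POSIX ACLs but stores xattrs in directories, fed by an rsync that preserves ACLs and xattrs.

```sh
# Destination dataset: POSIX ACLs enabled, xattrs stored as hidden directories.
zfs create -o acltype=posixacl -o xattr=on backup/newcopy
# -A preserves ACLs, -X preserves xattrs.
rsync -aAX /source/dir/ /backup/newcopy/
```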
Don't know if this is related, but I have a similar problem with 0.6.5 during a scrub:
[64345.893387] VERIFY3(mintxg == dsl_dataset_phys(ds1)->ds_prev_snap_txg) failed (6614665 == 6614099)
Closing. This should be resolved in the current master branch. @alexanderhaensch, your issue looks a little different; can you open a new issue for it?
Hi,
I have a pool that had the "ASSERTION(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT) failed, SPLError: 5408:0:(zio.c:263:zio_data_buf_alloc()) SPL PANIC" problem like in #2678:
I rebuilt zfsonlinux (spl as well as zfs) using current git master (so it has the fix from 4a7480a) and tried with that. Now I get a different panic message on import:
Would it make sense to try importing a previous txg using zpool import -T?
Also, fwiw, this box has ECC memory.