Broken ZIL (during heavy testing of #6566) #6856
Looking at the replay headers in zdb, it looks like they're all at 0:
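(For anyone retracing this: a minimal sketch of the kind of zdb invocation that dumps the per-dataset ZIL headers - the pool name is a placeholder, not the actual pool here.)

```sh
# Dump dataset details at high verbosity against the exported pool; the
# "ZIL header" lines carry the claim_txg / replay_seq fields per dataset.
# "tank" is a placeholder pool name.
zdb -e -dddd tank | grep -A 2 'ZIL header'
```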
Rather curious to see how this can get tracked down with zdb.
Adding -hh to the invocation produces a ton of:
Pulling the TXGs from that output, sorting them, and using either of the last two via `zpool import -FX -T 36719543 -N pool-ID` results in:
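(Roughly what that rewind attempt looks like as a command, with the pool name as a placeholder; importing read-only and unmounted first is an extra precaution, not part of the original attempt.)

```sh
# 36719543 is one of the candidate TXGs pulled from the zdb -hh output above.
# readonly=on keeps the rewind attempt from writing anything, and -N skips
# mounting, so the result can be inspected before committing to it.
zpool import -o readonly=on -N -FX -T 36719543 tank
```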
As I go through and re-add PRs to the master branch to figure out where the errors started, I'm beginning to think that the crypto fmt changes might be related, since I've been building most of the recent stacks with the datto/crypt_disk_fmt branch pulled in. @tcaputi: am I doing something obscenely dumb by pulling off that branch?
Hmm, so I'm somewhat at the end of my rope here - I have all the data recovered from the pool staged on an adjacent SSD pool, and will need to pull the trigger on running labelclear on the constituent VDEVs sometime this morning to seed a new pool with the restored data, unless I can recover this one.
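(If anyone ends up in the same spot, a minimal sketch of that labelclear step - the device paths below are hypothetical dm-crypt mappings, not the actual ones from this pool.)

```sh
# Wipe the ZFS labels from each constituent vdev so the devices can be
# reused for a fresh pool; -f is required since they still carry labels
# from the old pool. Device names are hypothetical.
for dev in /dev/mapper/crypt-ssd0 /dev/mapper/crypt-ssd1 /dev/mapper/crypt-ssd2 \
           /dev/mapper/crypt-ssd3 /dev/mapper/crypt-ssd4; do
    zpool labelclear -f "$dev"
done
```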
I would be surprised if that branch broke the ZIL just because it doesn't really touch that, but that code is really raw and completely untested. In fact, I have since decided to take a different approach and have created a new branch for that, so it is unlikely that the code from that branch will end up getting merged.
@tcaputi: as I mentioned in #6566, I think it's actually the improved scrub branch that's causing this. The PR has been in an approved state for quite a while and has just been running as part of our test and now prod stacks, but it may have developed unhappy interactions in the interim. Without the now-abandoned branch and the scrub improvements stuff, I am no longer seeing the source DVA error which started all of this. Have you guys seen any crashes in ztest on that branch?
We haven't seen any crashes in ztest or production tests in quite some time on that patch (or at least not anything that doesn't occasionally pop up in other PRs). The scrub code doesn't really interact with the ZIL in any new ways, so I would be very surprised if this caused any issues.
So I destroyed the pool, restored from an older snap, and after another crash I again see:
when I try to import after a reboot. Not great.
Closed as stale.
While trying to diagnose a DVA error seen in ztest on some hosts, I seem to have broken a production pool's intent log. The pool is a raidz consisting of 5 SSDs on dm-crypt - a pretty much universal setup around here. During one of the crashes seen yesterday, something bad happened to the ZIL such that import fails, crashing the system spectacularly:
I tried to import with different revisions of ZFS compatible with the feature flags on that pool (luckily not a crypto pool), all with the same result. The -m and -F flags also give nada. Google searches for ignoring the ZIL lead to SLOG-related issues, though I seem to recall that someone had a patch allowing an import that drops the current ZIL state to avoid this sort of mess.
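(One possible avenue, sketched under the assumption that the zil_replay_disable and zfs_recover module parameters behave as advertised on this build - not a confirmed fix for this particular failure; the pool name is a placeholder.)

```sh
# Skip ZIL replay on the next import and be more tolerant of damage.
# Anything that lived only in the ZIL (the last few seconds of writes)
# is discarded rather than replayed.
echo 1 > /sys/module/zfs/parameters/zil_replay_disable
echo 1 > /sys/module/zfs/parameters/zfs_recover
# Import read-only and unmounted first to see whether the pool survives.
zpool import -o readonly=on -N tank
```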
I'm pushing a restore from just before this all started back to another pool in the system, but it would be nice not to be missing a couple of days of delta.
@dweeezil, @prakashsurya, @behlendorf: do you folks happen to recall if and where such a ZIL-drop patch might be found? Probably something we'd want to document and push up in the search results if it exists.