Issue 2094 #2109
Conversation
Commit 1421c89 added a field to zbookmark_t that unintentionally caused a disk format change. This negatively affected backward compatibility and platform portability, so the field is being removed. The functionality that field enabled is left unimplemented until a later patch reimplements it in a way that does not affect the disk format. Signed-off-by: Richard Yao <[email protected]>
ZoL commit 1421c89 unintentionally changed the disk format in a forward-compatible, but not backward compatible, way. This was accomplished by adding an entry to zbookmark_t, which is included in a couple of on-disk structures. That led to the creation of pools with incorrect dsl_scan_phys_t objects that could only be imported by versions of ZFSOnLinux containing that commit. Such pools cannot be imported by other versions of ZFS, including past and future versions of ZoL. The additional field has been removed by a prior patch.

This patch permits affected pools to be imported and repaired by implementing a workaround for the erratum. When this workaround is triggered, a kernel alert describing it is printed to dmesg, asking the system administrator to run a scrub. A subsequent scrub will replace the damaged dsl_scan_phys_t object, repairing the pool.

Pools affected by the damaged dsl_scan_phys_t can be detected prior to an upgrade by running the following command as root:

zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w

Note that `poolname` must be replaced with the name of the pool you wish to check. A value of 25 indicates the dsl_scan_phys_t has been damaged, a value of 24 indicates that it is normal, and a value of 0 indicates that no scrub has ever been run on the pool.

The regression caused by the change to zbookmark_t never made it into a tagged release or any Gentoo backports; only those using HEAD were affected.

However, this patch has a few limitations. There is no way to detect a damaged dsl_scan_phys_t object when the damage occurred on a 32-bit system, where integer rounding wrote incorrect information but avoided the overflow; correcting such issues requires triggering a scrub. In addition, bptree_entry_phys_t objects are also affected. These objects only exist during an asynchronous destroy, and automating repair of damaged bptree_entry_phys_t objects is non-trivial. Any pools that have been imported by an affected version of ZoL must have all asynchronous destroy operations finish before export and subsequent import by a version containing this commit; failure to do so will prevent pool import. The presence of any background destroys on imported pools can be checked by running `zpool get freeing` as root, which will display a non-zero value for any pool with an active asynchronous destroy.

Lastly, it is expected that no user data has been lost as a result of this erratum.

Closes openzfs#2094
Original-patch-by: Tim Chase <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
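The two checks above can be combined into a small shell sketch. This is only illustrative and is built from the commands quoted in the commit message; `tank` is a placeholder pool name and both commands must be run as root:

```sh
# Count the words in the pool's on-disk scan object:
# 25 => damaged dsl_scan_phys_t, 24 => normal, 0 => the pool has never been scrubbed.
pool=tank
words=$(zdb -dddd "$pool" 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w)
echo "scan object word count for $pool: $words"

# Before exporting the pool from an affected ZoL version, make sure no
# asynchronous destroy is still in progress ('freeing' must read 0).
zpool get freeing "$pool"
```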
Other than the note I just sent, looks good to me. These instructions should be in the release notes for 0.6.3 (we do have release notes, don't we :) |
I'm conflicted about this point. This will never be an issue for people who have only run the tagged releases and I've only written up release notes for the tags.
I usually write up something when I tag. Although I/we could do a lot better and commit them to the tree. |
What about adding a flag to the spa struct that would get scooped up here:
and then generate a message here:
|
@dweeezil I think that's a nice solution. It alleviates my concerns about an end user never knowing they need to do a scrub, and hopefully @ryao's concerns about doing it automatically. Perhaps something like this (untested):

diff --git a/cmd/zpool/zpool_main.c b/cmd/zpool/zpool_main.c
index a856fd4..9b4b904 100644
--- a/cmd/zpool/zpool_main.c
+++ b/cmd/zpool/zpool_main.c
@@ -3903,6 +3903,19 @@ print_scan_status(pool_scan_stat_t *ps)
return;
}
+ /*
+ * Scan required, https://github.com/zfsonlinux/zfs/issues/2094.
+ */
+ if ((ps->pss_errata & POOL_ERRATA_2094) &&
+ (ps->pss_state == DSS_FINISHED || ps->pss_state == DSS_CANCELED)) {
+ (void) printf(gettext("detected pool compatibility issue\n"));
+ (void) printf(gettext(
+ " see: https://github.com/zfsonlinux/zfs/issues/2094\n"));
+ (void) printf(gettext(
+ "action: To correct the issue run 'zpool scrub'.\n"));
+ return;
+ }
+
start = ps->pss_start_time;
end = ps->pss_end_time;
zfs_nicenum(ps->pss_processed, processed_buf, sizeof (processed_buf));

Which should show the end user a message like this when they run `zpool status`:
|
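For illustration, based purely on the printf calls in the sketch above, the extra lines in the scan section of `zpool status` would read roughly:

```
detected pool compatibility issue
 see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
```

(The exact indentation depends on how the rest of print_scan_status formats the scan section.)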
@behlendorf Yes, that's exactly what I had in mind. What are your thoughts on the need to handle the hopefully very rare situation where an async destroy is in progress? It seems we'd have to fail the import in that case. We could look at the new spa flag following the dsl_scan_init() call at the bottom of dsl_pool_open and bail if the earlier async destroy stuff had been triggered. Presumably just about any error we return will be propagated back as some sort of un-importable pool failure. |
@dweeezil I think we shouldn't try to handle the async destroy case, which I expect will be very rare. Those few (if any) users who are impacted can just continue to run the version of the code they had until the async destroy completes. Once it's done, they should be able to update their code and then run a scrub. I'd prefer to avoid inventing more machinery for this than we absolutely need to. |
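A rough sketch of that workflow (placeholder pool name `tank`, run as root; assumes a `zpool get` that supports the -H/-o/-p scripting flags, otherwise read the default output directly):

```sh
# Still on the old (affected) code: wait for any in-progress asynchronous
# destroy to drain. The 'freeing' property is non-zero while one is active.
while [ "$(zpool get -Hp -o value freeing tank)" != "0" ]; do
    sleep 60
done

# After updating to code containing the fix, trigger the repair:
zpool scrub tank
```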
@behlendorf I agree. |
@behlendorf @dweeezil We can partially handle the async_destroy case by checking if the feature is activated when we detect the corrupt object in dsl_scan_init(). There is already a check there, so we simply need to hook into it. This works because the feature is deactivated whenever an async_destroy finishes. |
@ryao Do you have an update for this? I'd like to push this over the finish line this week; I can propose the patch if you're swamped. |
Closed in favor of #2112 |
This builds on previous work by @dweeezil in #2097 and should be ready to be merged.