NULL pointer dereference when attempting to destroy snapshots en masse to allow a pool to resilver #8237
Comments
Tried upgrading to zfs 0.7.12 ( The effect seems to be any operations involving snapshots stop working, and the pool stops being able to accept writes though my system stays up till I reboot. A zpool resilver operation will run to completion, but fail to ever actually finish replacing resilvered disks (i.e. it completes but the old disk cannot be detached, offlined or removed).
So the pool in question which has this problem is still struggling along, which is to say, the affected datasets can be moved around and renamed, but not deleted (because the kernel thread crashes with the above). I've been doing a little digging and to my (inexperienced) eyes, this looks like a logic bug in the relevant functions (comments added inline to show my thinking). From
static void
dsl_deadlist_insert_bpobj(dsl_deadlist_t *dl, uint64_t obj, uint64_t birth,
    dmu_tx_t *tx)
{
	dsl_deadlist_entry_t dle_tofind;
	dsl_deadlist_entry_t *dle;
	avl_index_t where;
	uint64_t used, comp, uncomp;
	bpobj_t bpo;

	ASSERT(MUTEX_HELD(&dl->dl_lock));

	VERIFY0(bpobj_open(&bpo, dl->dl_os, obj));
	VERIFY0(bpobj_space(&bpo, &used, &comp, &uncomp));
	bpobj_close(&bpo);

	dsl_deadlist_load_tree(dl);

	dmu_buf_will_dirty(dl->dl_dbuf, tx);
	dl->dl_phys->dl_used += used;
	dl->dl_phys->dl_comp += comp;
	dl->dl_phys->dl_uncomp += uncomp;

	dle_tofind.dle_mintxg = birth;
	// this is the start of where it goes wrong
	dle = avl_find(&dl->dl_tree, &dle_tofind, &where);
	// if this fails, we do catch the NULL and fall back to avl_nearest...
	if (dle == NULL)
		// ...but this result is never NULL-checked, even though avl_nearest
		// can return NULL (and in my pool, I guess due to data loss, it does)
		dle = avl_nearest(&dl->dl_tree, where, AVL_BEFORE);
	dle_enqueue_subobj(dl, dle, obj, tx); // the crash happens inside here
}
Looking at
void *
avl_nearest(avl_tree_t *tree, avl_index_t where, int direction)
{
	int child = AVL_INDEX2CHILD(where);
	avl_node_t *node = AVL_INDEX2NODE(where);
	void *data;
	size_t off = tree->avl_offset;

	if (node == NULL) { // nothing stopping us returning NULL
		ASSERT(tree->avl_root == NULL);
		return (NULL);
	}
	data = AVL_NODE2DATA(node, off);
	if (child != direction)
		return (data);

	return (avl_walk(tree, data, direction));
}
I'm not really clear enough on what the correct fix here would be, since the on-disk data is wrecked at this point. But since the data is already recorded as a permanent error and non-existent, it seems it should be possible to bail out of this in the snapshot delete scenario (or at least not crash a kernel thread).
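To make the "bail out instead of crashing" idea concrete, here is a minimal, self-contained sketch of the guard. It is not ZFS code and not a proposed patch: `demo_entry_t`, `demo_list_t`, `demo_find_at_or_before` and `demo_insert_checked` are hypothetical stand-ins, a sorted array replaces the AVL tree, and returning `ENOENT` is only one assumed way of reporting the failure to the caller.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a deadlist entry (not dsl_deadlist_entry_t). */
typedef struct demo_entry {
	uint64_t mintxg;	/* key the entries are sorted by */
	uint64_t subobj_count;	/* counts "enqueued" sub-objects */
} demo_entry_t;

/* Hypothetical stand-in for the deadlist tree: an array sorted by mintxg. */
typedef struct demo_list {
	demo_entry_t *entries;
	size_t count;
} demo_list_t;

/*
 * Return the entry with the largest mintxg <= birth, or NULL if the list
 * is empty or every entry is newer than birth. This mimics the combined
 * avl_find() + avl_nearest(AVL_BEFORE) lookup, including its NULL result.
 */
static demo_entry_t *
demo_find_at_or_before(demo_list_t *dl, uint64_t birth)
{
	demo_entry_t *best = NULL;

	for (size_t i = 0; i < dl->count; i++) {
		if (dl->entries[i].mintxg > birth)
			break;
		best = &dl->entries[i];
	}
	return (best);
}

/*
 * Guarded insert: check the lookup result before using it, and report an
 * error (assumed here to be ENOENT) rather than dereferencing NULL.
 */
static int
demo_insert_checked(demo_list_t *dl, uint64_t birth)
{
	demo_entry_t *dle = demo_find_at_or_before(dl, birth);

	if (dle == NULL)
		return (ENOENT);	/* nothing to attach to: bail out */

	dle->subobj_count++;		/* stands in for the enqueue step */
	return (0);
}
```

With a list holding entries at mintxg 10 and 20, `demo_insert_checked(&dl, 15)` attaches to the first entry and returns 0, while a birth of 5, or any birth against an empty list, returns `ENOENT` instead of crashing. Whether the real function can propagate an error like this depends on its callers, which this sketch does not model.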
Thanks for digging in to this. It does seem that this function should include at least some recovery logic along the lines of what's done in |
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Snapshot deletion stalls and zpool status -v cannot list all errors when trying to delete snapshots on a pool with data errors.
Describe how to reproduce the problem
Not sure - this has happened on a pool which is displaying this structure (due to a multi-disk failure)
The disk replacements refuse to complete, and I was attempting to remove damaged snapshots to allow them to complete.
Include any warning/errors/backtraces from the system logs