Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
3956 ::vdev -r should work with pipelines
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Adam Leventhal <[email protected]>
Approved by: Richard Lowe <[email protected]>

References:
  https://www.illumos.org/issues/3956
  https://www.illumos.org/issues/3957
  https://www.illumos.org/issues/3958
  https://www.illumos.org/issues/3959
  https://www.illumos.org/issues/3960
  https://www.illumos.org/issues/3961
  https://www.illumos.org/issues/3962
  illumos/illumos-gate@b4952e1

Ported-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>

Porting notes:

1. zfs_dbgmsg_print() is only used in userland. Since we do not have
   mdb on Linux, it does not make sense to make it available in the
   kernel. This means that a build failure will occur if any future
   kernel patch depends on it. However, that is unlikely given that
   this functionality was added to support zdb.

2. zfs_dbgmsg_print() is only invoked for -VVV or greater log levels.
   This preserves the existing behavior of minimal noise when running
   with -V, and -VV.

3. In vdev_config_generate() the call to nvlist_alloc() was not
   changed to fnvlist_alloc() because we must pass KM_PUSHPAGE in
   the txg_sync context.
  • Loading branch information
grwilson authored and behlendorf committed Nov 5, 2013
1 parent 621dd7b commit 5d1f7fb
Show file tree
Hide file tree
Showing 11 changed files with 233 additions and 95 deletions.
25 changes: 20 additions & 5 deletions cmd/ztest/ztest.c
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2012 by Delphix. All rights reserved.
* Copyright (c) 2013 by Delphix. All rights reserved.
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
* Copyright (c) 2013 Steven Hartland. All rights reserved.
*/
Expand Down Expand Up @@ -789,6 +789,18 @@ ztest_kill(ztest_shared_t *zs)
{
zs->zs_alloc = metaslab_class_get_alloc(spa_normal_class(ztest_spa));
zs->zs_space = metaslab_class_get_space(spa_normal_class(ztest_spa));

/*
* Before we kill off ztest, make sure that the config is updated.
* See comment above spa_config_sync().
*/
mutex_enter(&spa_namespace_lock);
spa_config_sync(ztest_spa, B_FALSE, B_FALSE);
mutex_exit(&spa_namespace_lock);

if (ztest_opts.zo_verbose >= 3)
zfs_dbgmsg_print(FTAG);

(void) kill(getpid(), SIGKILL);
}

Expand Down Expand Up @@ -2774,7 +2786,7 @@ ztest_vdev_attach_detach(ztest_ds_t *zd, uint64_t id)
uint64_t leaf, top;
uint64_t ashift = ztest_get_ashift();
uint64_t oldguid, pguid;
size_t oldsize, newsize;
uint64_t oldsize, newsize;
char *oldpath, *newpath;
int replacing;
int oldvd_has_siblings = B_FALSE;
Expand Down Expand Up @@ -2935,8 +2947,8 @@ ztest_vdev_attach_detach(ztest_ds_t *zd, uint64_t id)
if (error != expected_error && expected_error != EBUSY) {
fatal(0, "attach (%s %llu, %s %llu, %d) "
"returned %d, expected %d",
oldpath, (longlong_t)oldsize, newpath,
(longlong_t)newsize, replacing, error, expected_error);
oldpath, oldsize, newpath,
newsize, replacing, error, expected_error);
}
out:
mutex_exit(&ztest_vdev_lock);
Expand Down Expand Up @@ -4957,7 +4969,7 @@ ztest_fault_inject(ztest_ds_t *zd, uint64_t id)
*/
if (vd0 != NULL && maxfaults != 1 &&
(!vdev_resilver_needed(vd0->vdev_top, NULL, NULL) ||
vd0->vdev_resilvering)) {
vd0->vdev_resilver_txg != 0)) {
/*
* Make vd0 explicitly claim to be unreadable,
* or unwriteable, or reach behind its back
Expand Down Expand Up @@ -5817,6 +5829,9 @@ ztest_run(ztest_shared_t *zs)
zs->zs_alloc = metaslab_class_get_alloc(spa_normal_class(spa));
zs->zs_space = metaslab_class_get_space(spa_normal_class(spa));

if (ztest_opts.zo_verbose >= 3)
zfs_dbgmsg_print(FTAG);

umem_free(tid, ztest_opts.zo_threads * sizeof (kt_did_t));

/* Kill the resume thread */
Expand Down
31 changes: 31 additions & 0 deletions include/sys/dsl_scan.h
Original file line number Diff line number Diff line change
Expand Up @@ -72,11 +72,42 @@ typedef enum dsl_scan_flags {
DSF_VISIT_DS_AGAIN = 1<<0,
} dsl_scan_flags_t;

/*
* Every pool will have one dsl_scan_t and this structure will contain
* in-memory information about the scan and a pointer to the on-disk
* representation (i.e. dsl_scan_phys_t). Most of the state of the scan
* is contained on-disk to allow the scan to resume in the event of a reboot
* or panic. This structure maintains information about the behavior of a
* running scan, some caching information, and how it should traverse the pool.
*
* The following members of this structure direct the behavior of the scan:
*
* scn_pausing - a scan that cannot be completed in a single txg or
* has exceeded its allotted time will need to pause.
* When this flag is set the scanner will stop traversing
* the pool and write out the current state to disk.
*
* scn_restart_txg - directs the scanner to either restart or start a
* a scan at the specified txg value.
*
* scn_done_txg - when a scan completes its traversal it will set
* the completion txg to the next txg. This is necessary
* to ensure that any blocks that were freed during
* the scan but have not yet been processed (i.e deferred
* frees) are accounted for.
*
* This structure also maintains information about deferred frees which are
* a special kind of traversal. Deferred free can exist in either a bptree or
* a bpobj structure. The scn_is_bptree flag will indicate the type of
* deferred free that is in progress. If the deferred free is part of an
* asynchronous destroy then the scn_async_destroying flag will be set.
*/
typedef struct dsl_scan {
struct dsl_pool *scn_dp;

boolean_t scn_pausing;
uint64_t scn_restart_txg;
uint64_t scn_done_txg;
uint64_t scn_sync_start_time;
zio_t *scn_zio_root;

Expand Down
4 changes: 2 additions & 2 deletions include/sys/fs/zfs.h
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@

/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2012 by Delphix. All rights reserved.
* Copyright (c) 2013 by Delphix. All rights reserved.
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
* Copyright (c) 2012, Joyent, Inc. All rights reserved.
*/
Expand Down Expand Up @@ -530,7 +530,7 @@ typedef struct zpool_rewind_policy {
#define ZPOOL_CONFIG_SPLIT_GUID "split_guid"
#define ZPOOL_CONFIG_SPLIT_LIST "guid_list"
#define ZPOOL_CONFIG_REMOVING "removing"
#define ZPOOL_CONFIG_RESILVERING "resilvering"
#define ZPOOL_CONFIG_RESILVER_TXG "resilver_txg"
#define ZPOOL_CONFIG_COMMENT "comment"
#define ZPOOL_CONFIG_SUSPENDED "suspended" /* not stored on disk */
#define ZPOOL_CONFIG_TIMESTAMP "timestamp" /* not stored on disk */
Expand Down
4 changes: 2 additions & 2 deletions include/sys/vdev_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2012 by Delphix. All rights reserved.
* Copyright (c) 2013 by Delphix. All rights reserved.
*/

#ifndef _SYS_VDEV_IMPL_H
Expand Down Expand Up @@ -182,7 +182,7 @@ struct vdev {
uint64_t vdev_faulted; /* persistent faulted state */
uint64_t vdev_degraded; /* persistent degraded state */
uint64_t vdev_removed; /* persistent removed state */
uint64_t vdev_resilvering; /* persistent resilvering state */
uint64_t vdev_resilver_txg; /* persistent resilvering state */
uint64_t vdev_nparity; /* number of parity devices for raidz */
char *vdev_path; /* vdev path (if any) */
char *vdev_devid; /* vdev devid (if any) */
Expand Down
3 changes: 2 additions & 1 deletion include/sys/zfs_debug.h
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2012 by Delphix. All rights reserved.
* Copyright (c) 2013 by Delphix. All rights reserved.
*/

#ifndef _SYS_ZFS_DEBUG_H
Expand Down Expand Up @@ -95,6 +95,7 @@ extern void zfs_dbgmsg_fini(void);
#define zfs_dbgmsg(...) dprintf(__VA_ARGS__)
#else
extern void zfs_dbgmsg(const char *fmt, ...);
extern void zfs_dbgmsg_print(const char *tag);
#endif

#ifndef _KERNEL
Expand Down
21 changes: 16 additions & 5 deletions module/zfs/dsl_scan.c
Original file line number Diff line number Diff line change
Expand Up @@ -190,6 +190,7 @@ dsl_scan_setup_sync(void *arg, dmu_tx_t *tx)
scn->scn_phys.scn_errors = 0;
scn->scn_phys.scn_to_examine = spa->spa_root_vdev->vdev_stat.vs_alloc;
scn->scn_restart_txg = 0;
scn->scn_done_txg = 0;
spa_scan_stat_init(spa);

if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
Expand Down Expand Up @@ -776,7 +777,7 @@ dsl_scan_visitbp(blkptr_t *bp, const zbookmark_t *zb,
* Don't scan it now unless we need to because something
* under it was modified.
*/
if (bp->blk_birth <= scn->scn_phys.scn_cur_max_txg) {
if (BP_PHYSICAL_BIRTH(bp) <= scn->scn_phys.scn_cur_max_txg) {
scan_funcs[scn->scn_phys.scn_func](dp, bp, zb);
}
if (buf)
Expand Down Expand Up @@ -1227,7 +1228,7 @@ dsl_scan_ddt_entry(dsl_scan_t *scn, enum zio_checksum checksum,

for (p = 0; p < DDT_PHYS_TYPES; p++, ddp++) {
if (ddp->ddp_phys_birth == 0 ||
ddp->ddp_phys_birth > scn->scn_phys.scn_cur_max_txg)
ddp->ddp_phys_birth > scn->scn_phys.scn_max_txg)
continue;
ddt_bp_create(checksum, ddk, ddp, &bp);

Expand Down Expand Up @@ -1475,6 +1476,16 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
if (scn->scn_phys.scn_state != DSS_SCANNING)
return;

if (scn->scn_done_txg == tx->tx_txg) {
ASSERT(!scn->scn_pausing);
/* finished with scan. */
zfs_dbgmsg("txg %llu scan complete", tx->tx_txg);
dsl_scan_done(scn, B_TRUE, tx);
ASSERT3U(spa->spa_scrub_inflight, ==, 0);
dsl_scan_sync_state(scn, tx);
return;
}

if (scn->scn_phys.scn_ddt_bookmark.ddb_class <=
scn->scn_phys.scn_ddt_class_max) {
zfs_dbgmsg("doing scan sync txg %llu; "
Expand Down Expand Up @@ -1510,9 +1521,9 @@ dsl_scan_sync(dsl_pool_t *dp, dmu_tx_t *tx)
(longlong_t)NSEC2MSEC(gethrtime() - scn->scn_sync_start_time));

if (!scn->scn_pausing) {
/* finished with scan. */
zfs_dbgmsg("finished scan txg %llu", (longlong_t)tx->tx_txg);
dsl_scan_done(scn, B_TRUE, tx);
scn->scn_done_txg = tx->tx_txg + 1;
zfs_dbgmsg("txg %llu traversal complete, waiting till txg %llu",
tx->tx_txg, scn->scn_done_txg);
}

if (DSL_SCAN_IS_SCRUB_RESILVER(scn)) {
Expand Down
11 changes: 3 additions & 8 deletions module/zfs/spa.c
Original file line number Diff line number Diff line change
Expand Up @@ -4509,7 +4509,7 @@ spa_vdev_attach(spa_t *spa, uint64_t guid, nvlist_t *nvroot, int replacing)
}

/* mark the device being resilvered */
newvd->vdev_resilvering = B_TRUE;
newvd->vdev_resilver_txg = txg;

/*
* If the parent is not a mirror, or if we're replacing, insert the new
Expand Down Expand Up @@ -5370,13 +5370,6 @@ spa_vdev_resilver_done_hunt(vdev_t *vd)
return (oldvd);
}

if (vd->vdev_resilvering && vdev_dtl_empty(vd, DTL_MISSING) &&
vdev_dtl_empty(vd, DTL_OUTAGE)) {
ASSERT(vd->vdev_ops->vdev_op_leaf);
vd->vdev_resilvering = B_FALSE;
vdev_config_dirty(vd->vdev_top);
}

/*
* Check for a completed replacement. We always consider the first
* vdev in the list to be the oldest vdev, and the last one to be
Expand Down Expand Up @@ -5466,6 +5459,8 @@ spa_vdev_resilver_done(spa_t *spa)
ASSERT(pvd->vdev_ops == &vdev_replacing_ops);
sguid = ppvd->vdev_child[1]->vdev_guid;
}
ASSERT(vd->vdev_resilver_txg == 0 || !vdev_dtl_required(vd));

spa_config_exit(spa, SCL_ALL, FTAG);
if (spa_vdev_detach(spa, guid, pguid, B_TRUE) != 0)
return;
Expand Down
9 changes: 7 additions & 2 deletions module/zfs/spa_config.c
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
* Copyright (c) 2012 by Delphix. All rights reserved.
* Copyright (c) 2013 by Delphix. All rights reserved.
*/

#include <sys/spa.h>
Expand Down Expand Up @@ -196,7 +196,12 @@ spa_config_write(spa_config_dirent_t *dp, nvlist_t *nvl)

/*
* Synchronize pool configuration to disk. This must be called with the
* namespace lock held.
* namespace lock held. Synchronizing the pool cache is typically done after
* the configuration has been synced to the MOS. This exposes a window where
* the MOS config will have been updated but the cache file has not. If
* the system were to crash at that instant then the cached config may not
* contain the correct information to open the pool and an explicity import
* would be required.
*/
void
spa_config_sync(spa_t *target, boolean_t removing, boolean_t postsysevent)
Expand Down
Loading

0 comments on commit 5d1f7fb

Please sign in to comment.