DAOS-10622 pool: Fix ds_pool_get_version for NULL sp_map (#9277)
Makito and Samir observed the following assertion failure after
restarting engines.

  #0  raise () from /lib64/libc.so.6
  #1  abort () from /lib64/libc.so.6
  #2  __assert_fail_base () from /lib64/libc.so.6
  #3  __assert_fail () from /lib64/libc.so.6
  #4  pool_map_get_version (map=0x0) at src/common/pool_map.c:2852
  #5  ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at
      src/include/daos_srv/pool.h:296
  #6  pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at
      src/rebuild/srv.c:2101
  #7  rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at
      src/rebuild/scan.c:954
  #8  crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654
  #9  ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at
      arch/abtd_ythread.c:21
  #10 make_fcontext () from /usr/lib64/libabt.so.1
  #11 ?? ()

The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service. This patch fixes ds_pool_get_version to return 0,
which is less than all valid pool map versions, when sp_map is NULL,
resulting in rebuild retries like this:

  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device
    or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...]
  Target[2] (rank 2 idx 0 status 16 ver 1) is excluded.

Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. That case cannot happen, as we always
initialize sp_group (as well as sp_iv_ns) before putting a ds_pool
object into the LRU.

Signed-off-by: Li Wei <[email protected]>
liw authored Jul 19, 2022
1 parent e275b48 commit 535febf
Showing 2 changed files with 5 additions and 14 deletions.
5 changes: 3 additions & 2 deletions src/include/daos_srv/pool.h
@@ -297,10 +297,11 @@ ds_pool_rf_verify(struct ds_pool *pool, uint32_t last_ver, uint32_t rlvl, uint32
 static inline uint32_t
 ds_pool_get_version(struct ds_pool *pool)
 {
-	uint32_t ver;
+	uint32_t ver = 0;

 	ABT_rwlock_rdlock(pool->sp_lock);
-	ver = pool_map_get_version(pool->sp_map);
+	if (pool->sp_map != NULL)
+		ver = pool_map_get_version(pool->sp_map);
 	ABT_rwlock_unlock(pool->sp_lock);

 	return ver;
14 changes: 2 additions & 12 deletions src/rebuild/srv.c
@@ -2111,19 +2111,9 @@ rebuild_tgt_prepare(crt_rpc_t *rpc, struct rebuild_tgt_pool_tracker **p_rpt)
 		D_GOTO(out, rc = -DER_BUSY);
 	}

-	if (pool->sp_group == NULL) {
-		char id[DAOS_UUID_STR_SIZE];
-
-		uuid_unparse_lower(pool->sp_uuid, id);
-		pool->sp_group = crt_group_lookup(id);
-		if (pool->sp_group == NULL) {
-			D_ERROR(DF_UUID": pool group not found\n",
-				DP_UUID(pool->sp_uuid));
-			D_GOTO(out, rc = -DER_INVAL);
-		}
-	}
-
+	D_ASSERT(pool->sp_group != NULL);
+	D_ASSERT(pool->sp_iv_ns != NULL);

 	/* Let's invalidate local snapshot cache before
 	 * rebuild, so to make sure rebuild will use the updated
 	 * snapshot during rebuild fetch, otherwise it may cause
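
The commit message says that version 0 is below every valid pool map version and that this leads to rebuild retries. A rough sketch of that reasoning follows. The exact comparison in rebuild_tgt_prepare is not shown in this diff, so the `local_ver < rebuild_ver` test, the retry loop, and all `toy_` names are assumptions made for illustration; only the -1012 value comes from the DER_BUSY status in the log above.

```c
#include <stdint.h>

#define TOY_DER_BUSY 1012	/* mirrors the DER_BUSY(-1012) status in the log */

/*
 * Hypothetical check: refuse to prepare a rebuild while the local pool map
 * version is older than the version the rebuild was scheduled for.
 */
static int
toy_rebuild_prepare(uint32_t local_ver, uint32_t rebuild_ver)
{
	if (local_ver < rebuild_ver)
		return -TOY_DER_BUSY;
	return 0;
}

/*
 * Hypothetical retry loop: local_vers[i] is the version the target reports
 * on attempt i. Returns the number of attempts used, or -TOY_DER_BUSY if
 * the target is still behind after n attempts.
 */
static int
toy_rebuild_attempts(const uint32_t *local_vers, int n, uint32_t rebuild_ver)
{
	int i;

	for (i = 0; i < n; i++) {
		if (toy_rebuild_prepare(local_vers[i], rebuild_ver) == 0)
			return i + 1;
	}
	return -TOY_DER_BUSY;
}
```

With reported versions 0, 0, 2 and a rebuild at version 2, the first two attempts are rejected as busy and the third proceeds, which has the same shape as the queued/started/failed sequence in the log.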