DAOS-10622 pool: Fix ds_pool_get_version for NULL sp_map
Makito and Samir observed the following assertion failure after
restarting engines:

  #0  raise () from /lib64/libc.so.6
  #1  abort () from /lib64/libc.so.6
  #2  __assert_fail_base () from /lib64/libc.so.6
  #3  __assert_fail () from /lib64/libc.so.6
  #4  pool_map_get_version (map=0x0) at src/common/pool_map.c:2852
  #5  ds_pool_get_version (pool=0x7f0ca063c690, pool=0x7f0ca063c690) at src/include/daos_srv/pool.h:296
  #6  pc=rpc@entry=0x7f0ca0998d30, p_rpt=p_rpt@entry=0x7f0ca83a77b0) at src/rebuild/srv.c:2101
  #7  rebuild_tgt_scan_handler (rpc=0x7f0ca0998d30) at src/rebuild/scan.c:954
  #8  crt_handle_rpc (arg=0x7f0ca0998d30) at src/cart/crt_rpc.c:1654
  #9  ABTD_ythread_func_wrapper (p_arg=0x7f0ca83a78a0) at arch/abtd_ythread.c:21
  #10 make_fcontext () from /usr/lib64/libabt.so.1
  #11 ?? ()

The ds_pool_get_version call passed a NULL map argument to
pool_map_get_version. The ds_pool.sp_map field may be NULL after the
pool is started but before the pool receives the initial pool map from
the pool service.

This patch fixes ds_pool_get_version to return 0, which is less than
all valid pool map versions, when sp_map is NULL. Rebuild then retries
until the pool map is available, like this:

  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [failed] (pool 3bf68c9c ver=2 status=DER_BUSY(-1012): 'Device or resource busy')
  Rebuild [queued] (pool=3bf68c9c ver=2) tgts=2
  Rebuild [started] (pool 3bf68c9c ver=2)
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [scanning] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0, [...]
  Rebuild [completed] (pool 3bf68c9c ver=2, toberb_obj=0, rb_obj=0,[...]
  Target[2] (rank 2 idx 0 status 16 ver 1) is excluded.

Also, this patch removes some rebuild code that handles NULL
ds_pool.sp_group fields. Those cases cannot happen, as we always
initialize sp_group (as well as sp_iv_ns) before putting a ds_pool
object into the LRU.

Signed-off-by: Li Wei <[email protected]>
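
For illustration only, the NULL guard described in this message might look roughly like the sketch below. It is not the actual patch: struct pool_map, struct ds_pool, and the po_version field are simplified, hypothetical stand-ins for the real DAOS definitions, and the assert in pool_map_get_version only mimics the assertion seen in frame #4 of the backtrace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, simplified stand-ins for the DAOS types. The real
 * structures carry many more fields; only what the sketch needs is
 * kept here, and po_version is an assumed field name.
 */
struct pool_map {
	uint32_t po_version;
};

struct ds_pool {
	/* NULL until the initial pool map arrives from the pool service. */
	struct pool_map *sp_map;
};

/* Stand-in accessor: asserts on a NULL map, mimicking the reported failure. */
static inline uint32_t
pool_map_get_version(struct pool_map *map)
{
	assert(map != NULL);
	return map->po_version;
}

/*
 * Return the pool map version, or 0 when no map has been received yet.
 * Per the commit message, 0 is less than every valid pool map version,
 * so callers comparing versions treat the pool as not yet up to date.
 */
static inline uint32_t
ds_pool_get_version(struct ds_pool *pool)
{
	uint32_t ver = 0;

	if (pool->sp_map != NULL)
		ver = pool_map_get_version(pool->sp_map);

	return ver;
}
```

With this guard, a caller that sees version 0 can fail the request and let it be retried (the DER_BUSY retries in the log above) instead of tripping the assertion.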