-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-9595 chk: consolidate pool membership #9865
Conversation
Bug-tracker data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
009ab53
to
7f0aa42
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
7f0aa42
to
40b3869
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
40b3869
to
f9b9b36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
f9b9b36
to
d5fca4d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
d5fca4d
to
3594b5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
3594b5d
to
b00ef00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
b00ef00
to
93b2031
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
When DAOS check start, all involved check engines will report their known pools' information, including the pool service replicas, pool label and related storage allocation, to the check leader via reply. After the pool list consolidation in the pass_1, for each pool, the check leader will send related pool information to its pool service leaders via new RPC - CHK_POOL_MBS. On the check engine side, the pool service leader compares the pool map with these information pushed from the check leader and handles the following cases: 1. An target has some allocated storage but does not appear in the pool map. Under such case, the associated space will be deleted from the engine by default. 2. An target has some allocated storage and is marked as "DOWN" or "DOWNOUT" in the pool map. For this case, the administrator can decide to either remove or leave it there. 3. An target is referenced in the pool map ("NEW", "UP", "UPIN" or "DRAIN"), but no storage is actually allocated on this engine. Under such case, the entry for the target in the pool map will be marked as "DOWN" (for the "UP", "UPIN" or "DRAIN" entry) or "DOWNOUT" (for the "NEW" entry). Temporarily skip code format check against src/chk/chk_internal.h and src/mgmt/rpc.h to avoid fake warning messages. Signed-off-by: Fan Yong <[email protected]>
93b2031
to
c906da6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
src/pool/srv_pool.c
Outdated
} | ||
|
||
int | ||
ds_pool_svc_flush_map(struct ds_pool_svc *ds_svc, struct pool_map *map, uint32_t version) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question] Is it intentional that we do not schedule rebuild jobs when updating the pool map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pool map update is driven by the DAOS check instead of regular rebuild. The logic is something like that:
For each target that reported as part of the pool, compare with the pool map and fix related inconsistency; and then handle those non-accessed (in former comparison) pool map entries. During these process, there will be yield because of RPC or interaction with admin. All related pool map fixes are in DRAM in this step. After all done, it will call ds_pool_svc_flush_map() to persistently change the pool map and broadcast the changes to other pool shards.
src/pool/srv_pool.c
Outdated
/* | ||
* Have toresign to avoid handling future requests with stale pool map cache. | ||
* Continue to distribute the new pool map to other pool shards since the RDB | ||
* has already been updated. | ||
*/ | ||
rdb_resign(svc->ps_rsvc.s_db, svc->ps_rsvc.s_term); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failing to update the local map implies that the local secondary group for this pool may not have the latest membership. In this state, it's simpler to just give up, instead of continuing to take some chance. When a new leader steps up, it will distribute the new pool map, schedule rebuild jobs, (and in the future also do what replace_failed_replicas does).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under DAOS check mode, in spite of PS leader itself is down or some other is down during checking the pool membership, then the DAOS check for this pool will be marked as aborted.
On the other hand, if the PS leader switches to other engine because of short time network split without engine down, then it will not cause DAOS check to be failed even if both the old PS leader and the new PS leader do DAOS pool membership check in parallel. It may cause some redundant check, but not error.
So here, the current PS leader must has the latest membership. It has already updated the pool map in RDB, but failed to refresh the pool map (attached to the pool instance) in-DRAM. If we give up, means the DAOS check for this pool will be failed, but if we try to distributed the update to other engines, we may have chance to continue the DAOS check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest of the PS code assumes that the local secondary group always reflects the latest pool map. At least, I'd suggest skipping the following the map_dist and reconfigure calls in this case to avoid adding burden to non-chk code. Also, do you really intend to return the nonzero rc to chk like this patch does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, if do not have others to change, I prefer to adjust it in the subsequent patch #9867 which will be rebased after landing this one.
As for the return value, I think it is better to return it to the caller. It is the caller's duty to determine the next step. For CHK case, if do not specify "failout", then will go ahead.
How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fine with me.
if (file == NULL) { | ||
D_ERROR(DF_UUIDF": failed to allocate file name for shards status %d\n", | ||
DP_UUID(uuid), i); | ||
D_GOTO(out_path, rc = -DER_NOMEM); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we leak clue->pc_tgt_status on the error paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I will refresh the patch to fix it.
X(MGMT_TGT_SHARD_DESTROY, \ | ||
0, &CQF_mgmt_tgt_shard_destroy, \ | ||
ds_mgmt_hdlr_tgt_shard_destroy, NULL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, do we need to bump DAOS_MGMT_VERSION above for this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, do we need to add support for proto query like we did with object and pool RPCs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, it seems unnecessary because we do not support interoperation among servers.
src/chk/chk_leader.c
Outdated
|
||
D_ALLOC_ARRAY(cpr->cpr_mbs, cpr->cpr_shard_nr); | ||
if (cpr->cpr_mbs == NULL) | ||
D_GOTO(out, rc = -DER_NOMEM); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On this error path we'll finalize an rsvc_client that has not been initialized. Could we avoid doing this even if it might work at the moment, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will refresh the patch.
src/chk/chk_engine.c
Outdated
* Flush the pool map to persistent storage (if not under dryrun mode) | ||
* and distribute the pool map to other pool shards. | ||
*/ | ||
rc1 = ds_pool_svc_flush_map(svc, map, version); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Question] The map object already includes the version; the version parameter is unnecessary, isn't it? On the other hand, do you think we should pass a version for ds_pool_svc_flush_map to check before writing the new map? For instance,
read the map: version x
change the map
if the old map is not version x
try reading and changing again
write the new map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The map object already includes the version; the version parameter is unnecessary, isn't it?
Right, I will drop the redundant parameter.
On the other hand, do you think we should pass a version for ds_pool_svc_flush_map to check before writing the new map? For instance,
Under check mode, we disabled node eviction and do not allow reintegration. Means that there will be no pool map refresh except the DAOS check logic itself update the pool map. So it is unnecessary to re-check the pool map version before ds_pool_svc_flush_map(). On the other hand, if the ds_pool_svc_flush_map() is called under non-check mode, then the sponsor needs to do as you described.
c906da6
to
cc381a0
Compare
src/common/pool_map.c
Outdated
pool_map_bump_version(struct pool_map *map) | ||
{ | ||
map->po_version++; | ||
D_DEBUG(DB_TRACE, "Bumb pool map to version %u\n", map->po_version); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D_DEBUG(DB_TRACE, "Bumb pool map to version %u\n", map->po_version); | |
D_DEBUG(DB_TRACE, "Bump pool map to version %u\n", map->po_version); |
cc381a0
to
3d9c88a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
3d9c88a
to
929dccd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9865/14/execution/node/1046/log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question about proto query. If we have a new version for mgmt RPC, do we need to add query?
929dccd
to
b668567
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9865/15/execution/node/320/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9865/15/execution/node/364/log |
Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9865/15/execution/node/323/log |
Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9865/15/execution/node/342/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-9865/15/execution/node/328/log |
b668567
to
dd5b38f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
dd5b38f
to
d096386
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
@liw @liuxuezhao @jolivier23 , would you please to help review the patch? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a proto query for the MGMT RPC change?
@mjmac , would you please to help to hand this one? Then I can rebase the subsequent, thanks! |
Currently, it seems unnecessary because we do not support interoperation among servers. |
When DAOS check start, all involved check engines will report their
known pools' information, including the pool service replicas, pool
label and related storage allocation, to the check leader via reply.
After the pool list consolidation in the pass_1, for each pool, the
check leader will send related pool information to its pool service
leaders via new RPC - CHK_POOL_MBS.
On the check engine side, the pool service leader compares the pool
map with these information pushed from the check leader and handles
the following cases:
An target has some allocated storage but does not appear in the
pool map. Under such case, the associated space will be deleted
from the engine by default.
An target has some allocated storage and is marked as "DOWN" or
"DOWNOUT" in the pool map. For this case, the administrator can
decide to either remove or leave it there.
An target is referenced in the pool map ("NEW", "UP", "UPIN" or
"DRAIN"), but no storage is actually allocated on this engine.
Under such case, the entry for the target in the pool map will
be marked as "DOWN" (for the "UP", "UPIN" or "DRAIN" entry) or
"DOWNOUT" (for the "NEW" entry).
Temporarily skip code format check against src/chk/chk_internal.h
and src/mgmt/rpc.h to avoid fake warning messages.
Signed-off-by: Fan Yong [email protected]