DAOS-9595 chk: consolidate pool membership #9611

Closed
wants to merge 1 commit into from

Contributor

@Nasf-Fan Nasf-Fan commented Jul 5, 2022

When DAOS check starts, all involved check engines report the
information about their known pools, including the pool service
replicas, pool label and related storage allocation, to the check
leader in their replies.

After the pool list consolidation in pass_1, the check leader sends,
for each pool, the related pool information to its pool service
leader via a new RPC, CHK_POOL_MBS.

On the check engine side, the pool service leader compares the pool
map with the information pushed from the check leader and handles
the following cases:

  1. A target has some allocated storage but does not appear in the
    pool map. In this case, the associated space is deleted from
    the engine by default.

  2. A target has some allocated storage and is marked as "DOWN" or
    "DOWNOUT" in the pool map. In this case, the administrator can
    decide whether to remove the storage or leave it in place.

  3. A target is referenced in the pool map ("NEW", "UP", "UPIN" or
    "DRAIN"), but no storage is actually allocated on this engine.
    In this case, the entry for the target in the pool map is
    marked as "DOWN" (for an "UP", "UPIN" or "DRAIN" entry) or
    "DOWNOUT" (for a "NEW" entry).

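The three cases above can be read as a small decision table keyed on the target's state in the pool map and on whether the engine reports storage for it. The following is only a sketch of that table: every name in it (the `demo_` enums and `demo_mbs_decide`) is hypothetical and does not come from the patch, and combinations the description does not mention are simply left untouched here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the pool map target states named above. */
enum demo_tgt_state {
	DEMO_TGT_ABSENT,	/* target has no entry in the pool map */
	DEMO_TGT_NEW,
	DEMO_TGT_UP,
	DEMO_TGT_UPIN,
	DEMO_TGT_DRAIN,
	DEMO_TGT_DOWN,
	DEMO_TGT_DOWNOUT,
};

/* Hypothetical repair actions, one per outcome in the description. */
enum demo_mbs_action {
	DEMO_ACT_NONE,		/* nothing to repair in this sketch */
	DEMO_ACT_DESTROY_SHARD,	/* case 1: orphan storage, deleted by default */
	DEMO_ACT_ASK_ADMIN,	/* case 2: admin decides to remove or keep */
	DEMO_ACT_MARK_DOWN,	/* case 3: UP/UPIN/DRAIN entry without storage */
	DEMO_ACT_MARK_DOWNOUT,	/* case 3: NEW entry without storage */
};

/* Map (pool map state, storage present) to the action for that case. */
enum demo_mbs_action
demo_mbs_decide(enum demo_tgt_state state, bool has_storage)
{
	if (has_storage) {
		if (state == DEMO_TGT_ABSENT)
			return DEMO_ACT_DESTROY_SHARD;
		if (state == DEMO_TGT_DOWN || state == DEMO_TGT_DOWNOUT)
			return DEMO_ACT_ASK_ADMIN;
		return DEMO_ACT_NONE;
	}

	switch (state) {
	case DEMO_TGT_NEW:
		return DEMO_ACT_MARK_DOWNOUT;
	case DEMO_TGT_UP:
	case DEMO_TGT_UPIN:
	case DEMO_TGT_DRAIN:
		return DEMO_ACT_MARK_DOWN;
	default:
		/* Combinations not covered by the description are left alone. */
		return DEMO_ACT_NONE;
	}
}

/* Usage: walk through the three cases from the description. */
void
demo_mbs_selfcheck(void)
{
	assert(demo_mbs_decide(DEMO_TGT_ABSENT, true) == DEMO_ACT_DESTROY_SHARD);
	assert(demo_mbs_decide(DEMO_TGT_DOWNOUT, true) == DEMO_ACT_ASK_ADMIN);
	assert(demo_mbs_decide(DEMO_TGT_UPIN, false) == DEMO_ACT_MARK_DOWN);
	assert(demo_mbs_decide(DEMO_TGT_NEW, false) == DEMO_ACT_MARK_DOWNOUT);
}
```
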
Temporarily skip the code format check for src/chk/chk_internal.h
and src/mgmt/rpc.h to avoid false-positive warnings.
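
For readers wondering why such warnings would be false positives: the style check "Macros with complex values should be enclosed in parentheses" targets expression-like macros, while the RPC field macros in these headers are lists that another macro (CRT_RPC_DECLARE) consumes. The sketch below illustrates the general problem with a plain X-macro. It is an analogy only: the `DEMO_` names are hypothetical, and it does not reproduce the real field-list syntax used by the headers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical field list in X-macro style. It is a list, not a value.
 * Wrapping the body in parentheses, as the style rule suggests, would make
 * the struct expansion below read "( int32_t shard_idx; ... )", which is
 * not valid C.
 */
#define DEMO_FIELDS(X)		\
	X(int32_t, shard_idx)	\
	X(uint32_t, padding)

/* First expansion: turn each list entry into a struct member. */
#define DEMO_MEMBER(type, name)	type name;
struct demo_in {
	DEMO_FIELDS(DEMO_MEMBER)
};

/* Second expansion: count the entries of the same list. */
#define DEMO_COUNT(type, name)	+1
enum { DEMO_NFIELDS = 0 DEMO_FIELDS(DEMO_COUNT) };

/* Usage: both expansions see the same two fields. */
void
demo_fields_selfcheck(void)
{
	struct demo_in in = { .shard_idx = -1, .padding = 0 };

	assert(DEMO_NFIELDS == 2);
	assert(in.shard_idx == -1 && in.padding == 0);
}
```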

Signed-off-by: Fan Yong [email protected]

Collaborator

@daosbuild1 daosbuild1 left a comment

* CHK_POOL_MBS:
* From check leader to check engine to notify the pool members.
*/
#define DAOS_ISEQ_CHK_POOL_MBS \

(style) Macros with complex values should be enclosed in parentheses

((d_string_t) (cpmi_label) CRT_VAR) \
((struct chk_pool_mbs) (cpmi_targets) CRT_ARRAY) \

#define DAOS_OSEQ_CHK_POOL_MBS \

(style) Macros with complex values should be enclosed in parentheses

@@ -413,6 +413,7 @@ pool_buf_attach(struct pool_buf *buf, struct pool_component *comps,
buf->pb_domain_nr++;

buf->pb_comps[nr] = comps[0];
buf->pb_comps[nr].co_flags&= ~PO_COMPF_CHK_DONE;

(style) spaces required around that '&=' (ctx:VxW)

@@ -413,6 +413,7 @@ pool_buf_attach(struct pool_buf *buf, struct pool_component *comps,
buf->pb_domain_nr++;

buf->pb_comps[nr] = comps[0];
buf->pb_comps[nr].co_flags&= ~PO_COMPF_CHK_DONE;

Suggested change
- buf->pb_comps[nr].co_flags&= ~PO_COMPF_CHK_DONE;
+ buf->pb_comps[nr].co_flags &= ~PO_COMPF_CHK_DONE;

@@ -214,4 +217,16 @@ CRT_RPC_DECLARE(mgmt_mark, DAOS_ISEQ_MGMT_MARK, DAOS_OSEQ_MGMT_MARK)
CRT_RPC_DECLARE(mgmt_get_bs_state, DAOS_ISEQ_MGMT_GET_BS_STATE,
DAOS_OSEQ_MGMT_GET_BS_STATE)

#define DAOS_ISEQ_MGMT_TGT_SHARD_DESTROY /* input fields */ \

(style) Macros with complex values should be enclosed in parentheses

((int32_t) (tsdi_shard_idx) CRT_VAR) \
((uint32_t) (tsdi_padding) CRT_VAR)

#define DAOS_OSEQ_MGMT_TGT_SHARD_DESTROY /* output fields */ \

(style) Macros with complex values should be enclosed in parentheses


@Nasf-Fan Nasf-Fan requested a review from a team as a July 5, 2022 11:12
@daosbuild1 daosbuild1 dismissed their stale review July 5, 2022 11:14

Updated patch

@daosbuild1 daosbuild1 left a comment

@@ -214,4 +217,16 @@ CRT_RPC_DECLARE(mgmt_mark, DAOS_ISEQ_MGMT_MARK, DAOS_OSEQ_MGMT_MARK)
CRT_RPC_DECLARE(mgmt_get_bs_state, DAOS_ISEQ_MGMT_GET_BS_STATE,
DAOS_OSEQ_MGMT_GET_BS_STATE)

#define DAOS_ISEQ_MGMT_TGT_SHARD_DESTROY /* input fields */ \

(style) Macros with complex values should be enclosed in parentheses

((int32_t) (tsdi_shard_idx) CRT_VAR) \
((uint32_t) (tsdi_padding) CRT_VAR)

#define DAOS_OSEQ_MGMT_TGT_SHARD_DESTROY /* output fields */ \

(style) Macros with complex values should be enclosed in parentheses


@Nasf-Fan Nasf-Fan requested a review from daosbuild1 July 5, 2022 11:14
@daosbuild1 daosbuild1 dismissed their stale review July 5, 2022 11:19

Updated patch

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@Nasf-Fan Nasf-Fan requested review from liw and gnailzenh and removed request for a team July 7, 2022 08:37
@jolivier23 jolivier23 added the CR Catastrophic Recovery Feature label Jul 7, 2022
Base automatically changed from DAOS-9599_1 to feature/cat_recovery July 11, 2022 12:45
@mjmac mjmac requested a review from a team as a code owner July 11, 2022 12:45
@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@Nasf-Fan Nasf-Fan requested review from liuxuezhao and removed request for a team July 11, 2022 13:36
};

/* Pool service */
struct pool_svc {
Contributor

Without a sufficiently strong reason, I don't think we should expose the internal pool_svc type. Could we revert these pool_svc changes and do the pool map manipulations in some new PS RPC(s) instead?

Contributor Author

@Nasf-Fan Nasf-Fan Aug 1, 2022

I considered the pool_svc use cases for a while but did not find a suitable solution. One important reason for exporting this structure is that all RDB-related APIs need the "pool_svc" parameter to start a TX, take locks, and so on. If we do not export "pool_svc", then the related CHK logic has to be moved into the pool/rsvc module(s), or we have to introduce some complex (and non-general) callback; either would make the logic messier than the current implementation. I do not think that is what we want.

"pool_svc" is used not only in this patch but also in subsequent patches. For example, on the engine of the PS leader, the CHK logic will do the following:

  1. Check whether the current engine is the PS leader. If yes, then
  2. Parse the reported information of all related pool shards one by one, compare it with the local pool map, and make the related repairs, such as refreshing the pool map or destroying the related pool shard; this may involve interaction with the admin via the CHK leader.
  3. Compare the pool label from MBS with the local property, and repair it if necessary.

In the above example, "pool_svc" is obtained in the 1st step and then used repeatedly in the 2nd and 3rd steps. If we do NOT export "pool_svc" in the 1st step, then both the 2nd and 3rd steps have to be wrapped into pool/rsvc, but they are quite different from the other general pool/rsvc logic, which would look messy and strange.
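
To make the three steps concrete, here is a rough sketch of the flow with stand-in types. Nothing in it is the real API: `struct demo_svc` is a hypothetical placeholder for the service handle (not the actual `struct pool_svc`), `struct demo_shard` is a hypothetical per-shard report, and the "repair" is reduced to a counter. It only shows the handle being obtained once in step 1 and reused by steps 2 and 3.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the service handle; NOT the real struct pool_svc. */
struct demo_svc {
	bool		 ds_is_leader;	/* is this engine the PS leader? */
	const char	*ds_label;	/* label stored in the local pool property */
	int		 ds_repairs;	/* repairs applied through this handle */
};

/* Hypothetical per-shard report pushed by the check leader. */
struct demo_shard {
	bool	dsh_in_map;		/* shard's target appears in the pool map */
	bool	dsh_has_storage;	/* engine reported storage for the shard */
};

/* Step 1: hand out the handle only if this engine is the PS leader. */
struct demo_svc *
demo_svc_lookup_leader(struct demo_svc *svc)
{
	return (svc != NULL && svc->ds_is_leader) ? svc : NULL;
}

/*
 * Steps 2 and 3 reuse the handle from step 1.
 * Returns -1 if this engine is not the leader, else the repair count.
 */
int
demo_handle_mbs(struct demo_svc *svc, const struct demo_shard *shards,
		size_t nr, const char *mbs_label)
{
	struct demo_svc	*leader = demo_svc_lookup_leader(svc);
	size_t		 i;

	if (leader == NULL)
		return -1;

	assert(shards != NULL || nr == 0);
	assert(mbs_label != NULL);

	/* Step 2: compare each reported shard with the local pool map. */
	for (i = 0; i < nr; i++) {
		if (shards[i].dsh_in_map != shards[i].dsh_has_storage)
			leader->ds_repairs++;
	}

	/* Step 3: compare the label from MBS with the local property. */
	if (strcmp(leader->ds_label, mbs_label) != 0) {
		leader->ds_label = mbs_label;
		leader->ds_repairs++;
	}

	return leader->ds_repairs;
}
```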

Contributor Author

Nasf-Fan commented Aug 2, 2022

It is replaced by the patch #9865

@Nasf-Fan Nasf-Fan closed this Aug 2, 2022
@Nasf-Fan Nasf-Fan deleted the DAOS-9595_1 branch August 2, 2022 02:30
Labels
CR Catastrophic Recovery Feature
4 participants