Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-9583 proto: DAOS check protocol changes #8941

Merged
merged 1 commit into from
May 16, 2022

Conversation

Nasf-Fan
Copy link
Contributor

@Nasf-Fan Nasf-Fan commented May 5, 2022

Contain the following changes:

  1. Replace 'bool repaired' as 'int32 result' for check report.
  2. Remove uselss 'CIS_CRASHED' status for check instance.
  3. Split 'CPS_STOPPED' from 'CPS_PAUSED' for pool status.
  4. New instance status: CIS_IMPLICATED, for the case that the
    check on the engine exit because of other engine failure.
  5. New pool status: CPS_IMPLICATED, for the case that the check
    on the pool is stopped because of other pool or engine failure.

Related changes for control plane part with go will be in another patch.

Signed-off-by: Fan Yong [email protected]

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# error This file was generated by an older version of protoc-c which is incompatible with your libprotobuf-c headers. Please regenerate this file with a newer version of protoc-c.
#endif


typedef struct _Chk__CheckReport Chk__CheckReport;
typedef struct Chk__CheckReport Chk__CheckReport;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(style) do not add new typedefs

@daosbuild1
Copy link
Collaborator

@Nasf-Fan Nasf-Fan requested a review from mjmac May 6, 2022 10:16
Comment on lines 135 to 136
// Inconsistency reports to be displayed
repeated chk.CheckReport reports = 7;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove this field, how do we show the inconsistency reports to the admin?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "repeated CheckQueryPool pools = 6;" and former fields contain the check instance status, pool status, inconsistency and time summary. Does it enough for "check query"? Do you mean that we need to return the detailed information for each found and repaired inconsistency? If yes, that may be huge when we move the scan phase to the object level in the future. On the other hand, we will report every inconsistency via "report" DRPC upcall when detect, and those information need to be (or can be) recorded by the control plane. Then if the admin wants to know related things, he/she can directly get them from control plane without querying the check engine.

// negative value if failed to repair.
// positive value is for CIA_IGNORE or dryrun mode.
// It is meaningless if the action is CIA_INTERACT.
int32 result = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this different from field #3, action? Seems like there is some redundancy here. If not, then it seems we should set the field type to an enum to constrain the set of possible result values, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"CheckInconsistAction action = 3" means how do we (try to) handle the inconsistency, for example, remove the dangling reference, or re-reference the orphan shard. The "int32 result = 4;" means the action result (succeed, or fail for some reason).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so is it a boolean value? If so, perhaps it should be changed to bool succeeded = 4;? If not, and there is a closed set of possible values, then it should probably be an enum, right? If it is an open-ended set of possible values, how do we know what should be displayed to the user?

@Nasf-Fan Nasf-Fan force-pushed the DAOS-9583_proto branch from 4d1b335 to 692a6ad Compare May 7, 2022 09:43
@daosbuild1 daosbuild1 dismissed their stale review May 7, 2022 09:46

Updated patch

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

};
static const ProtobufCIntRange chk__check_inst_status__value_ranges[] = {
{0, 0},{0, 7}
{0, 0},{0, 6}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(style) space required after that ',' (ctx:VxV)

};
static const ProtobufCIntRange chk__check_pool_status__value_ranges[] = {
{0, 0},{0, 6}
{0, 0},{0, 7}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(style) space required after that ',' (ctx:VxV)

# error This file was generated by an older version of protoc-c which is incompatible with your libprotobuf-c headers. Please regenerate this file with a newer version of protoc-c.
#endif


typedef struct _Chk__CheckReport Chk__CheckReport;
typedef struct Chk__CheckReport Chk__CheckReport;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(style) do not add new typedefs

CHK__CHECK_POOL_STATUS__CPS_PENDING = 5
CHK__CHECK_POOL_STATUS__CPS_PENDING = 5,
/*
* DAOS check on the pool has been stoppped explicitly.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* DAOS check on the pool has been stoppped explicitly.
* DAOS check on the pool has been stopped explicitly.

CPS_PENDING = 5; // Waiting for the decision from the admin.
CPS_STOPPED = 6; // DAOS check on the pool has been stoppped explicitly.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
CPS_STOPPED = 6; // DAOS check on the pool has been stoppped explicitly.
CPS_STOPPED = 6; // DAOS check on the pool has been stopped explicitly.

@daosbuild1
Copy link
Collaborator

@Nasf-Fan Nasf-Fan force-pushed the DAOS-9583_proto branch from 692a6ad to d526e0c Compare May 9, 2022 12:53
@daosbuild1 daosbuild1 dismissed their stale review May 9, 2022 12:56

Updated patch

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

};
static const ProtobufCIntRange chk__check_pool_status__value_ranges[] = {
{0, 0},{0, 6}
{0, 0},{0, 8}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(style) space required after that ',' (ctx:VxV)

# error This file was generated by an older version of protoc-c which is incompatible with your libprotobuf-c headers. Please regenerate this file with a newer version of protoc-c.
#endif


typedef struct _Chk__CheckReport Chk__CheckReport;
typedef struct Chk__CheckReport Chk__CheckReport;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(style) do not add new typedefs

@daosbuild1
Copy link
Collaborator

Contain the following changes:

1. Replace 'bool repaired' as 'int32 result' for check report.
2. Remove uselss 'CIS_CRASHED' status for check instance.
3. Split 'CPS_STOPPED' from 'CPS_PAUSED' for pool status.
4. New instance status: CIS_IMPLICATED, for the case that the
   check on the engine exit because of other engine failure.
5. New pool status: CPS_IMPLICATED, for the case that the check
   on the pool is stopped because of other pool or engine failure.

Related changes for control plane part with go will be in another patch.

Signed-off-by: Fan Yong <[email protected]>
@Nasf-Fan
Copy link
Contributor Author

@mjmac , would you please review the patch and consider adjust control part related logic? Thanks!

@mjmac
Copy link
Contributor

mjmac commented May 12, 2022

@mjmac , would you please review the patch and consider adjust control part related logic? Thanks!

I still don't understand why you want to remove the reports array. I need them for the MS -> dmg part of the query response in order to show the admin which reports need a decision to be made. You can play with this on the branch with the following process:

  1. create a server with at least 1 engine joined
  2. dmg system stop to stop all ranks
  3. dmg system start --check to enable checker mode (this will be improved)
  4. run dmg faults add-checker-report -c CIC_POOL_BAD_LABEL 3 times to generate 3 synthetic inconsistency reports that are stored in the MS
  5. run dmg check query

You should see something like the following output:

$ dmg check query
Current phase: CSP_PREPARE
Inconsistency Reports:
  0xada4bb2122e2b19d CIC_POOL_BAD_LABEL: The pool label for 86218043-105a-4c95-a03b-86c0ae73efaf does not match MS
  Potential resolution actions:
    5: Trust the MS pool entry (ms-label) for 86218043-105a-4c95-a03b-86c0ae73efaf
    6: Trust the PS pool entry (ps-label) for 86218043-105a-4c95-a03b-86c0ae73efaf
    2: Ignore the pool finding

  0xc3f8bb01234a9e55 CIC_POOL_BAD_LABEL: The pool label for ea1a46e3-27f8-4a94-aad8-d003f70e2844 does not match MS
  Potential resolution actions:
    5: Trust the MS pool entry (ms-label) for ea1a46e3-27f8-4a94-aad8-d003f70e2844
    6: Trust the PS pool entry (ps-label) for ea1a46e3-27f8-4a94-aad8-d003f70e2844
    2: Ignore the pool finding

In this example output, the "ms-label" and "ps-label" are just made up strings. In reality, these would be provided in the details array of the report and would be filled in by the engine checker module that is filing the inconsistency report.

If we remove the "useless" reports array, then I will have no way to display these reports to the admin. Unless you have some other idea in mind.

@Nasf-Fan
Copy link
Contributor Author

Nasf-Fan commented May 12, 2022

@mjmac , would you please review the patch and consider adjust control part related logic? Thanks!

I still don't understand why you want to remove the reports array. I need them for the MS -> dmg part of the query response in order to show the admin which reports need a decision to be made. You can play with this on the branch with the following process:

  1. create a server with at least 1 engine joined
  2. dmg system stop to stop all ranks
  3. dmg system start --check to enable checker mode (this will be improved)
  4. run dmg faults add-checker-report -c CIC_POOL_BAD_LABEL 3 times to generate 3 synthetic inconsistency reports that are stored in the MS
  5. run dmg check query

You should see something like the following output:

$ dmg check query
Current phase: CSP_PREPARE
Inconsistency Reports:
  0xada4bb2122e2b19d CIC_POOL_BAD_LABEL: The pool label for 86218043-105a-4c95-a03b-86c0ae73efaf does not match MS
  Potential resolution actions:
    5: Trust the MS pool entry (ms-label) for 86218043-105a-4c95-a03b-86c0ae73efaf
    6: Trust the PS pool entry (ps-label) for 86218043-105a-4c95-a03b-86c0ae73efaf
    2: Ignore the pool finding

  0xc3f8bb01234a9e55 CIC_POOL_BAD_LABEL: The pool label for ea1a46e3-27f8-4a94-aad8-d003f70e2844 does not match MS
  Potential resolution actions:
    5: Trust the MS pool entry (ms-label) for ea1a46e3-27f8-4a94-aad8-d003f70e2844
    6: Trust the PS pool entry (ps-label) for ea1a46e3-27f8-4a94-aad8-d003f70e2844
    2: Ignore the pool finding

In this example output, the "ms-label" and "ps-label" are just made up strings. In reality, these would be provided in the details array of the report and would be filled in by the engine checker module that is filing the inconsistency report.

If we remove the "useless" reports array, then I will have no way to display these reports to the admin. Unless you have some other idea in mind.

Once "dmg system start --check" is triggered, the DAOS check instance will run on all related engines. And when some insosnsitency is detected, related check engine will report such inconsistency to the check leader via "CHK_REPORT" RPC. And then the check leader will forward such inconsistency report to the control plane via "dss_drpc_call(DRPC_METHOD_CHK_REPORT...)". If related check engine does not know how to repair such inconsistency, then the DRPC upcall will also contains the necessary information for the interaction with admin. So the control plane can record the report things locally and show them to the admin when "dmg check query" is triggered. Does it works?

On the other hand, when "dmg check query" is triggered, the control plane will downcall the check leader, then check leader will reply some summary information about the check: the check instance status (running, failed, paused, and so on), related pool(s) status (checked, pending, checked, failed, and so on), how many inconsistency are found/repaired/ignored, estimated remaining time, and so on. But the former report (via DRPC upcall) detailed inconsistency information will NOT be replied again by the check leader since the control plane has already known them.

@mjmac
Copy link
Contributor

mjmac commented May 12, 2022

If related check engine does not know how to repair such inconsistency, then the DRPC upcall will also contains the necessary information for the interaction with admin. So the control plane can record the report things locally and show them to the admin when "dmg check query" is triggered. Does it works?

Yes, that is how the current design works. As the check engine finds inconsistencies, they are sent to the control plane Management Service (MS). The MS stores the reports in its database. In the case of inconsistencies that require input from the admin (CIA_INTERACT), the default behavior for dmg check query is that it only shows those reports in the format I proposed above. The admin may then choose an action for each report, e.g. dmg check repair 0xc3f8bb01234a9e55 5 to choose "Trust the MS entry" for an inconsistent label finding.

As an aside, I know that the choice numbers should be fixed (e.g. 1, 2, 3) -- I just need to create a map from the UI choices to the action enum values. The report seqnum is kind of ugly as an identifier, but I'm not sure if there is a better way to handle it, as the ID space is uint64.

On the other hand, when "dmg check query" is triggered, the control plane will downcall the check leader, then check leader will reply some summary information about the check: the check instance status (running, failed, paused, and so on), related pool(s) status (checked, pending, checked, failed, and so on), how many inconsistency are found/repaired/ignored, estimated remaining time, and so on. But the former report (via DRPC upcall) detailed inconsistency information will NOT be replied again by the check leader since the control plane has already known them.

Yes, I understand. I do not expect the check engine to send all reports in response to the query dRPC. As you said, the MS is responsible for holding all of the reports and will determine which should be sent in the response to dmg.

To be clear, I am intending that we use the CheckQueryResp message in two places:

  1. When the control plane calls DRPC_METHOD_MGMT_CHK_QUERY on the checker engine, it receives CheckQueryResp with summary information about the state of the checker. This does not include anything in the reports array.
  2. When dmg check query is run, the MS calls DRPC_METHOD_MGMT_CHK_QUERY on the checker engine and receives CheckQueryResp. Then the MS fills in the reports array before sending the response to dmg. This way, we can show the admin information about the reports that have been sent by the checker. My plan is to only show CIA_INTERACT reports by default, so that the admin can quickly learn which inconsistencies are waiting for a decision. We can add future capabilities to return all reports or otherwise do some filtering on the returned reports so that the admin can choose what they want to see.

Does that make sense?

@Nasf-Fan
Copy link
Contributor Author

If related check engine does not know how to repair such inconsistency, then the DRPC upcall will also contains the necessary information for the interaction with admin. So the control plane can record the report things locally and show them to the admin when "dmg check query" is triggered. Does it works?

Yes, that is how the current design works. As the check engine finds inconsistencies, they are sent to the control plane Management Service (MS). The MS stores the reports in its database. In the case of inconsistencies that require input from the admin (CIA_INTERACT), the default behavior for dmg check query is that it only shows those reports in the format I proposed above. The admin may then choose an action for each report, e.g. dmg check repair 0xc3f8bb01234a9e55 5 to choose "Trust the MS entry" for an inconsistent label finding.

That may work, but I am not sure whether it is efficient enough or not. Because if the admin does not perform "dmg check query", they will not know and will not be notified that there are some interaction waiting for his/her feedback, that may cause related check engine(s) to be blocked for very long time.

A possible enhancement is that when check leader report some inconsistency with interaction requirement, the control plane can print related interaction message on the console, so the admin can easily to see there are something to be handled by him/her. And then the admin can use "dmg check query" to show more information as you described above and make the choices for related interaction reports.

As an aside, I know that the choice numbers should be fixed (e.g. 1, 2, 3) -- I just need to create a map from the UI choices to the action enum values. The report seqnum is kind of ugly as an identifier, but I'm not sure if there is a better way to handle it, as the ID space is uint64.

The 64-bits sequence number is the unique identifier used by the check engine to locate the found inconsistency. It is generated by the check leader engine and pass to the control plan via check report DRPC upcall, and when control plane triggers "DRPC_METHOD_MGMT_CHK_ACT" RPC for "dmg check act" after got choice from admin for some interaction, it needs to take related sequence number to the check leader.

On the other hand, such 64-bits sequence number is between control plane and check leader engine. It is NOT necessary to show it to the admin if you think it is not friendly for admin to understand. The control plane can make some map between such sequence number and something shown to the admin. But that will introduce more work for your side.

On the other hand, when "dmg check query" is triggered, the control plane will downcall the check leader, then check leader will reply some summary information about the check: the check instance status (running, failed, paused, and so on), related pool(s) status (checked, pending, checked, failed, and so on), how many inconsistency are found/repaired/ignored, estimated remaining time, and so on. But the former report (via DRPC upcall) detailed inconsistency information will NOT be replied again by the check leader since the control plane has already known them.

Yes, I understand. I do not expect the check engine to send all reports in response to the query dRPC. As you said, the MS is responsible for holding all of the reports and will determine which should be sent in the response to dmg.

To be clear, I am intending that we use the CheckQueryResp message in two places:

  1. When the control plane calls DRPC_METHOD_MGMT_CHK_QUERY on the checker engine, it receives CheckQueryResp with summary information about the state of the checker. This does not include anything in the reports array.

Yes.

  1. When dmg check query is run, the MS calls DRPC_METHOD_MGMT_CHK_QUERY on the checker engine and receives CheckQueryResp. Then the MS fills in the reports array before sending the response to dmg. This way, we can show the admin information about the reports that have been sent by the checker. My plan is to only show CIA_INTERACT reports by default, so that the admin can quickly learn which inconsistencies are waiting for a decision. We can add future capabilities to return all reports or otherwise do some filtering on the returned reports so that the admin can choose what they want to see.

Does that make sense?

I think we can support more options to control the output to the admin for "dmg check query". For example:

  1. "dmg check query" only shows to the admin with the summary information from DRPC_METHOD_MGMT_CHK_QUERY RPC reply.
  2. "dmg check query -d" will additional shows the reports that need interaction with the admin.
  3. "dmg check query -dd" will additional shows all the reported inconsistencies.

And in the future, we may add more options to allow only show/filter specified inconsistencies reports. Anyway, that is the long-time plan.

@mjmac
Copy link
Contributor

mjmac commented May 13, 2022

@Nasf-Fan: I think we are mostly in agreement now. Can we agree to disagree on the reports array for now and leave it alone? If we really don't need it, we can remove it later. For the moment, though, I would prefer to keep it so that my current implementation doesn't break. It should have no effect on your work. The engine does not need to concern itself with that field in the message.

@Nasf-Fan
Copy link
Contributor Author

@Nasf-Fan: I think we are mostly in agreement now. Can we agree to disagree on the reports array for now and leave it alone? If we really don't need it, we can remove it later. For the moment, though, I would prefer to keep it so that my current implementation doesn't break. It should have no effect on your work. The engine does not need to concern itself with that field in the message.

Sure, we can keep it, but the CHK_QUREY RPC reply will contains empty "reports". I will refresh my patch to only contains the other new added elements.

@Nasf-Fan Nasf-Fan requested a review from mjmac May 14, 2022 17:39
@daosbuild1 daosbuild1 dismissed their stale review May 14, 2022 17:42

Updated patch

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@mjmac mjmac merged commit 15f29e5 into feature/cat_recovery May 16, 2022
@mjmac mjmac deleted the DAOS-9583_proto branch May 16, 2022 00:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants