-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-9583 proto: DAOS check protocol changes #8941
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8941/1/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules
# error This file was generated by an older version of protoc-c which is incompatible with your libprotobuf-c headers. Please regenerate this file with a newer version of protoc-c. | ||
#endif | ||
|
||
|
||
typedef struct _Chk__CheckReport Chk__CheckReport; | ||
typedef struct Chk__CheckReport Chk__CheckReport; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(style) do not add new typedefs
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8941/1/execution/node/141/log |
src/proto/mgmt/check.proto
Outdated
// Inconsistency reports to be displayed | ||
repeated chk.CheckReport reports = 7; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove this field, how do we show the inconsistency reports to the admin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "repeated CheckQueryPool pools = 6;" and former fields contain the check instance status, pool status, inconsistency and time summary. Does it enough for "check query"? Do you mean that we need to return the detailed information for each found and repaired inconsistency? If yes, that may be huge when we move the scan phase to the object level in the future. On the other hand, we will report every inconsistency via "report" DRPC upcall when detect, and those information need to be (or can be) recorded by the control plane. Then if the admin wants to know related things, he/she can directly get them from control plane without querying the check engine.
// negative value if failed to repair. | ||
// positive value is for CIA_IGNORE or dryrun mode. | ||
// It is meaningless if the action is CIA_INTERACT. | ||
int32 result = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different from field #3, action
? Seems like there is some redundancy here. If not, then it seems we should set the field type to an enum to constrain the set of possible result values, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"CheckInconsistAction action = 3" means how do we (try to) handle the inconsistency, for example, remove the dangling reference, or re-reference the orphan shard. The "int32 result = 4;" means the action result (succeed, or fail for some reason).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so is it a boolean value? If so, perhaps it should be changed to bool succeeded = 4;
? If not, and there is a closed set of possible values, then it should probably be an enum, right? If it is an open-ended set of possible values, how do we know what should be displayed to the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8941/2/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules
src/chk/chk.pb-c.c
Outdated
}; | ||
static const ProtobufCIntRange chk__check_inst_status__value_ranges[] = { | ||
{0, 0},{0, 7} | ||
{0, 0},{0, 6} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(style) space required after that ',' (ctx:VxV)
src/chk/chk.pb-c.c
Outdated
}; | ||
static const ProtobufCIntRange chk__check_pool_status__value_ranges[] = { | ||
{0, 0},{0, 6} | ||
{0, 0},{0, 7} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(style) space required after that ',' (ctx:VxV)
# error This file was generated by an older version of protoc-c which is incompatible with your libprotobuf-c headers. Please regenerate this file with a newer version of protoc-c. | ||
#endif | ||
|
||
|
||
typedef struct _Chk__CheckReport Chk__CheckReport; | ||
typedef struct Chk__CheckReport Chk__CheckReport; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(style) do not add new typedefs
src/chk/chk.pb-c.h
Outdated
CHK__CHECK_POOL_STATUS__CPS_PENDING = 5 | ||
CHK__CHECK_POOL_STATUS__CPS_PENDING = 5, | ||
/* | ||
* DAOS check on the pool has been stoppped explicitly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* DAOS check on the pool has been stoppped explicitly. | |
* DAOS check on the pool has been stopped explicitly. |
src/proto/chk/chk.proto
Outdated
CPS_PENDING = 5; // Waiting for the decision from the admin. | ||
CPS_STOPPED = 6; // DAOS check on the pool has been stoppped explicitly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CPS_STOPPED = 6; // DAOS check on the pool has been stoppped explicitly. | |
CPS_STOPPED = 6; // DAOS check on the pool has been stopped explicitly. |
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8941/2/execution/node/141/log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-8941/3/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules
}; | ||
static const ProtobufCIntRange chk__check_pool_status__value_ranges[] = { | ||
{0, 0},{0, 6} | ||
{0, 0},{0, 8} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(style) space required after that ',' (ctx:VxV)
# error This file was generated by an older version of protoc-c which is incompatible with your libprotobuf-c headers. Please regenerate this file with a newer version of protoc-c. | ||
#endif | ||
|
||
|
||
typedef struct _Chk__CheckReport Chk__CheckReport; | ||
typedef struct Chk__CheckReport Chk__CheckReport; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(style) do not add new typedefs
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-8941/3/execution/node/141/log |
Contain the following changes: 1. Replace 'bool repaired' as 'int32 result' for check report. 2. Remove uselss 'CIS_CRASHED' status for check instance. 3. Split 'CPS_STOPPED' from 'CPS_PAUSED' for pool status. 4. New instance status: CIS_IMPLICATED, for the case that the check on the engine exit because of other engine failure. 5. New pool status: CPS_IMPLICATED, for the case that the check on the pool is stopped because of other pool or engine failure. Related changes for control plane part with go will be in another patch. Signed-off-by: Fan Yong <[email protected]>
@mjmac , would you please review the patch and consider adjust control part related logic? Thanks! |
I still don't understand why you want to remove the reports array. I need them for the MS -> dmg part of the query response in order to show the admin which reports need a decision to be made. You can play with this on the branch with the following process:
You should see something like the following output:
In this example output, the "ms-label" and "ps-label" are just made up strings. In reality, these would be provided in the details array of the report and would be filled in by the engine checker module that is filing the inconsistency report. If we remove the "useless" reports array, then I will have no way to display these reports to the admin. Unless you have some other idea in mind. |
Once "dmg system start --check" is triggered, the DAOS check instance will run on all related engines. And when some insosnsitency is detected, related check engine will report such inconsistency to the check leader via "CHK_REPORT" RPC. And then the check leader will forward such inconsistency report to the control plane via "dss_drpc_call(DRPC_METHOD_CHK_REPORT...)". If related check engine does not know how to repair such inconsistency, then the DRPC upcall will also contains the necessary information for the interaction with admin. So the control plane can record the report things locally and show them to the admin when "dmg check query" is triggered. Does it works? On the other hand, when "dmg check query" is triggered, the control plane will downcall the check leader, then check leader will reply some summary information about the check: the check instance status (running, failed, paused, and so on), related pool(s) status (checked, pending, checked, failed, and so on), how many inconsistency are found/repaired/ignored, estimated remaining time, and so on. But the former report (via DRPC upcall) detailed inconsistency information will NOT be replied again by the check leader since the control plane has already known them. |
Yes, that is how the current design works. As the check engine finds inconsistencies, they are sent to the control plane Management Service (MS). The MS stores the reports in its database. In the case of inconsistencies that require input from the admin (CIA_INTERACT), the default behavior for As an aside, I know that the choice numbers should be fixed (e.g. 1, 2, 3) -- I just need to create a map from the UI choices to the action enum values. The report seqnum is kind of ugly as an identifier, but I'm not sure if there is a better way to handle it, as the ID space is uint64.
Yes, I understand. I do not expect the check engine to send all reports in response to the query dRPC. As you said, the MS is responsible for holding all of the reports and will determine which should be sent in the response to dmg. To be clear, I am intending that we use the
Does that make sense? |
That may work, but I am not sure whether it is efficient enough or not. Because if the admin does not perform "dmg check query", they will not know and will not be notified that there are some interaction waiting for his/her feedback, that may cause related check engine(s) to be blocked for very long time. A possible enhancement is that when check leader report some inconsistency with interaction requirement, the control plane can print related interaction message on the console, so the admin can easily to see there are something to be handled by him/her. And then the admin can use "dmg check query" to show more information as you described above and make the choices for related interaction reports.
The 64-bits sequence number is the unique identifier used by the check engine to locate the found inconsistency. It is generated by the check leader engine and pass to the control plan via check report DRPC upcall, and when control plane triggers "DRPC_METHOD_MGMT_CHK_ACT" RPC for "dmg check act" after got choice from admin for some interaction, it needs to take related sequence number to the check leader. On the other hand, such 64-bits sequence number is between control plane and check leader engine. It is NOT necessary to show it to the admin if you think it is not friendly for admin to understand. The control plane can make some map between such sequence number and something shown to the admin. But that will introduce more work for your side.
Yes.
I think we can support more options to control the output to the admin for "dmg check query". For example:
And in the future, we may add more options to allow only show/filter specified inconsistencies reports. Anyway, that is the long-time plan. |
@Nasf-Fan: I think we are mostly in agreement now. Can we agree to disagree on the |
Sure, we can keep it, but the CHK_QUREY RPC reply will contains empty "reports". I will refresh my patch to only contains the other new added elements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Contain the following changes:
check on the engine exit because of other engine failure.
on the pool is stopped because of other pool or engine failure.
Related changes for control plane part with go will be in another patch.
Signed-off-by: Fan Yong [email protected]