-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-14181 control,bio,mgmt: Return NVMe details over dRPC #13382
DAOS-14181 control,bio,mgmt: Return NVMe details over dRPC #13382
Conversation
Bug-tracker data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/1/testReport/ |
Features: control Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
bba5921
to
00e5ca5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
}) | ||
} | ||
} | ||
//func TestServer_CtlSvc_adjustNvmeSize(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be removed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will be in the follow-up.
}) | ||
} | ||
} | ||
//func TestServer_CtlSvc_StorageScan_PostEngineStart(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be removed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove in the follow-up, I'm leaving commented to remind me to make sure everything is covered.
00e5ca5
to
87d69cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Notes to reviewers: NVMe controller info is returned in SMD list dRPC query and protobuf and native go structs link a SMD with a controller and a single namespace. The changes are in the following areas:
As a guidance so as not to burden anyone with too much to review I suggest the following (feel free to review more): The PR doesn't introduce any outside visible change and attempts keeping parity with the existing functionality whilst adding info to the dRPC SMD list devs messages that isn't yet consumed in the control plane. The following PR will remove the bdev scan cache and instead update live over dRPC. A number of go unit tests are removed in src/control/server/ctl_storage_rpc_test.go and will be reinstated or replaced during the bdev scan removal in the following PR. It didn't make sense to first fix up a large number of tests just to then remove them. |
87d69cf
to
729d730
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/4/testReport/ |
Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13382/4/execution/node/1162/log |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly OK for what I understand.
Just a couple of questions.
@@ -119,6 +119,8 @@ struct mgmt_bio_health { | |||
|
|||
int ds_mgmt_bio_health_query(struct mgmt_bio_health *mbh, uuid_t uuid); | |||
int ds_mgmt_smd_list_devs(Ctl__SmdDevResp *resp); | |||
void |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange indentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mandated by clang-format, not going to fight the linter we have in place to make formatting consistent
return -DER_INVAL; | ||
} | ||
|
||
len = strnlen(src, NVME_DETAIL_BUFLEN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return value of strnlen()
should be tested.
len = strnlen(src, NVME_DETAIL_BUFLEN); | |
if (len = strnlen(src, NVME_DETAIL_BUFLEN)) >= NVME_DETAIL_BUFLEN) { | |
D_ERROR("attempting to copy an invalid source"); | |
return -DER_INVAL; | |
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would you test the implementation of a c built-in function? that's already tested by the C library It is a part of. You have to be able to rely on the functionality of the c lib functions being correct IMO. strnlen will only return the maximum of the second parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, if len == NVME_DETAIL_BUFLEN, then it means that src is too long and thus some part of it will not be copied at line 315.
However, if we are sure that src is always correct, then I got not issue to test len or even use strlen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I understand your comment now, fixed
|
||
return 0; | ||
} | ||
|
||
int | ||
ds_mgmt_smd_list_devs(Ctl__SmdDevResp *resp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT, this function is quite big; it would help to understand and maintain it, if it was refactorize in smaller functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I will refactor
for _, sd := range listDevsResp.Devices { | ||
if sd != nil { | ||
rResp.Devices = append(rResp.Devices, sd) | ||
//&ctlpb.SmdQueryResp_SmdDeviceWithHealth{Details: sd}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To remove ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed in PR-13385, done
src/bio/bio_config.c
Outdated
@@ -376,11 +385,12 @@ load_vmd_subsystem_config(struct json_config_ctx *ctx, bool *vmd_enabled) | |||
|
|||
D_ASSERT(ctx->config_it != NULL); | |||
D_ASSERT(vmd_enabled != NULL); | |||
D_ASSERT(*vmd_enabled == false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not an expert, but not understand why VMD could be enabled in only one VMD subsystem.
I would expect the opposite to be enabled for all or nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just Ensuring the correct flow, at this point the vmd_enabled should not have been set yet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the call of this function in check_vmd_status() is done in a loop, I was expecting that we could find the NVME_CONF_ENABLE_VMD environment variable set in different subsystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there should only be one vmd subsystem in a config but I don't think there is any need to put a restriction on that in this code so I've removed this check, thanks
|
||
if (copy_ascii(cdst->fw_rev, sizeof(cdst->fw_rev), cdata->fr, | ||
sizeof(cdata->fr)) != 0) | ||
len = sizeof(src); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure to understand how the sizeof could retrieve the length of the array holds by src.
In this case it will only returns the size of a void* pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like a bug, thanks, fixed
…move-bdev-scan-cache-pt1-proto
Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/5/testReport/ |
…move-bdev-scan-cache-pt1-proto Required-githooks: true
Features: control Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go changes look good. A couple questions on C stuff.
if (copy_ascii(*dst, len + 1, src, len) != 0) { | ||
perror("copy_ascii"); | ||
return -NVMEC_ERR_CHK_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's a static function with a pretty limited scope, that seems okay to me. It makes it a bit harder to keep track mentally of the state of the memory, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C parts looks good for me. ^_^
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13382/17/execution/node/1378/log |
Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/17/testReport/ |
https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13382/17/tests/ build failed with known failures:
|
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13382/18/execution/node/354/log |
Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/18/testReport/ |
https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13382/18/tests/ build failed with known failures:
|
@NiuYawei @daos-stack/daos-gatekeeper can this PR be force landed please, the consistent known failures identified in the comments are expected when running with Features: control Commit pragma. |
@phender has advised me (after direct questioning) that there are @daos-stack/daos-gatekeeper concerns regarding lack of testing in this PR. First of all why has this not been commented in the PR that this is a concern after 5 days? This should have been brought up earlier and transparently communicated. Please can we have more transparency in gatekeeping to make the process more efficient. Thanks in advance. With regards to testing, the PR has extensive unit test updates and storage scan is effectively covered in functional tests. |
I am on the gatekeeping channel, and I asked a question there if testing was enough for this PR. Im not sure how this was translated as a concern that there is not enough testing. I do not have enough knowledge of this code base and it was just a genuine question to other gatekeepers who might be more familiar. anyway this was just a question and not a concern. also please consider breaking PRs into smaller ones; otherwise you are getting more into the feature branch territory ;-) |
The tests have already been identified as known issues that exist within the subset run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BIO change looks good to me. Sorry, I overlooked prior comments.
…move-bdev-scan-cache-pt1-proto Features: control Signed-off-by: Tom Nabarro <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. No errors found by checkpatch.
I have triggered tests again, there are no code changes in this PR between CI runs 17-19 and the failures are the same known issues as mentioned in previous comments. I believe the PR can be force landed based on CI test coverage, unit test coverage and review approvals. TIA |
@mchaarawi can we please land this PR now we have an extensive list of review approvals and both https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13382/17/pipeline and https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13382/18/pipeline only failed on known issues that are expected to occur when |
Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/19/testReport/ |
Enable fetching of NVMe controller (SSD) details over dRPC. This is
required to get updated SPDK discovery results after an NVMe SSD is
hotplugged as the newly added device will be claimed by the engine.
Once claimed the device cannot be accessed by the control-plane.
This change also enables the reduction of complexity in the
control-plane by moving to a position where the bdev scan cache,
which was previously implemented to mitigate the situation described
above, can be removed. This removal will be performed in a subsequent
change.
Required-githooks: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: