DAOS-14181 control,bio,mgmt: Return NVMe details over dRPC #13382

Merged
merged 13 commits into from
Dec 13, 2023

Contributor

@tanabarr tanabarr commented Nov 23, 2023

Enable fetching of NVMe controller (SSD) details over dRPC. This is
required to get updated SPDK discovery results after an NVMe SSD is
hotplugged as the newly added device will be claimed by the engine.
Once claimed the device cannot be accessed by the control-plane.

This change also enables the reduction of complexity in the
control-plane by moving to a position where the bdev scan cache,
which was previously implemented to mitigate the situation described
above, can be removed. This removal will be performed in a subsequent
change.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@tanabarr tanabarr self-assigned this Nov 23, 2023

github-actions bot commented Nov 23, 2023

Bug-tracker data:
Ticket title is 'Remove bdev scan cache'
Status is 'In Review'
Labels: 'SPDK,drpc'
Errors are Unknown component
https://daosio.atlassian.net/browse/DAOS-14181

@tanabarr tanabarr requested a review from knard38 November 23, 2023 02:08
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/1/testReport/

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr force-pushed the tanabarr/control-remove-bdev-scan-cache-pt1-proto branch from bba5921 to 00e5ca5 Compare November 23, 2023 16:23
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@tanabarr tanabarr marked this pull request as ready for review November 23, 2023 16:28
@tanabarr tanabarr requested a review from a team as a code owner November 23, 2023 16:28
@tanabarr tanabarr requested review from kjacque and removed request for a team November 23, 2023 16:28
})
}
}
//func TestServer_CtlSvc_adjustNvmeSize(t *testing.T) {
Contributor

Should this be removed?

Contributor Author

It will be removed in the follow-up.

})
}
}
//func TestServer_CtlSvc_StorageScan_PostEngineStart(t *testing.T) {
Contributor

Should this be removed?

Contributor Author

I will remove it in the follow-up; I'm leaving it commented out to remind me to make sure everything is covered.

@tanabarr tanabarr force-pushed the tanabarr/control-remove-bdev-scan-cache-pt1-proto branch from 00e5ca5 to 87d69cf Compare November 23, 2023 16:29
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@tanabarr
Contributor Author

tanabarr commented Nov 23, 2023

Notes to reviewers:

NVMe controller info is returned in the SMD list dRPC query, and the protobuf and native Go structs link an SMD device with a controller and a single namespace. The changes are in the following areas:

  1. control-plane SPDK bindings src/control/lib/spdk
  2. drpc smd-list stack in engine mgmt and bio modules
  3. drpc smd-list stack in control-plane

As guidance, so as not to burden anyone with too much to review, I suggest the following (feel free to review more):
@NiuYawei and @wangshilong could you please review 2 and optionally 1.
@kjacque could you please review 1 and 3.
@mjmac and @knard-intel could you please review 3.

The PR doesn't introduce any externally visible change and attempts to keep parity with the existing functionality while adding info to the dRPC SMD list devs messages that isn't yet consumed in the control plane. The following PR will remove the bdev scan cache and instead update live over dRPC.

A number of Go unit tests are removed in src/control/server/ctl_storage_rpc_test.go and will be reinstated or replaced during the bdev scan cache removal in the following PR. It didn't make sense to first fix up a large number of tests just to then remove them.

@tanabarr tanabarr force-pushed the tanabarr/control-remove-bdev-scan-cache-pt1-proto branch from 87d69cf to 729d730 Compare November 23, 2023 16:40
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/4/testReport/

@daosbuild1
Collaborator

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13382/4/execution/node/1162/log

Contributor

@knard38 knard38 left a comment

Mostly OK from what I understand.
Just a couple of questions.

@@ -119,6 +119,8 @@ struct mgmt_bio_health {

int ds_mgmt_bio_health_query(struct mgmt_bio_health *mbh, uuid_t uuid);
int ds_mgmt_smd_list_devs(Ctl__SmdDevResp *resp);
void
Contributor

Strange indentation

Contributor Author

Mandated by clang-format; I'm not going to fight the linter we have in place to make formatting consistent.

return -DER_INVAL;
}

len = strnlen(src, NVME_DETAIL_BUFLEN);
Contributor

The return value of strnlen() should be tested.

Suggested change
len = strnlen(src, NVME_DETAIL_BUFLEN);
if ((len = strnlen(src, NVME_DETAIL_BUFLEN)) >= NVME_DETAIL_BUFLEN) {
	D_ERROR("attempting to copy an invalid source");
	return -DER_INVAL;
}

Contributor Author

Why would you test the implementation of a C library function? That's already tested by the C library it is part of. You have to be able to rely on the C library functions being correct, IMO. strnlen never returns more than the value of its second parameter.

Contributor

@knard38 knard38 Nov 27, 2023

From my understanding, if len == NVME_DETAIL_BUFLEN, it means that src is too long, and thus some part of it will not be copied at line 315.
However, if we are sure that src is always correct, then I have no issue with not testing len, or even with using strlen.

Contributor Author

ok I understand your comment now, fixed
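
For readers following this thread, the bound check being discussed can be sketched as below. This is a minimal illustration under stated assumptions, not the code merged in this PR: `copy_bounded` and `DETAIL_BUFLEN` are hypothetical names standing in for the real helper and for `NVME_DETAIL_BUFLEN`. The point is that a `strnlen()` result equal to the limit means no terminator was found within the limit, so the source would be truncated if copied.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the buffer limit discussed above. */
#define DETAIL_BUFLEN 8

/*
 * Copy src into dst (capacity dst_size) only if src has a NUL terminator
 * within its first DETAIL_BUFLEN bytes and the content fits in dst.
 * Returns 0 on success, -1 on invalid input or when src would be truncated.
 */
static int
copy_bounded(char *dst, size_t dst_size, const char *src)
{
	size_t len;

	if (dst == NULL || src == NULL || dst_size == 0)
		return -1;

	/*
	 * strnlen() never returns more than its second argument, so a result
	 * equal to the limit means no terminator was seen within the limit.
	 */
	len = strnlen(src, DETAIL_BUFLEN);
	if (len >= DETAIL_BUFLEN || len >= dst_size)
		return -1;

	memcpy(dst, src, len);
	dst[len] = '\0';
	return 0;
}
```

With this shape, the check is on the caller's input rather than on the C library: a short string is copied and terminated, while a string that fills or exceeds the limit is rejected instead of being silently cut.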


return 0;
}

int
ds_mgmt_smd_list_devs(Ctl__SmdDevResp *resp)
Contributor

Nit: this function is quite big; it would be easier to understand and maintain if it were refactored into smaller functions.

Contributor Author

@tanabarr tanabarr Nov 27, 2023

ok, I will refactor

for _, sd := range listDevsResp.Devices {
if sd != nil {
rResp.Devices = append(rResp.Devices, sd)
//&ctlpb.SmdQueryResp_SmdDeviceWithHealth{Details: sd})
Contributor

To be removed?

Contributor Author

removed in PR-13385, done

@@ -376,11 +385,12 @@ load_vmd_subsystem_config(struct json_config_ctx *ctx, bool *vmd_enabled)

D_ASSERT(ctx->config_it != NULL);
D_ASSERT(vmd_enabled != NULL);
D_ASSERT(*vmd_enabled == false);
Contributor

I'm not an expert, but I don't understand why VMD could be enabled in only one VMD subsystem.
I would expect the opposite: enabled for all or for none.

Contributor Author

This is just ensuring the correct flow; at this point vmd_enabled should not have been set yet.

Contributor

Since this function is called in a loop in check_vmd_status(), I was expecting that we could find the NVME_CONF_ENABLE_VMD environment variable set in different subsystems.

Contributor Author

There should only be one VMD subsystem in a config, but I don't think there is any need to enforce that restriction in this code, so I've removed this check. Thanks.
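
The flow discussed in this thread can be sketched as follows. It is a simplified illustration only: `struct subsystem`, `load_vmd_flag` and `check_vmd` are hypothetical stand-ins, not the real `load_vmd_subsystem_config()` or `check_vmd_status()` code. It shows why asserting that the flag is still false on entry would trip if a second subsystem were processed after one that already enabled VMD, and how the loop behaves once that assertion is dropped.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one subsystem entry in a config. */
struct subsystem {
	const char *name;
	bool        enables_vmd;
};

/*
 * Set *vmd_enabled if this subsystem enables VMD. The flag is never
 * cleared and is not required to be false on entry, so it may already
 * be true when a later subsystem is examined.
 */
static void
load_vmd_flag(const struct subsystem *ss, bool *vmd_enabled)
{
	if (strcmp(ss->name, "vmd") == 0 && ss->enables_vmd)
		*vmd_enabled = true;
}

/* Walk every subsystem; the result is true if any of them enabled VMD. */
static bool
check_vmd(const struct subsystem *list, size_t nr)
{
	bool   vmd_enabled = false;
	size_t i;

	for (i = 0; i < nr; i++)
		load_vmd_flag(&list[i], &vmd_enabled);

	return vmd_enabled;
}
```

In this sketch a config with two VMD entries is tolerated and simply yields an enabled flag, which matches the decision above not to restrict the number of VMD subsystems at this point in the code.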


if (copy_ascii(cdst->fw_rev, sizeof(cdst->fw_rev), cdata->fr,
sizeof(cdata->fr)) != 0)
len = sizeof(src);
Contributor

I'm not sure I understand how sizeof could retrieve the length of the array held by src.
In this case, won't it only return the size of a void * pointer?

Contributor Author

this looks like a bug, thanks, fixed
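
The pitfall identified here can be shown with a short sketch. The names `size_seen_by_callee` and `bounded_len` are hypothetical, and the parameter is assumed to be a `const char *` for illustration (the real signature is not shown in this thread). Once an array is passed to a function it is seen as a pointer, so `sizeof` yields the pointer size; the field size has to be passed in explicitly.

```c
#include <stddef.h>
#include <string.h>

/*
 * Inside a function, a pointer parameter carries no array size:
 * sizeof(src) is the size of the pointer itself, not of the buffer.
 */
static size_t
size_seen_by_callee(const char *src)
{
	return sizeof(src);
}

/*
 * Hypothetical helper: the caller passes the field size explicitly and
 * the callee measures the content with a scan bounded by that size.
 */
static size_t
bounded_len(const char *src, size_t src_size)
{
	return strnlen(src, src_size);
}
```

A caller holding a fixed-size field such as `char fr[8]` can use `sizeof(fr)` at the call site, where the array type is still visible, and hand that value to the helper.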

Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@daosbuild1
Collaborator

Test stage NLT on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/5/testReport/

…move-bdev-scan-cache-pt1-proto

Required-githooks: true
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@tanabarr tanabarr requested a review from wangshilong December 4, 2023 17:34
Contributor

@kjacque kjacque left a comment

Go changes look good. A couple questions on C stuff.

src/mgmt/srv_query.c
Comment on lines +271 to 273
if (copy_ascii(*dst, len + 1, src, len) != 0) {
perror("copy_ascii");
return -NVMEC_ERR_CHK_SIZE;
Contributor

Since it's a static function with a pretty limited scope, that seems okay to me. It makes it a bit harder to keep track mentally of the state of the memory, though.

@tanabarr tanabarr requested a review from kjacque December 5, 2023 11:49
Contributor

@wangshilong wangshilong left a comment

C parts look good to me. ^_^

@daosbuild1
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13382/17/execution/node/1378/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/17/testReport/

@tanabarr
Contributor Author

tanabarr commented Dec 7, 2023

@tanabarr tanabarr requested a review from a team December 7, 2023 14:37
@tanabarr tanabarr added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Dec 7, 2023
@daosbuild1
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-13382/18/execution/node/354/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/18/testReport/

@tanabarr
Contributor Author

@NiuYawei @daos-stack/daos-gatekeeper can this PR be force-landed, please? The consistent known failures identified in the comments are expected when running with the Features: control commit pragma.

@tanabarr
Contributor Author

@phender has advised me (after direct questioning) that there are @daos-stack/daos-gatekeeper concerns regarding lack of testing in this PR. First of all, why has this concern not been raised in the PR after 5 days? It should have been brought up earlier and communicated transparently. Please can we have more transparency in gatekeeping to make the process more efficient. Thanks in advance.

With regards to testing, the PR has extensive unit test updates and storage scan is effectively covered in functional tests.

@mchaarawi
Contributor

mchaarawi commented Dec 12, 2023

I am on the gatekeeping channel, and I asked there whether testing was enough for this PR. I'm not sure how this was translated into a concern that there is not enough testing. I do not have enough knowledge of this code base, and it was a genuine question to other gatekeepers who might be more familiar with it. Anyway, this was just a question and not a concern.
The main concern is that this is a huge PR and is missing a +1 from Niu to review the BIO code. Niu is also a gatekeeper, and I believe this is why several gatekeepers (including me) have deferred on this PR, in addition to the concern that the failing tests do look to be in the same area as the PR (not saying they are).

Also, please consider breaking PRs into smaller ones; otherwise you are getting more into feature-branch territory ;-)

@tanabarr
Contributor Author

The failing tests have already been identified as known issues that exist within the subset run with Features: control, as acknowledged by @phender. I have also broken this PR out from #13385 and have been discussing it extensively with the control-plane team (hence the 3 approvals). It is a fundamental change in the dRPC plumbing, as required by hotplug, for a ticket that is required for 2.6.

Contributor

@NiuYawei NiuYawei left a comment

The BIO change looks good to me. Sorry, I overlooked prior comments.

…move-bdev-scan-cache-pt1-proto

Features: control
Signed-off-by: Tom Nabarro <[email protected]>
Collaborator

@daosbuild1 daosbuild1 left a comment

LGTM. No errors found by checkpatch.

@tanabarr
Contributor Author

I have triggered the tests again. There are no code changes in this PR between CI runs 17-19, and the failures are the same known issues mentioned in previous comments. I believe the PR can be force-landed based on CI test coverage, unit test coverage and review approvals. TIA

@tanabarr
Contributor Author

@mchaarawi can we please land this PR now that we have an extensive list of review approvals, and both https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13382/17/pipeline and https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13382/18/pipeline only failed on known issues that are expected to occur when the Features: control pragma is used? Build 19 is likely to take another few days.

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13382/19/testReport/

@mchaarawi mchaarawi merged commit f34040a into master Dec 13, 2023
42 of 45 checks passed
@mchaarawi mchaarawi deleted the tanabarr/control-remove-bdev-scan-cache-pt1-proto branch December 13, 2023 22:50
Labels
forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.
9 participants