DAOS-14985 bio: Handle NVMe unplugged in list devs #13614

Merged
merged 8 commits into from
Jan 25, 2024


@tanabarr tanabarr commented Jan 16, 2024

dmg storage query list-devices fails if run after an NVMe device is
unplugged during a physical hot-remove. The failure occurs because the
UNPLUGGED NVMe device state results in an SMD object that is not
populated with an NVMe controller reference. Unplugged devices should
be reported in list-devices and ignored in storage scan, so handle the
UNPLUGGED device state without returning an error.

Features: control nvme
Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr requested a review from a team as a code owner January 16, 2024 14:18
@tanabarr tanabarr requested review from mjmac and kjacque and removed request for a team January 16, 2024 14:18
@tanabarr tanabarr self-assigned this Jan 16, 2024

Bug-tracker data:
Ticket title is 'dmg storage query list-devices fails if device is unplugged'
Status is 'In Review'
Labels: 'hotplug'
https://daosio.atlassian.net/browse/DAOS-14985


@NiuYawei NiuYawei left a comment


Is this a regression introduced by removing device cache on control plane?

Besides the 'unplugged' case, please be aware that there is an 'unused' case: a device is plugged but not used by DAOS (not present in SMD). Is this case handled properly?

@daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13614/1/testReport/

@@ -198,6 +198,9 @@ func scanEngineBdevsOverDrpc(ctx context.Context, engine Engine, pbReq *ctlpb.Sc
return nil, errors.Errorf("smd %q has no ctrlr ref", sd.Uuid)
}

if sd.Ctrlr.DevState == ctlpb.NvmeDevState_UNPLUGGED {

I wonder if it would be safer to skip the controller unless its state is in a set of known-good/expected states.

Contributor Author (@tanabarr):

I have updated the PR to address this.
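
The allow-list approach suggested above can be sketched roughly as follows. This is a minimal illustration, not the DAOS code: `DevState`, `smdDevice`, `scannableStates` and `filterScannable` are hypothetical names standing in for the protobuf types, and the three accepted states (NORMAL, NEW, EVICTED) are taken from the revised condition quoted later in this thread.

```go
package main

import "fmt"

// DevState is a hypothetical stand-in for ctlpb.NvmeDevState.
type DevState int

const (
	StateUnknown DevState = iota
	StateNormal
	StateNew
	StateEvicted
	StateUnplugged
)

// smdDevice is a hypothetical, simplified view of an SMD device entry.
type smdDevice struct {
	UUID  string
	State DevState
}

// scannableStates is the allow-list of known-good states. Any state not
// listed here (including ones added in the future) is skipped by default.
var scannableStates = map[DevState]bool{
	StateNormal:  true,
	StateNew:     true,
	StateEvicted: true,
}

// filterScannable returns only the devices whose state is in the
// allow-list. Other devices are skipped rather than reported as errors.
func filterScannable(devs []smdDevice) []smdDevice {
	var out []smdDevice
	for _, d := range devs {
		if !scannableStates[d.State] {
			continue
		}
		out = append(out, d)
	}
	return out
}

func main() {
	devs := []smdDevice{
		{UUID: "dev-a", State: StateUnplugged},
		{UUID: "dev-b", State: StateNormal},
	}
	for _, d := range filterScannable(devs) {
		fmt.Println("scanning", d.UUID)
	}
}
```

Compared with a single `== UNPLUGGED` check, an allow-list means an unexpected or newly introduced state is skipped instead of being passed on to the scan path.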

@tanabarr

> Is this a regression introduced by removing device cache on control plane?
>
> Besides the 'unplugged' case, please be aware that there is an 'unused' case: a device is plugged but not used by DAOS (not present in SMD). Is this case handled properly?

I don't think this is specifically a regression introduced by a recent PR; it's just not something we currently test in CI. @shimizukko is working on adding these tests using hotplug event emulation. When a device is plugged but not used by DAOS, it will not be reported by either list-devices or dmg storage scan when engines are running. It will be returned if engines are running and either no bdevs are specified in the server config or its PCI address is specified in the server config. Does that sound acceptable?

Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr requested review from NiuYawei and mjmac January 18, 2024 14:38
Comment on lines 201 to 203
if sd.Ctrlr.DevState != ctlpb.NvmeDevState_NORMAL &&
sd.Ctrlr.DevState != ctlpb.NvmeDevState_NEW &&
sd.Ctrlr.DevState != ctlpb.NvmeDevState_EVICTED {

This will work, but if it were me I'd add a method to check this on NvmeController, e.g. sd.Ctrlr.Scannable(). That keeps the definition of what makes the controller scannable in one place, so it's easier to maintain when the inevitable future changes happen (e.g. to protobufs, etc).

Contributor Author (@tanabarr):

I can do this. I should probably also do the same for the list of states that can be queried for health, as that list is different.
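
The suggestion in this thread, keeping each state check behind a method on the controller type, might look something like the sketch below. All names here are hypothetical stand-ins rather than the real protobuf-generated types. The `Scannable` states mirror the condition under review, while the states accepted by `HealthQueryable` are only a placeholder, since the thread does not say which states qualify.

```go
package main

import "fmt"

// NvmeDevState is a hypothetical stand-in for ctlpb.NvmeDevState.
type NvmeDevState int

const (
	DevStateUnknown NvmeDevState = iota
	DevStateNormal
	DevStateNew
	DevStateEvicted
	DevStateUnplugged
)

// NvmeController is a hypothetical, simplified controller record.
type NvmeController struct {
	PciAddr  string
	DevState NvmeDevState
}

// Scannable reports whether the controller is in a state where a bdev
// scan is expected to work. A nil controller is never scannable.
func (c *NvmeController) Scannable() bool {
	if c == nil {
		return false
	}
	switch c.DevState {
	case DevStateNormal, DevStateNew, DevStateEvicted:
		return true
	default:
		return false
	}
}

// HealthQueryable reports whether health stats can be requested.
// ASSUMPTION: the accepted states are placeholders for illustration only.
func (c *NvmeController) HealthQueryable() bool {
	if c == nil {
		return false
	}
	return c.DevState == DevStateNormal || c.DevState == DevStateEvicted
}

func main() {
	ctrlrs := []*NvmeController{
		{PciAddr: "hypothetical-addr-0", DevState: DevStateNormal},
		{DevState: DevStateUnplugged},
		nil,
	}
	for i, c := range ctrlrs {
		fmt.Printf("ctrlr %d: scannable=%v health-queryable=%v\n",
			i, c.Scannable(), c.HealthQueryable())
	}
}
```

With this shape, the call site reduces to something like `if !sd.Ctrlr.Scannable() { continue }`, and a change to the set of states only touches the method body.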

…ged-state

Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr

[tanabarr@wolf-314 daos]$ install-rocky/bin/dmg storage query list-devices -i
---------
localhost
---------
  Devices
    UUID:39e0f1a3-42f2-45f2-b8b3-86482af5a407 [TrAddr:]
      Roles: Targets:[0 2 4 6] Rank:0 State:UNPLUGGED LED:OFF
    UUID:af75dfdd-2c49-4722-9e67-d8559bd1f863 [TrAddr:5d0505:03:00.0]
      Roles: Targets:[1 3 5 7] Rank:0 State:NORMAL LED:OFF

mjmac previously approved these changes Jan 18, 2024
knard38 previously approved these changes Jan 19, 2024

@knard38 knard38 left a comment


LGTM

@daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13614/3/testReport/

shimizukko previously approved these changes Jan 20, 2024
}
D_ALLOC_PTR(ctrlr->namespaces[0]);
if (ctrlr->namespaces[0] == NULL) {
return -DER_NOMEM;

[defect] ctrlr->namespaces leaked?

Contributor Author (@tanabarr):

Yes, thanks for spotting that; it is fixed in the relevant free function.

Features: nvme control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>

@wangshilong wangshilong left a comment


C parts look good.

D_FREE(dev->ctrlr->namespaces[0]);
D_FREE(dev->ctrlr->namespaces);
dev->ctrlr->namespaces = NULL;

[style] D_FREE() already assigns NULL to the pointer after freeing it.

Contributor Author (@tanabarr):

Thank you. I was following the pattern in list_devs, but I will remove both NULL assignments the next time I visit this area.

@daosbuild1

Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13614/4/testReport/

@tanabarr

tanabarr commented Jan 23, 2024

https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13614/4/tests/

CI run with pragmas "Features: control nvme"

@daosbuild1

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13614/4/testReport/

@tanabarr tanabarr requested a review from a team January 23, 2024 23:13
@tanabarr tanabarr added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jan 23, 2024
@tanabarr

@NiuYawei could you please approve the PR? @mchaarawi is asking for this before it can be force landed. Thanks.

@tanabarr tanabarr requested a review from a team January 25, 2024 11:56
@tanabarr

GATEKEEPER: Please use the PR title and description above as the commit message when landing. TIA

@mchaarawi mchaarawi merged commit e86a3b3 into master Jan 25, 2024
48 of 51 checks passed
@mchaarawi mchaarawi deleted the tanabarr/bio-unplugged-state branch January 25, 2024 15:01
mjmac added a commit that referenced this pull request Jan 26, 2024
Labels
bug forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. hotplug

8 participants