-
Notifications
You must be signed in to change notification settings - Fork 305
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-14985 bio: Handle NVMe unplugged in list devs #13614
Conversation
Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
Bug-tracker data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a regression introduced by removing device cache on control plane?
Besides the 'unplugged' case, please be aware that there is an 'unused' case: a device is plugged but not used by DAOS (not presented in SMD), is this case handled properly?
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13614/1/testReport/ |
@@ -198,6 +198,9 @@ func scanEngineBdevsOverDrpc(ctx context.Context, engine Engine, pbReq *ctlpb.Sc | |||
return nil, errors.Errorf("smd %q has no ctrlr ref", sd.Uuid) | |||
} | |||
|
|||
if sd.Ctrlr.DevState == ctlpb.NvmeDevState_UNPLUGGED { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if would be safer to skip the controller unless its state is in a set of known-good/expected states.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the PR to address this.
…ged-state Signed-off-by: Tom Nabarro <[email protected]>
I don't think this is specifically a regression introduced by a recent PR, it's not something we currently test in CI. @shimizukko is working on adding these tests using hotplug event emulation. When a device is plugged but not used by DAOS it will not be reported by either |
Features: control Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
if sd.Ctrlr.DevState != ctlpb.NvmeDevState_NORMAL && | ||
sd.Ctrlr.DevState != ctlpb.NvmeDevState_NEW && | ||
sd.Ctrlr.DevState != ctlpb.NvmeDevState_EVICTED { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will work, but if it were me I'd add a method to check this on NvmeController
, e.g. sd.Ctrlr.Scannable()
. That keeps the definition of what makes the controller scannable in one place, so it's easier to maintain when the inevitable future changes happen (e.g. to protobufs, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do this, probably should also do the same for the list of states that can be queried for health as that is different.
…ged-state Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
Features: control Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13614/3/testReport/ |
…ged-state Signed-off-by: Tom Nabarro <[email protected]>
} | ||
D_ALLOC_PTR(ctrlr->namespaces[0]); | ||
if (ctrlr->namespaces[0] == NULL) { | ||
return -DER_NOMEM; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[defect] ctrlr->namespaces leaked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, thanks for spotting, fixed in the relevant free function
Features: nvme control Required-githooks: true Signed-off-by: Tom Nabarro <[email protected]>
b423777
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
C parts looks good.
D_FREE(dev->ctrlr->namespaces[0]); | ||
D_FREE(dev->ctrlr->namespaces); | ||
dev->ctrlr->namespaces = NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[style] D_FREE() inside will assign NULL after free.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you I was following the pattern in list_devs but I will remove both NULL assignments when I visit this area next.
Test stage Functional Hardware Large completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13614/4/testReport/ |
https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-13614/4/tests/
CI run with pragmas "Features: control nvme" |
Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13614/4/testReport/ |
@NiuYawei could you please approve the PR, @mchaarawi is asking for this before it can be force landed. Thanks |
GATEKEEPER: Please use the PR title and description above as the commit message when landing. TIA |
dmg storage query list-devices fails if run after a NVMe device is
unplugged during a physical hot-remove. The failure is due to
UNPLUGGED NVMe device state resulting in a SMD object not being
populated with a NVMe controller reference. Unplugged devices should
be reported in list-devices and ignored in storage scan so handle
UNPLUGGED device state without returning an error.
Features: control nvme
Required-githooks: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: