DAOS-16838 control: Fix dmg storage query usage with emulated NVMe #15545

Merged
daltonbohning merged 3 commits into master from tanabarr/control-aio-usagequery-fix
Dec 19, 2024

Conversation

tanabarr
Contributor

@tanabarr tanabarr commented Nov 29, 2024

Fix a regression that prevents dmg storage query usage from
enumerating devices backed by emulated (AIO file or kdev) NVMe.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@tanabarr tanabarr added labels Nov 29, 2024: bug; control-plane (work on the management infrastructure of the DAOS Control Plane); usability (changes specific to user-facing tools or behaviour)
@tanabarr tanabarr self-assigned this Nov 29, 2024

github-actions bot commented Nov 29, 2024

Ticket title is 'Fix dmg storage query usage with emulated NVMe'
Status is 'In Review'
Labels: 'GCP,SPDK,control-plane'
https://daosio.atlassian.net/browse/DAOS-16838

@daosbuild1
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15545/1/execution/node/333/log

@daosbuild1
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15545/1/execution/node/273/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15545/1/execution/node/398/log

@daosbuild1
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15545/1/execution/node/395/log

@daosbuild1
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15545/1/execution/node/372/log

@tanabarr tanabarr force-pushed the tanabarr/control-aio-usagequery-fix branch from a3196ea to fe52d98 Compare November 29, 2024 15:43
@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15545/2/execution/node/1463/log

@tanabarr tanabarr force-pushed the tanabarr/control-aio-usagequery-fix branch from fe52d98 to c09dc7b Compare December 2, 2024 09:15
@tanabarr tanabarr marked this pull request as ready for review December 2, 2024 09:32
@tanabarr tanabarr requested review from a team as code owners December 2, 2024 09:32
@tanabarr tanabarr requested review from mjmac, kjacque and knard38 December 2, 2024 09:33
@tanabarr tanabarr force-pushed the tanabarr/control-aio-usagequery-fix branch 2 times, most recently from 3045784 to b46e32a Compare December 2, 2024 10:44
Features: control
Required-githooks: true

Signed-off-by: Tom Nabarro <[email protected]>
@tanabarr tanabarr force-pushed the tanabarr/control-aio-usagequery-fix branch from b46e32a to 77bbb86 Compare December 2, 2024 11:26
@tanabarr
Contributor Author

tanabarr commented Dec 3, 2024

NLT is failing on an unrelated dfuse valgrind issue.

knard38
knard38 previously approved these changes Dec 4, 2024
Contributor

@knard38 knard38 left a comment

LGTM

kjacque
kjacque previously approved these changes Dec 5, 2024
Contributor

@kjacque kjacque left a comment

Overall looks good. The comment is more of a question/suggestion.

Comment on lines 351 to 352
// Mock identifier for emulated NVMe mode where devices have no PCI-address.
addr = fmt.Sprintf("0000:00:0.%d", i)

Contributor

My only question is whether this could conflict with a real device, since it is used as a key in the seenCtrlrs map. I wonder if it would be better to use a "name", which could be either the PCI address for a real device or the file name for emulated NVMe.

Contributor

In DPDK, PCI_ANY_ID is defined as 0xffff: https://doc.dpdk.org/api/rte__pci_8h.html#a53aca768a081fcf56089353d805ab77c

That also seems to match the kernel: https://elixir.bootlin.com/linux/v6.12.1/source/include/linux/mod_devicetable.h#L18

Might be better to use that value, as there is zero possibility of conflicting with a real vendor ID that way?

Contributor Author

So are you suggesting FFFF:00:0.%d as the format string?

Contributor

I think so, if the goal is to avoid any conflict with a device that might be found via the topology scan. I can't find an official PCI spec to cite, but from what I found about how Linux and DPDK do things, that value is a "can't happen" value, i.e. there's no way that a valid device could show up with that vendor ID.

Contributor Author

Aren't we confusing the PCI address with the PCI ID here? https://wiki.debian.org/HowToIdentifyADevice/PCI

Contributor Author

The first segment of the PCI address is the domain, not the vendor ID, as far as I can see.
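
To make the address/ID distinction concrete, here is a minimal Go sketch. It is not DAOS code: `pciAddress` and `parsePCIAddress` are hypothetical names introduced only for illustration. It splits a Linux-style `domain:bus:device.function` address into its fields and rejects a `vendor:device` ID pair, which has a different shape.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pciAddress holds the fields of a Linux-style PCI address written as
// domain:bus:device.function, where every field is hexadecimal.
type pciAddress struct {
	Domain, Bus, Device, Function uint64
}

// parsePCIAddress is a hypothetical helper (not part of DAOS) that splits
// an address string into its four fields.
func parsePCIAddress(s string) (pciAddress, error) {
	var a pciAddress
	parts := strings.Split(s, ":")
	if len(parts) != 3 {
		return a, fmt.Errorf("want domain:bus:device.function, got %q", s)
	}
	devFn := strings.Split(parts[2], ".")
	if len(devFn) != 2 {
		return a, fmt.Errorf("want device.function, got %q", parts[2])
	}
	fields := []string{parts[0], parts[1], devFn[0], devFn[1]}
	vals := make([]uint64, len(fields))
	for i, f := range fields {
		v, err := strconv.ParseUint(f, 16, 32)
		if err != nil {
			return a, fmt.Errorf("field %q: %w", f, err)
		}
		vals[i] = v
	}
	return pciAddress{Domain: vals[0], Bus: vals[1], Device: vals[2], Function: vals[3]}, nil
}

func main() {
	// The first segment of an address is the domain.
	a, err := parsePCIAddress("0000:00:0.1")
	if err != nil {
		panic(err)
	}
	fmt.Printf("domain=%#x bus=%#x device=%#x function=%#x\n",
		a.Domain, a.Bus, a.Device, a.Function)

	// A PCI ID is a vendor:device pair (0x8086 is Intel's vendor ID); it is
	// not an address, so the parser rejects it.
	if _, err := parsePCIAddress("8086:0a54"); err != nil {
		fmt.Println("not an address:", err)
	}
}
```

The vendor ID identifies who made the device, while the address identifies where it sits on the bus, so a reserved vendor ID such as 0xffff says nothing by itself about which domain values are free.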

Contributor Author

@tanabarr tanabarr Dec 18, 2024

I've improved the resilience against collisions by using 0xffff as the PCI-domain segment and 0xf as the func value. Neither would be used in a "real" NVMe device address, as the domain segment is either 0x10000 or 0x0000 in almost all allocated addresses and the BDF func value is 0-7. Reference: https://dottedmag.net/blog/pci-basics/
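
The scheme described in this comment could look roughly like the Go sketch below. This is an illustration under stated assumptions, not the merged code: the helper names `mockEmulatedAddr` and `isMockAddr` are hypothetical, and placing the device index in the bus field is an assumption, since the comment only fixes the domain (0xffff) and function (0xf) values.

```go
package main

import (
	"fmt"
	"strings"
)

// Sentinel values taken from the review comment: a 0xffff domain and a 0xf
// function. Real PCI functions are numbered 0-7, so 0xf cannot occur.
const (
	mockDomain   = 0xffff
	mockFunction = 0xf
)

// mockEmulatedAddr is a hypothetical helper. ASSUMPTION: the index is
// encoded in the bus field, which allows 256 distinct mock addresses.
func mockEmulatedAddr(i int) (string, error) {
	if i < 0 || i > 0xff {
		return "", fmt.Errorf("index %d out of range 0-255", i)
	}
	return fmt.Sprintf("%04x:%02x:00.%x", mockDomain, i, mockFunction), nil
}

// isMockAddr reports whether addr carries both sentinel values.
func isMockAddr(addr string) bool {
	lower := strings.ToLower(addr)
	return strings.HasPrefix(lower, "ffff:") && strings.HasSuffix(lower, ".f")
}

func main() {
	// A map keyed by address, standing in for the seenCtrlrs map
	// mentioned earlier in the thread.
	seen := map[string]bool{
		"0000:81:00.0": true, // example of a real-looking address
	}
	for i := 0; i < 2; i++ {
		addr, err := mockEmulatedAddr(i)
		if err != nil {
			panic(err)
		}
		if seen[addr] {
			panic("collision with an existing key: " + addr)
		}
		seen[addr] = true
		fmt.Println(addr, "mock:", isMockAddr(addr))
	}
}
```

Because the function nibble alone is outside the valid range, a mock key stays distinguishable from any well-formed real address even on a host that happens to use an unusual domain number.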

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15545/8/display/redirect

@daosbuild1
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15545/8/testReport/

@tanabarr tanabarr requested a review from a team December 19, 2024 20:58
@daltonbohning daltonbohning merged commit 1499fcc into master Dec 19, 2024
62 checks passed
@daltonbohning daltonbohning deleted the tanabarr/control-aio-usagequery-fix branch December 19, 2024 22:48
6 participants