Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ch4/ofi: Add psm3 capability set #5864

Merged
merged 2 commits into from
Apr 29, 2022
Merged

Conversation

nitbhat
Copy link
Contributor

@nitbhat nitbhat commented Feb 23, 2022

Pull Request Description

This PR adds a capability set for the PSM3 provider. All of the current configuration is borrowed from the PSM2 provider.

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

@nitbhat nitbhat changed the title Add psm3 capability set ch4: Add psm3 capability set Feb 23, 2022
@nitbhat nitbhat changed the title ch4: Add psm3 capability set ch4/ofi: Add psm3 capability set Feb 23, 2022
@hzhou
Copy link
Contributor

hzhou commented Feb 23, 2022

We disabled libfabric psm3 --

ofi_subdir_args="--enable-embedded --disable-psm3"
-- due to compile issues on our Jenkins. We need to revisit the issue and see if it can be resolved.

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 5, 2022

@hzhou, shall I enable psm3 and test this on jenkins to see what compile issues you're getting?

@hzhou
Copy link
Contributor

hzhou commented Apr 5, 2022

@hzhou, shall I enable psm3 and test this on jenkins to see what compile issues you're getting?

Sure. Go ahead

@hzhou
Copy link
Contributor

hzhou commented Apr 5, 2022

@nitbhat When you done, use the custom test for it. For example -- #5904 (comment)

The default review tests will use pre-built modules, which skips the rebuild of libfabric.

@nitbhat nitbhat force-pushed the add_psm3_cap_set branch from cec1284 to faab8e6 Compare April 5, 2022 21:07
@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 5, 2022

test:mpich/custom
netmod: ch4:ofi

@nitbhat nitbhat force-pushed the add_psm3_cap_set branch from faab8e6 to 99baefb Compare April 5, 2022 21:40
@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 5, 2022

test:mpich/custom
netmod: ch4:ofi

@hzhou
Copy link
Contributor

hzhou commented Apr 5, 2022

Looks like it is working now. For reference, the original reason we disabled psm3 is here -- #5344 (comment)

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 6, 2022

@hzhou: The build looks good and the tests seem to be working. But, is there a way I can specify to the test framework on jenkins to use the psm3 provider? (maybe with netmod:ch4:ofi:psm3?)

I see that sockets is used as the provider for the tests.

@hzhou
Copy link
Contributor

hzhou commented Apr 6, 2022

@hzhou: The build looks good and the tests seem to be working. But, is there a way I can specify to the test framework on jenkins to use the psm3 provider? (maybe with netmod:ch4:ofi:psm3?)

I see that sockets is used as the provider for the tests.

Our Jenkins only have Infiniband card. Does psm3 work with that?

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 6, 2022

Yes, psm3 works with Infiniband, it is a performant alternative to sockets over IB.

@hzhou
Copy link
Contributor

hzhou commented Apr 6, 2022

test:mpich/custom
netmod: ch4:ofi
env: FI_PROVIDER=psm3

@hzhou
Copy link
Contributor

hzhou commented Apr 6, 2022

test:mpich/ch4/ofi

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 6, 2022

Currently, the PSM3 provider also needs another environment variable to work correctly, PSM3_MULTI_EP=1. I'll relaunch the tests with that if the tests failing, which I'm guessing will.

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 21, 2022

test:mpich/custom
netmod: ch4:ofi
env: FI_PROVIDER=psm3

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 27, 2022

There is only one failure seen for psm3. I have seen this issue before and it happens intermittently for some tests. I've opened #5975 and will add an xfail to this PR.

@nitbhat nitbhat force-pushed the add_psm3_cap_set branch 2 times, most recently from 7fdb3f0 to 2e042ac Compare April 27, 2022 15:06
@@ -86,6 +86,9 @@
# Sunf90 forbid passing cray pointer as integer
* solstudio * * * /^allocmemf90/ xfail=ticket0 f90/ext/testlist

# psm3 specific failures
* * * ch4:ofi:psm3 * /^p_red .*/ xfail=issue5975 coll/testlist
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hzhou: Is the right way to xfail a psm3 specific failure?

@hzhou
Copy link
Contributor

hzhou commented Apr 27, 2022

There is only one failure seen for psm3. I have seen this issue before and it happens intermittently for some tests. I've opened #5975 and will add an xfail to this PR.

The xfail entry won't work against the custom tests since psm3 is selected via FI_PROVIDER, not configure device option. Since we don't have review tests targeting psm3 yet, the xfail entries is not necessary. Once we add such jobs, we can add xfail entries targeting the particular job or jenkins config option. Hopefully, the bug is fixed by then 🤞

@hzhou
Copy link
Contributor

hzhou commented Apr 27, 2022

test:mpich/custom
netmod: ch4:ofi

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 27, 2022

There is only one failure seen for psm3. I have seen this issue before and it happens intermittently for some tests. I've opened #5975 and will add an xfail to this PR.

The xfail entry won't work against the custom tests since psm3 is selected via FI_PROVIDER, not configure device option. Since we don't have review tests targeting psm3 yet, the xfail entries is not necessary. Once we add such jobs, we can add xfail entries targeting the particular job or jenkins config option. Hopefully, the bug is fixed by then 🤞

Okay, understood. I'll go ahead and remove that commit in that case.

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 27, 2022

test:mpich/custom
netmod: ch4:ofi

@hzhou
Copy link
Contributor

hzhou commented Apr 27, 2022

Since psm3 works on infiniband, we may consider using psm3 as default for testing. We'll need fix the issue before the switch.

@hzhou
Copy link
Contributor

hzhou commented Apr 27, 2022

@nitbhat Does psm3 make psm2 obsolete?

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 28, 2022

@nitbhat Does psm3 make psm2 obsolete?

@hzhou: No, psm2 continues to be the preferred provider for OPA.

Copy link
Contributor

@hzhou hzhou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hzhou
Copy link
Contributor

hzhou commented Apr 29, 2022

@nitbhat Approved. Could you rebase?

@nitbhat
Copy link
Contributor Author

nitbhat commented Apr 29, 2022

@nitbhat Approved. Could you rebase?

Yes, rebased.

@hzhou
Copy link
Contributor

hzhou commented Apr 29, 2022

test:mpich/warnings

@hzhou hzhou merged commit 6b09399 into pmodels:main Apr 29, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants