Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pt2pt: add MPID_Allocate_vci #5904

Merged
merged 5 commits into from
Apr 19, 2022
Merged

pt2pt: add MPID_Allocate_vci #5904

merged 5 commits into from
Apr 19, 2022

Conversation

hzhou
Copy link
Contributor

@hzhou hzhou commented Mar 23, 2022

Pull Request Description

Reserved vci can be used to isolate communications. For example, stream-based progress should not involve GPU path. Using a reserved vci that normal GPU traffic won't touch can ensure that. Similarly, we can use a reserved vci for dynamic process connections.

Since users always know exactly their thread mapping, it may be simpler and more reliable to let user directly specify vci, rather than we simplicity do hashing. To allow such explicit vci extensions, we need a way to pass down the vci information to device layer. Extending the attr parameter can achieve that.

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your companies PR approval manager.

@hzhou hzhou marked this pull request as draft March 23, 2022 19:41
@hzhou hzhou force-pushed the 2203_reserv_vci branch 3 times, most recently from d2e2305 to 3dae4c4 Compare March 23, 2022 21:57
@hzhou
Copy link
Contributor Author

hzhou commented Mar 23, 2022

test:mpich/custom --with-ch4-reserved-vcis=2
netmod: ch4:ofi
✔️

==== OFI dyanamic settings ====
num_vnis: 3
num_nics: 1

We got the 2 extra vcis!

I believe we can use the two extra vcis with global critical sections as well, but that is to be tested.

EDIT: outdated

@hzhou hzhou marked this pull request as ready for review March 24, 2022 00:10
@hzhou hzhou added the RFC Request for comment label Mar 24, 2022
@hzhou hzhou force-pushed the 2203_reserv_vci branch from 3dae4c4 to 723273e Compare April 14, 2022 16:43
@hzhou hzhou requested a review from raffenet April 14, 2022 16:43
@hzhou hzhou force-pushed the 2203_reserv_vci branch from 723273e to fc07d1d Compare April 15, 2022 21:49
@hzhou hzhou changed the title pt2pt: changes to allow reserved vci and explicit vci pt2pt: add MPID_Allocate_vci Apr 15, 2022
@hzhou hzhou force-pushed the 2203_reserv_vci branch from fc07d1d to e7a9041 Compare April 15, 2022 21:50
@hzhou hzhou force-pushed the 2203_reserv_vci branch from e7a9041 to ac1c62d Compare April 15, 2022 22:57
@hzhou
Copy link
Contributor Author

hzhou commented Apr 15, 2022

test:mpich/ch3/most
test:mpich/ch4/most
✔️

@hzhou hzhou added ready-for-review and removed RFC Request for comment labels Apr 16, 2022
@hzhou hzhou mentioned this pull request Apr 17, 2022
7 tasks
Copy link
Contributor

@raffenet raffenet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused by this statement

VCI 0 is the default global vci, that is always used for implicit vci hashing.

I believe we're moving to passing explicit VCIs thru the ADI in some cases, but I would think that in the implicit case the VCI information is ignored by the device.

Comment on lines 417 to 418
MPIR_Assert(MPIR_CVAR_CH4_NUM_VCIS >= 1); /* maximum number vcis can be reserved */
MPIR_Assert(MPIR_CVAR_CH4_RESERVE_VCIS >= 0); /* maximum number vcis can be reserved */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should both of these lines have the same comment? i think the first line is illustrating something separate from the max number of vcis that can be reserved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the comment for MPIR_CVAR_CH4_NUM_VCIS. That was the original comment, but now we changed its meaning to number of implicit vcis rather than maximum number of vcis.

@hzhou hzhou force-pushed the 2203_reserv_vci branch from ac1c62d to e4ff5a5 Compare April 18, 2022 20:23
@hzhou
Copy link
Contributor Author

hzhou commented Apr 18, 2022

I'm a little confused by this statement

VCI 0 is the default global vci, that is always used for implicit vci hashing.

I believe we're moving to passing explicit VCIs thru the ADI in some cases, but I would think that in the implicit case the VCI information is ignored by the device.

I updated the commit to have MPID_Allocate_vci return explicit error code (and use vci pointer for value out). Using 0 to indicate error is a bit hacky. The corresponding commit message is now removed.

@hzhou hzhou force-pushed the 2203_reserv_vci branch 2 times, most recently from 63fc133 to f2f2305 Compare April 18, 2022 20:28
@hzhou hzhou requested a review from raffenet April 18, 2022 20:28
Comment on lines +953 to +956
**ch3nostream:Stream is not supported in ch3.
**ch4nostream:No streams available. Configure --enable-thread-cs=per-vci and --with-ch4-max-vcis=# to enable streams.
**outofstream:No streams available. Use MPIR_CVAR_CH4_RESERVE_VCIS to reserve the number of streams can be allocated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: its not obvious that vci is what is meant by "stream" in these error messages

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you suggest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the issue is that this error message is meant for callers of MPIX_Stream_create? In that case it makes sense, its just odd to try to allocate X and get an error message that there are no more Y.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Right, the message is meant for MPIX_Stream_create. Only the lower layer has the details on what is wrong, so we have to craft the message there.

Add a cvar to set number of vcis that user can explicitly allocate,
using to-be-added MPIX_Stream_create. The total number of runtime vcis
will be the sum of MPIR_CVAR_CH4_NUM_VCIS + MPIR_CVAR_CH4_RESERVE_VCIS.
hzhou added 4 commits April 19, 2022 14:10
There will be two sets runtime vci pools. MPIDI_global.n_vcis will be
implicit hashing pool -- we are keeping the name to avoid too much code
churn. MPIDI_global.n_reserved_vcis will be explicit vci pool, which
only can be used by explicit allocation, e.g. MPIX_Stream_create.
Other than implicit vci hashing, the rest of the code path should not be
aware of the distinction of implicit vci vs reserved vci. Use
n_total_vcis as the total number of available vcis.
Initialize netmod to support MPIDI_glboal.n_total_vcis.

Now that all netmod and shmmod support multiple vcis, it is simpler to
move the mod logic into the ch4-layer hashing functions. Netmod still
can add another mod or simply overwrite the vci if it doesn't support
multiple vci or support less number of vcis. For now, we remove them for
cleaner code.

MPIR_CVAR_CH4_OFI_MAX_VNIS and MPIR_CVAR_CH4_UCX_MAX_VNIS are removed
since we can't have arbitrary vnis anyway.

Moving the mod into hashing functions allows implementing the reserved
vci logic.
This ADI allows MPIR layer to request for explicit vcis.
@hzhou hzhou force-pushed the 2203_reserv_vci branch from f2f2305 to 6252a53 Compare April 19, 2022 19:11
@hzhou hzhou merged commit 2067636 into pmodels:main Apr 19, 2022
@hzhou hzhou deleted the 2203_reserv_vci branch April 19, 2022 19:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants