pt2pt: add MPID_Allocate_vci #5904
d2e2305 to 3dae4c4
test:mpich/custom --with-ch4-reserved-vcis=2
We got the 2 extra vcis! I believe we can use the two extra vcis with global critical sections as well, but that is to be tested. EDIT: outdated
test:mpich/ch3/most
I'm a little confused by this statement
VCI 0 is the default global vci, that is always used for implicit vci hashing.
I believe we're moving to passing explicit VCIs thru the ADI in some cases, but I would think that in the implicit case the VCI information is ignored by the device.
src/mpid/ch4/src/ch4_init.c
Outdated
MPIR_Assert(MPIR_CVAR_CH4_NUM_VCIS >= 1); /* maximum number vcis can be reserved */
MPIR_Assert(MPIR_CVAR_CH4_RESERVE_VCIS >= 0); /* maximum number vcis can be reserved */
Should both of these lines have the same comment? I think the first line is illustrating something separate from the max number of vcis that can be reserved.
I updated the comment for MPIR_CVAR_CH4_NUM_VCIS. That was the original comment, but we have now changed its meaning to the number of implicit vcis rather than the maximum number of vcis.
I updated the commit to have |
63fc133 to f2f2305
**ch3nostream:Stream is not supported in ch3.
**ch4nostream:No streams available. Configure --enable-thread-cs=per-vci and --with-ch4-max-vcis=# to enable streams.
**outofstream:No streams available. Use MPIR_CVAR_CH4_RESERVE_VCIS to reserve the number of streams can be allocated.
nitpick: it's not obvious that vci is what is meant by "stream" in these error messages
What do you suggest?
I guess the issue is that this error message is meant for callers of MPIX_Stream_create? In that case it makes sense; it's just odd to try to allocate X and get an error message that there are no more Y.
I see. Right, the message is meant for MPIX_Stream_create. Only the lower layer has the details on what is wrong, so we have to craft the message there.
Add a cvar to set the number of vcis that the user can explicitly allocate using the to-be-added MPIX_Stream_create. The total number of runtime vcis will be the sum of MPIR_CVAR_CH4_NUM_VCIS and MPIR_CVAR_CH4_RESERVE_VCIS.
There will be two runtime vci pools. MPIDI_global.n_vcis will be the implicit hashing pool (we are keeping the name to avoid too much code churn). MPIDI_global.n_reserved_vcis will be the explicit vci pool, which can only be used by explicit allocation, e.g. MPIX_Stream_create.
Other than implicit vci hashing, the rest of the code path should not be aware of the distinction between implicit vcis and reserved vcis. Use n_total_vcis as the total number of available vcis.
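As a rough illustration of the two pools described in these commit messages, here is a minimal sketch. The struct and function names are hypothetical stand-ins, not the actual MPIDI_global definition, and the layout in which reserved vcis occupy the indices right after the implicit ones is an assumption made for the example.

```c
#include <assert.h>

/* Hypothetical stand-in for the vci counters kept in MPIDI_global. */
typedef struct {
    int n_vcis;          /* implicit hashing pool (MPIR_CVAR_CH4_NUM_VCIS) */
    int n_reserved_vcis; /* explicit pool (MPIR_CVAR_CH4_RESERVE_VCIS) */
    int n_total_vcis;    /* what the rest of the code path should use */
} sketch_vci_pools_t;

static sketch_vci_pools_t sketch_vci_pools_init(int num_vcis, int reserve_vcis)
{
    sketch_vci_pools_t p;

    assert(num_vcis >= 1);      /* at least one implicit vci */
    assert(reserve_vcis >= 0);  /* reserving none is allowed */

    p.n_vcis = num_vcis;
    p.n_reserved_vcis = reserve_vcis;
    p.n_total_vcis = num_vcis + reserve_vcis;
    return p;
}

/* ASSUMPTION: reserved vcis are numbered [n_vcis, n_total_vcis). */
static int sketch_is_reserved_vci(const sketch_vci_pools_t *p, int vci)
{
    return vci >= p->n_vcis && vci < p->n_total_vcis;
}
```

With 4 implicit and 2 reserved vcis, the total is 6 and, under the assumed layout, only vcis 4 and 5 are reserved.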
Initialize netmod to support MPIDI_global.n_total_vcis. Now that all netmods and shmmods support multiple vcis, it is simpler to move the mod logic into the ch4-layer hashing functions. A netmod can still add another mod, or simply overwrite the vci, if it doesn't support multiple vcis or supports fewer vcis. For now, we remove them for cleaner code. MPIR_CVAR_CH4_OFI_MAX_VNIS and MPIR_CVAR_CH4_UCX_MAX_VNIS are removed since we can't have arbitrary vnis anyway. Moving the mod into the hashing functions allows implementing the reserved vci logic.
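To make the "mod in the hashing function" idea concrete, here is a small sketch. Both functions are hypothetical, and `key` stands in for whatever value the real ch4 hashing functions combine; the point is only that the ch4 layer folds into the implicit pool, so implicit traffic never lands on a reserved vci, while a netmod with fewer vcis may fold again.

```c
#include <assert.h>

/* Hypothetical ch4-layer hash: fold the key into the implicit pool only.
 * Taking the mod against n_vcis (not n_total_vcis) keeps reserved vcis
 * out of reach of implicit hashing. */
static int sketch_ch4_implicit_vci(int key, int n_vcis)
{
    assert(key >= 0 && n_vcis >= 1);
    return key % n_vcis;
}

/* Hypothetical netmod hook: a netmod that supports fewer vcis applies
 * another mod; with netmod_max_vcis == 1 this overwrites the vci to 0. */
static int sketch_netmod_vci(int vci, int netmod_max_vcis)
{
    assert(vci >= 0 && netmod_max_vcis >= 1);
    return vci % netmod_max_vcis;
}
```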
This ADI allows the MPIR layer to request explicit vcis.
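For readers unfamiliar with the pattern, an explicit allocator over the reserved pool might look roughly like the sketch below. This is not the MPID_Allocate_vci implementation from this PR: the types, error codes, the release function, and the numbering of reserved vcis after the implicit ones are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_RESERVED 64

/* Hypothetical error codes; the real code reports MPI error classes. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_OUT_OF_STREAM = 1 };

typedef struct {
    int n_vcis;                        /* implicit pool size */
    int n_reserved_vcis;               /* explicit pool size */
    bool in_use[SKETCH_MAX_RESERVED];  /* one flag per reserved vci */
} sketch_vci_state_t;

/* Hand out the first free reserved vci, or fail when none is left. */
static int sketch_allocate_vci(sketch_vci_state_t *s, int *vci)
{
    assert(s->n_reserved_vcis <= SKETCH_MAX_RESERVED);
    *vci = 0;
    for (int i = 0; i < s->n_reserved_vcis; i++) {
        if (!s->in_use[i]) {
            s->in_use[i] = true;
            *vci = s->n_vcis + i;  /* ASSUMPTION: reserved follow implicit */
            return SKETCH_SUCCESS;
        }
    }
    return SKETCH_ERR_OUT_OF_STREAM;
}

/* Hypothetical counterpart that returns a vci to the reserved pool. */
static void sketch_release_vci(sketch_vci_state_t *s, int vci)
{
    int i = vci - s->n_vcis;
    assert(i >= 0 && i < s->n_reserved_vcis && s->in_use[i]);
    s->in_use[i] = false;
}
```

The exhaustion case is where an error such as **outofstream would be raised, which is why the message has to be crafted at this lower layer.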
Pull Request Description
Reserved vcis can be used to isolate communications. For example, stream-based progress should not involve the GPU path. Using a reserved vci that normal GPU traffic won't touch can ensure that. Similarly, we can use a reserved vci for dynamic process connections.
Since users always know their exact thread mapping, it may be simpler and more reliable to let the user specify the vci directly, rather than have us do the hashing implicitly. To allow such explicit vci extensions, we need a way to pass the vci information down to the device layer. Extending the attr parameter can achieve that.
[skip warnings]
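One way an integer attr parameter could carry a vci is by packing it into a bit field, as sketched below. The shift, mask, and helper names are hypothetical and chosen only for illustration; they are not the actual attr encoding used by MPICH.

```c
#include <assert.h>

/* Hypothetical layout: bits 8..15 of attr hold an explicit vci. */
#define SKETCH_ATTR_VCI_SHIFT 8
#define SKETCH_ATTR_VCI_MASK  0xff

/* Store vci in attr, leaving all other attr bits untouched. */
static int sketch_attr_set_vci(int attr, int vci)
{
    assert(vci >= 0 && vci <= SKETCH_ATTR_VCI_MASK);
    attr &= ~(SKETCH_ATTR_VCI_MASK << SKETCH_ATTR_VCI_SHIFT);
    return attr | (vci << SKETCH_ATTR_VCI_SHIFT);
}

/* Read the vci back; 0 means no explicit vci in this sketch. */
static int sketch_attr_get_vci(int attr)
{
    return (attr >> SKETCH_ATTR_VCI_SHIFT) & SKETCH_ATTR_VCI_MASK;
}
```

In this sketch a device that only does implicit hashing can simply ignore the field, which matches the expectation raised in the review discussion.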
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.