ch4/ofi: Lazily register FI_MULTI_RECV buffers #6422
test:mpich/ch4/ofi
TODO: move CVAR for tmp buf registration into
@raffenet Can we run a GPU test with the CVAR set to disable host registration?
Actually, the CVAR disables registration by default. I can add a dummy commit to re-enable registration and re-run the tests, if desired.
I see. I was hoping some of the GPU test failures could be addressed by not registering the buffer, but no luck. Can you confirm that we fixed the GPU memory issue? Since registration is now lazy, I think we can leave the CVAR on by default, right?
Yes, I think we can default it to on. I'll double-check with a hello world program and confirm we don't take up any resources.
Avoid consuming GPU resources during initialization by using regular malloc for FI_MULTI_RECV buffers. It may be possible to register the buffers later if we detect they are being used to copy data to the GPU.
Rather than allocating a bunch of buffers, use one big buffer with offsets.
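For readers unfamiliar with the pattern, the single-buffer layout can be sketched as below. This is an illustration only, not the code from this PR: the type `multi_recv_pool_t`, the helper names, and the `NUM_SLOTS` and `SLOT_SIZE` values are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sizes for illustration; the real values are not stated in this PR. */
#define NUM_SLOTS 4
#define SLOT_SIZE (2 * 1024 * 1024)

typedef struct {
    char *base;         /* one malloc'd region backing every slot */
    size_t slot_size;   /* bytes per slot */
    int num_slots;
} multi_recv_pool_t;

/* Allocate one contiguous region instead of num_slots separate buffers.
 * Returns 0 on success, -1 if the allocation fails. */
static int pool_init(multi_recv_pool_t *pool, int num_slots, size_t slot_size)
{
    pool->base = malloc((size_t) num_slots * slot_size);
    if (pool->base == NULL)
        return -1;
    pool->slot_size = slot_size;
    pool->num_slots = num_slots;
    return 0;
}

/* Slot i lives at a fixed offset from the base pointer. */
static void *pool_slot(const multi_recv_pool_t *pool, int i)
{
    assert(i >= 0 && i < pool->num_slots);
    return pool->base + (size_t) i * pool->slot_size;
}

/* A single free releases every slot at once. */
static void pool_free(multi_recv_pool_t *pool)
{
    free(pool->base);
    pool->base = NULL;
}
```

With this layout, a later registration step would only need to cover one address range (`base`, `num_slots * slot_size`) rather than one range per buffer.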
MPIR_gpu_register_host is used to register buffers on the host with the GPU. Use a single CVAR to control buffer registration instead of scattering the checks across various parts of the code.
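The idea of a single control point can be sketched as follows. This is a simplified stand-in, not MPICH code: `cvar_register_host`, `backend_register_host`, and `gpu_register_host` are hypothetical names, and the backend call is stubbed out so the sketch runs without a GPU runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the CVAR; the real variable name is not given here. */
static bool cvar_register_host = true;

/* Counts backend calls so the behavior can be checked without a GPU. */
static int backend_register_calls = 0;

/* Stub for the GPU runtime's host-registration call (assumed to return 0 on success). */
static int backend_register_host(const void *ptr, size_t size)
{
    (void) ptr;
    (void) size;
    backend_register_calls++;
    return 0;
}

/* Single choke point: callers never test the CVAR themselves. */
static int gpu_register_host(const void *ptr, size_t size)
{
    assert(ptr != NULL);
    if (!cvar_register_host)
        return 0;       /* registration disabled: report success, do nothing */
    return backend_register_host(ptr, size);
}
```

Because every caller goes through the wrapper, flipping the one variable changes behavior everywhere, and no call site needs its own conditional.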
Test results are consistent with registration turned back on. The outstanding question is: do we want an additional switch in ch4/ofi to disable registration of
I see. You mean the case where the provider registers the multi-recv buffer itself? I tend to believe that CUDA or any other GPU runtime will cache the address, so an additional registration should be a no-op. In any case, I suggest we not worry about such cases until they actually arise, and make a decision then.
LGTM
Works for me.
Pull Request Description
Avoid consuming GPU resources during initialization by using regular malloc for FI_MULTI_RECV buffers. Do lazy registration when the first GPU communication is detected.
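A minimal sketch of the lazy-registration pattern described above, assuming a per-buffer flag that is set the first time a GPU transfer is seen. The names `recv_buf_t`, `recv_buf_prepare`, and `register_host` are hypothetical, and the registration call is a stub so the example runs without a GPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    void *buf;          /* plain host memory from malloc */
    size_t size;
    bool registered;    /* false until the first GPU transfer is seen */
} recv_buf_t;

/* Counts registrations so the laziness can be observed. */
static int register_calls = 0;

/* Stub for the host-registration call (assumed to return 0 on success). */
static int register_host(const void *ptr, size_t size)
{
    (void) ptr;
    (void) size;
    register_calls++;
    return 0;
}

/* Initialization touches no GPU resources: it only calls malloc. */
static int recv_buf_init(recv_buf_t *rb, size_t size)
{
    rb->buf = malloc(size);
    if (rb->buf == NULL)
        return -1;
    rb->size = size;
    rb->registered = false;
    return 0;
}

/* Called before data is copied out of the buffer. Registers at most once,
 * and only when the destination is GPU memory. */
static int recv_buf_prepare(recv_buf_t *rb, bool is_gpu_transfer)
{
    if (is_gpu_transfer && !rb->registered) {
        int rc = register_host(rb->buf, rb->size);
        if (rc != 0)
            return rc;
        rb->registered = true;
    }
    return 0;
}

static void recv_buf_free(recv_buf_t *rb)
{
    free(rb->buf);
    rb->buf = NULL;
    rb->registered = false;
}
```

In this sketch a host-only program never reaches `register_host`, which matches the goal of a hello world run consuming no GPU resources. A real implementation would presumably also need to unregister the buffer before freeing it; that step is omitted here.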
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.