accelerator/cuda: Add delayed initialization logic #11253
Conversation
I just added the delayed call for the create functions, since the other functions depend on those being called first.

This might impact performance for MPI_THREAD_MULTIPLE: the lazy initialization code always takes a mutex in that case, and there might be multiple calls to these accelerator functions in the fast path. An alternative is to implement something like the double-checked locking pattern (https://en.wikipedia.org/wiki/Double-checked_locking). You still incur the overhead of an rmb and a branch, but that should be relatively less expensive.

@nysal apparently it can be unsafe, though I'm not sure how, so I'm not sure whether to implement it. "The pattern, when implemented in some language/hardware combinations, can be unsafe. At times, it can be considered an anti-pattern.[2]"

Can't you simply check accelerator_cuda_init_complete before acquiring the lock, and again after? It is overkill until the condition becomes true, but for the vast majority of an application's runtime it avoids acquiring the lock on every function call.
LGTM
@edgargabriel that's what double-checked locking means, as far as I can understand. What I don't understand is the "unsafe" part that is described. I can add the checks though.
It used to be considered broken, as there was no portable way to specify memory ordering rules in C/C++ until the C11/C++11 standards, so you'll still find a lot of old articles that say it's broken. Reference: https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/ This pattern is used in glibc to implement pthread_once()/std::call_once(): https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/nptl/pthread_once.c#L135 If we want to do this as a follow-up optimization I can take a look at it. However, the current code will not perform well on large systems where atomics have a fairly significant overhead.
In my opinion, if that pattern is considered 'unsafe', we most likely have numerous other places in the code that would also fall into the same category.
Updated with double-checked locking.

Is this PR ready to go?
```c
result = opal_accelerator_cuda_delayed_init();
if (0 != result) {
    return result;
}
```
I think it might be worth wrapping some of the error checks in OPAL_UNLIKELY(). I did this locally for a few of these and saw a small performance boost in some OMB benchmarks.
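A sketch of what such a branch-hint macro does, assuming it expands to GCC's `__builtin_expect` on supporting compilers (the `MY_UNLIKELY` name and the stub function are illustrative, not Open MPI's actual definitions):

```c
/* Illustrative macro; Open MPI's OPAL_UNLIKELY is assumed to behave
 * similarly, falling back to a plain expression elsewhere. */
#if defined(__GNUC__)
#define MY_UNLIKELY(expr) __builtin_expect(!!(expr), 0)
#else
#define MY_UNLIKELY(expr) (expr)
#endif

/* Hypothetical stand-in for the delayed-init call. */
static int delayed_init_stub(void)
{
    return 0;
}

int guarded_call(void)
{
    int result = delayed_init_stub();
    /* Hints the compiler that the error return is cold, so it can lay
     * out the hot path without a taken branch. */
    if (MY_UNLIKELY(0 != result)) {
        return result;
    }
    return 0;
}
```

The hint doesn't change behavior, only code layout, which is why the observed win in benchmarks is small but measurable on hot paths.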
Suggested change:

```diff
 result = opal_accelerator_cuda_delayed_init();
-if (0 != result) {
+if (OPAL_UNLIKELY(0 != result)) {
     return result;
 }
```
I wrapped the ones in this PR; I may add a separate PR wrapping the other error paths.
LGTM
```diff
 {
     int retval, i, j;
-    CUresult result;
+    CUresult result = OPAL_SUCCESS;
```
Wrong type: CUresult is an enum that we don't control, and we shouldn't mix enum and non-enum values.

Suggested change:

```diff
-CUresult result = OPAL_SUCCESS;
+int result = OPAL_SUCCESS;
```
Done
```c
static opal_accelerator_base_module_t* accelerator_cuda_init(void)
{
    int retval, i, j;
```
Unused variables?
Removed
@wckzhang There are a bunch more instances where the types don't match. It also seems that the existing code returned CUDA error codes as OPAL error codes (instead of OPAL_ERROR) in some instances. This PR didn't touch those, but it seems wrong...
The current implementation requires the application to do cudaInit before calling MPI_Init. Added delayed initialization logic to wait as long as possible before creating resources requiring a cuContext. Signed-off-by: William Zhang <[email protected]>

Force-pushed from 43cfe89 to 480e029.
I also added a separate commit to change the CUresult returns to OPAL error codes.
One variable seems to be missing a definition but otherwise LGTM
Signed-off-by: William Zhang <[email protected]>

Force-pushed from 480e029 to 48ae44b.