
accelerator/cuda: Add delayed initialization logic #11253

Merged
merged 2 commits into from
Jan 11, 2023

Conversation

wckzhang
Contributor

The current implementation requires the application to do cudaInit before calling MPI_Init. Added delayed initialization logic to wait as long as possible
before creating resources requiring a cuContext.

Signed-off-by: William Zhang [email protected]

@wckzhang
Contributor Author

I just added the delayed function call for the create functions since the other functions are dependent on those being called first.

@nysal
Member

nysal commented Jan 3, 2023

This might impact performance for MPI_THREAD_MULTIPLE. The lazy initialization code always takes a mutex in this case and there might be multiple calls to these accelerator functions in the fast path. An alternative is to implement something like the double checked locking pattern (https://en.wikipedia.org/wiki/Double-checked_locking). You still incur the overhead of an rmb and a branch, but that should be relatively less expensive.
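A minimal sketch of the double-checked locking shape described above, using C11 atomics and a pthread mutex. The flag, lock, and init function names here are illustrative stand-ins, not the actual OPAL symbols:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool init_complete = false;
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int init_calls = 0;                     /* counts how often init ran */

/* Stand-in for the expensive one-time setup (e.g. creating a context). */
static int do_expensive_init(void)
{
    init_calls++;
    return 0;
}

static int lazy_init(void)
{
    /* Fast path: one acquire load and a branch, no lock once initialized. */
    if (atomic_load_explicit(&init_complete, memory_order_acquire)) {
        return 0;
    }
    pthread_mutex_lock(&init_lock);
    /* Re-check under the lock: another thread may have initialized already. */
    if (!atomic_load_explicit(&init_complete, memory_order_relaxed)) {
        int rc = do_expensive_init();
        if (rc != 0) {
            pthread_mutex_unlock(&init_lock);
            return rc;
        }
        /* The release store pairs with the acquire load on the fast path. */
        atomic_store_explicit(&init_complete, true, memory_order_release);
    }
    pthread_mutex_unlock(&init_lock);
    return 0;
}
```

After the first successful call, every subsequent caller pays only the acquire load and the branch, which is the rmb-plus-branch cost mentioned above.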

@wckzhang
Contributor Author

wckzhang commented Jan 4, 2023

@nysal apparently it can be unsafe; I'm not sure how it can be unsafe, so I'm not sure whether to implement it. "The pattern, when implemented in some language/hardware combinations, can be unsafe. At times, it can be considered an anti-pattern.[2]"

@edgargabriel
Member

Can't you simply check for accelerator_cuda_init_complete before acquiring the lock, and again after? It is overkill until the condition becomes true, but for the vast majority of the runtime of an application it avoids having to acquire the lock on every function call.

Member

@edgargabriel edgargabriel left a comment


LGTM

@wckzhang
Contributor Author

wckzhang commented Jan 5, 2023

@edgargabriel that's what double-checked locking means, as far as I understand it. What I don't understand is the "unsafe" part that is described. I can add the checks, though.

@nysal
Member

nysal commented Jan 5, 2023

@nysal apparently it can be unsafe; I'm not sure how it can be unsafe, so I'm not sure whether to implement it. "The pattern, when implemented in some language/hardware combinations, can be unsafe. At times, it can be considered an anti-pattern.[2]"

It used to be considered broken, as there was no portable way to specify memory ordering rules in C/C++ until the C11/C++11 standards. So you'll still find a lot of old articles that say it's broken. Reference - https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/

This pattern is used in glibc to implement pthread_once()/std::call_once() - https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/nptl/pthread_once.c#L135

If we want to do this as a follow-up optimization I can take a look at it. However, the current code will not perform well on large systems where atomics have a fairly significant overhead.
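For comparison, pthread_once() provides the same one-shot semantics without hand-rolled atomics. A sketch with illustrative names:

```c
#include <pthread.h>

static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static int init_status = -1;                  /* illustrative result slot */

/* Runs at most once, even with concurrent callers. */
static void init_routine(void)
{
    init_status = 0;                          /* stand-in for real init work */
}

static int lazy_init(void)
{
    pthread_once(&init_once, init_routine);   /* cheap after the first call */
    return init_status;
}
```

One caveat: the init routine returns void, so error reporting needs a side channel like the status variable above, which is one reason projects often roll their own lazy-init logic instead.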

@edgargabriel
Member

In my opinion, if that pattern is considered 'unsafe', we most likely have numerous other places in the code that would fall into the same category.

@wckzhang
Contributor Author

wckzhang commented Jan 5, 2023

Updated with double checked locking

@janjust
Contributor

janjust commented Jan 6, 2023

Is this PR ready to go?

Comment on lines 374 to 377:

    result = opal_accelerator_cuda_delayed_init();
    if (0 != result) {
        return result;
    }

Member

I think it might be worth wrapping some of the error checks in OPAL_UNLIKELY(). I did this locally for a few of these and saw a small performance boost in some OMB benchmarks.

Suggested change:

    result = opal_accelerator_cuda_delayed_init();
    if (OPAL_UNLIKELY(0 != result)) {
        return result;
    }

Contributor Author

I wrapped the ones in this PR; I'll maybe add a separate PR wrapping the other error paths.
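OPAL_UNLIKELY is a compiler branch hint. A simplified sketch of how such a macro is commonly defined on GCC/Clang and used on an error path (this is an illustration with stand-in names, not the exact OPAL definition):

```c
/* Hint that the expression is almost always false; plain no-op elsewhere. */
#if defined(__GNUC__)
#define UNLIKELY(expr) __builtin_expect(!!(expr), 0)
#else
#define UNLIKELY(expr) (expr)
#endif

/* Stand-in for the delayed-init call on the hot path. */
static int delayed_init_stub(void)
{
    return 0;
}

static int fast_path(void)
{
    int result = delayed_init_stub();
    if (UNLIKELY(0 != result)) {   /* error handling kept off the hot path */
        return result;
    }
    return 0;
}
```

The hint lets the compiler lay out the error branch out of line, so the common success case stays in the straight-line instruction stream.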

Member

@nysal nysal left a comment


LGTM

    {
        int retval, i, j;
        CUresult result;
        CUresult result = OPAL_SUCCESS;

Contributor

Wrong type (CUresult is an enum that we don't control and we shouldn't mix enum and non-enum values)

Suggested change:

    CUresult result = OPAL_SUCCESS;
    int result = OPAL_SUCCESS;

Contributor Author

Done


    static opal_accelerator_base_module_t* accelerator_cuda_init(void)
    {
        int retval, i, j;

Contributor

Unused variables?

Contributor Author

Removed

Contributor

@devreal devreal left a comment


@wckzhang There are a bunch more instances where the types don't match. It also seems that the existing code returned CUDA error codes as OPAL error codes (instead of OPAL_ERROR) in some instances. This PR didn't touch those but it seems wrong...

The current implementation requires the application to
do cudaInit before calling MPI_Init. Added delayed
initialization logic to wait as long as possible
before creating resources requiring a cuContext.

Signed-off-by: William Zhang <[email protected]>
@wckzhang
Contributor Author

I also added a separate commit to change the CUresult returns to opal error codes

Contributor

@devreal devreal left a comment


One variable seems to be missing a definition but otherwise LGTM
