
accelerator/cuda: Add delayed initialization logic #11253

Merged
merged 2 commits into from
Jan 11, 2023

Conversation

wckzhang
Contributor

The current implementation requires the application to do cudaInit before calling MPI_Init. Added delayed initialization logic to wait as long as possible
before creating resources requiring a cuContext.

Signed-off-by: William Zhang [email protected]

@wckzhang
Contributor Author

I just added the delayed function call for the create functions since the other functions are dependent on those being called first.

@nysal
Member

nysal commented Jan 3, 2023

This might impact performance for MPI_THREAD_MULTIPLE. The lazy initialization code always takes a mutex in this case and there might be multiple calls to these accelerator functions in the fast path. An alternative is to implement something like the double checked locking pattern (https://en.wikipedia.org/wiki/Double-checked_locking). You still incur the overhead of an rmb and a branch, but that should be relatively less expensive.
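A minimal sketch of the double-checked locking shape described above, using C11 atomics and a pthread mutex. The flag, lock, and init function names here are illustrative stand-ins, not the actual OPAL symbols:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool init_complete = false;
static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int init_calls = 0;                     /* counts how often init ran */

/* Stand-in for the expensive one-time setup (e.g. creating a context). */
static int do_expensive_init(void)
{
    init_calls++;
    return 0;
}

static int lazy_init(void)
{
    /* Fast path: one acquire load and a branch, no lock once initialized. */
    if (atomic_load_explicit(&init_complete, memory_order_acquire)) {
        return 0;
    }
    pthread_mutex_lock(&init_lock);
    /* Re-check under the lock: another thread may have initialized already. */
    if (!atomic_load_explicit(&init_complete, memory_order_relaxed)) {
        int rc = do_expensive_init();
        if (rc != 0) {
            pthread_mutex_unlock(&init_lock);
            return rc;
        }
        /* The release store pairs with the acquire load on the fast path. */
        atomic_store_explicit(&init_complete, true, memory_order_release);
    }
    pthread_mutex_unlock(&init_lock);
    return 0;
}
```

After the first successful call, every subsequent caller pays only the acquire load and the branch, which is the rmb-plus-branch cost mentioned above.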

@wckzhang
Contributor Author

wckzhang commented Jan 4, 2023

@nysal apparently it can be unsafe; I'm not sure how it can be unsafe, so I'm not sure whether to implement it. "The pattern, when implemented in some language/hardware combinations, can be unsafe. At times, it can be considered an anti-pattern.[2]"

@edgargabriel
Member

Can't you simply check for accelerator_cuda_init_complete before acquiring the lock, and again after? It is overkill until the condition becomes true, but for the vast majority of the runtime of an application it avoids having to acquire the lock on every function call.

Member

@edgargabriel edgargabriel left a comment


LGTM

@wckzhang
Contributor Author

wckzhang commented Jan 5, 2023

@edgargabriel that's what double-checked locking means, as far as I understand it. What I don't understand is the "unsafe" part that is described. I can add the checks, though.

@nysal
Member

nysal commented Jan 5, 2023

@nysal apparently it can be unsafe; I'm not sure how it can be unsafe, so I'm not sure whether to implement it. "The pattern, when implemented in some language/hardware combinations, can be unsafe. At times, it can be considered an anti-pattern.[2]"

It used to be considered broken, as there was no portable way to specify memory ordering rules in C/C++ until the C11/C++11 standards. So you'll still find a lot of old articles that say it's broken. Reference - https://preshing.com/20130930/double-checked-locking-is-fixed-in-cpp11/

This pattern is used in glibc to implement pthread_once()/std::call_once() - https://github.com/bminor/glibc/blob/b92a49359f33a461db080a33940d73f47c756126/nptl/pthread_once.c#L135

If we want to do this as a follow-up optimization I can take a look at it. However, the current code will not perform well on large systems where atomics have a fairly significant overhead.
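For comparison, pthread_once() provides the same one-shot semantics without hand-rolled atomics. A sketch with illustrative names:

```c
#include <pthread.h>

static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static int init_status = -1;                  /* illustrative result slot */

/* Runs at most once, even with concurrent callers. */
static void init_routine(void)
{
    init_status = 0;                          /* stand-in for real init work */
}

static int lazy_init(void)
{
    pthread_once(&init_once, init_routine);   /* cheap after the first call */
    return init_status;
}
```

One caveat: the init routine returns void, so error reporting needs a side channel like the status variable above, which is one reason projects often roll their own lazy-init logic instead.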

@edgargabriel
Member

In my opinion, if that pattern is considered 'unsafe', we most likely have numerous other places in the code that would fall into the same category.

@wckzhang
Contributor Author

wckzhang commented Jan 5, 2023

Updated with double checked locking

@janjust
Contributor

janjust commented Jan 6, 2023

Is this PR ready to go?

Comment on lines 374 to 377:

    result = opal_accelerator_cuda_delayed_init();
    if (0 != result) {
        return result;
    }

Member

I think it might be worth wrapping some of the error checks in OPAL_UNLIKELY(). I did this locally for a few of these and saw a small performance boost in some OMB benchmarks.

Suggested change:

    result = opal_accelerator_cuda_delayed_init();
    if (OPAL_UNLIKELY(0 != result)) {
        return result;
    }

Contributor Author

I wrapped the ones in this PR; I'll maybe add a separate PR wrapping the other error paths.
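OPAL_UNLIKELY is a compiler branch hint. A simplified sketch of how such a macro is commonly defined on GCC/Clang and used on an error path (this is an illustration with stand-in names, not the exact OPAL definition):

```c
/* Hint that the expression is almost always false; plain no-op elsewhere. */
#if defined(__GNUC__)
#define UNLIKELY(expr) __builtin_expect(!!(expr), 0)
#else
#define UNLIKELY(expr) (expr)
#endif

/* Stand-in for the delayed-init call on the hot path. */
static int delayed_init_stub(void)
{
    return 0;
}

static int fast_path(void)
{
    int result = delayed_init_stub();
    if (UNLIKELY(0 != result)) {   /* error handling kept off the hot path */
        return result;
    }
    return 0;
}
```

The hint lets the compiler lay out the error branch out of line, so the common success case stays in the straight-line instruction stream.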

Member

@nysal nysal left a comment


LGTM

    {
        int retval, i, j;
        CUresult result;
        CUresult result = OPAL_SUCCESS;

Contributor

Wrong type (CUresult is an enum that we don't control and we shouldn't mix enum and non-enum values)

Suggested change:

    CUresult result = OPAL_SUCCESS;
    int result = OPAL_SUCCESS;

Contributor Author

Done


    static opal_accelerator_base_module_t* accelerator_cuda_init(void)
    {
        int retval, i, j;

Contributor

Unused variables?

Contributor Author

Removed

Contributor

@devreal devreal left a comment


@wckzhang There are a bunch more instances where the types don't match. It also seems that the existing code returned CUDA error codes as OPAL error codes (instead of OPAL_ERROR) in some instances. This PR didn't touch those but it seems wrong...

The current implementation requires the application to
do cudaInit before calling MPI_Init. Added delayed
initialization logic to wait as long as possible
before creating resources requiring a cuContext.

Signed-off-by: William Zhang <[email protected]>
@wckzhang
Contributor Author

I also added a separate commit to change the CUresult returns to opal error codes

Contributor

@devreal devreal left a comment


One variable seems to be missing a definition but otherwise LGTM
