
How To Use MPIX_Stream? #6517

Closed
Jacobfaib opened this issue May 12, 2023 · 14 comments


Jacobfaib commented May 12, 2023

I am a developer for PETSc (down the hall), and am working on integrating an experimental GPU stream-aware MPI layer. We are using MPICH’s experimental MPIX_Stream extension, but are running into some difficulties and have some questions.

For reference, the relevant PETSc code for this new feature can be found in the function here

I have summarized the biggest ones below, but I am sure there will be many more...

Just to refresh everyone's memory, you must do

MPI_Info    info;
MPIX_Stream mpi_stream;
MPI_Comm    scomm;

MPI_Info_create(&info);
MPI_Info_set(info, ...);
MPIX_Info_set_hex(info, ...);
MPIX_Stream_create(info, &mpi_stream);
MPI_Info_free(&info);

MPIX_Stream_comm_create(src_comm, mpi_stream, &scomm);
// scomm can now be used

———

  1. Lifetimes

1.1. What is the lifetime of the MPIX_Stream vis-a-vis the stream comm (scomm)? Can it outlive scomm? If not, in what order should we destroy mpi_stream and scomm?

1.2. What is the lifetime of the MPIX_Stream vis-a-vis the underlying device stream? I assume that the MPIX_Stream must be destroyed immediately before the CUDA stream is destroyed.

1.3. Do we need a brand new scomm for each cudaStream_t - MPI_Comm pairing? Reading through MPIR_Stream_comm_create_impl() this appears to be the case.

1.4. Following on from 1.3, can we reuse the same MPIX_Stream to create multiple separate stream comms (different src_comms, but the same underlying CUDA stream)?

1.5. Following on again from 1.3, is there any way to turn a non-stream communicator into a stream comm in place? I.e. something like MPIX_Set_stream(PETSC_COMM_WORLD, some_mpix_stream)?

———

  2. I have run into this error:
$ mpiexec -n 2 ./my_app
...
MPI error 940767503 Other MPI error, error stack:
internalX_Stream_create(72686): MPIX_Stream_create(info=0x9c000000, stream=0x55c2ab8608b8) failed
MPIR_Stream_create_impl(206)..:
MPID_Allocate_vci(590)........: No streams available. Use MPIR_CVAR_CH4_RESERVE_VCIS to reserve the number of streams can be allocated.

Which I dutifully rectify by doing

$ MPIR_CVAR_CH4_RESERVE_VCIS=8675309 mpiexec -n 2 ./my_app

But this seems clunky, and leads me to believe I have not done some other necessary setup. Why the environment variable? To me this implies some one-time setup, likely in MPI_Init(). Surely there is a way to dynamically allocate more streams while the application is running?

Jacobfaib (Author) commented:

cc @jczhang07

hzhou (Contributor) commented May 12, 2023

1.1 The MPIX_Stream lives before and after the stream communicators that are using it (see the ordering sketch at the end of this comment).
1.2 The CUDA stream lives before and after the MPIX_Stream that binds to it.
1.3 Yes.
1.4 Yes.
1.5 An MPIX_Stream is a local object, just like a CUDA stream. The stream communicator, on the other hand, is a collective object, so the proposed MPIX_Set_stream would need to be collective. Convince us that this function is necessary and we can add it.

2. Each MPIX_Stream is associated with a distinct internal network endpoint to isolate communication resources. The network endpoint needs to be initialized to establish connections, and it is the connection-establishment part that prevents dynamic allocation.
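
To sketch the implied ordering (purely illustrative; info, src_comm, and the stream binding are as in the snippet at the top of this issue):

cudaStream_t cuda_stream;
MPIX_Stream  mpi_stream;
MPI_Comm     scomm;

cudaStreamCreate(&cuda_stream);                         /* the device stream exists first */
/* ... create info and bind cuda_stream via MPIX_Info_set_hex ... */
MPIX_Stream_create(info, &mpi_stream);                  /* MPIX_Stream bound to cuda_stream */
MPIX_Stream_comm_create(src_comm, mpi_stream, &scomm);

/* ... use scomm ... */

MPI_Comm_free(&scomm);                                  /* free the stream comm first */
MPIX_Stream_free(&mpi_stream);                          /* then the MPIX_Stream */
cudaStreamDestroy(cuda_stream);                         /* and the CUDA stream last */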

Jacobfaib (Author) commented:

The MPIX_Stream lives before and after the stream communicators that are using it.

To be precise, the MPIX_Stream must outlive the comm, or can outlive the comm?

An MPIX_Stream is a local object, just like a CUDA stream. The stream communicator, on the other hand, is a collective object, so the proposed MPIX_Set_stream would need to be collective. Convince us that this function is necessary and we can add it.

It's not so much necessary as it is a convenience :). Consider the following use-case:

PetscErrorCode VecDoSomethingCollective(Vec v, ...)
{
  MPI_Comm comm;
  MPI_Comm scomm;
  MPIX_Stream mpi_stream;

  // every PETSc object (e.g. Vec) holds a corresponding communicator
  PetscObjectGetComm(v, &comm);
  PetscGetMPIXStream(cuda_stream, &mpi_stream);
  MPIX_Stream_comm_create(comm, mpi_stream, &scomm);
  MPIX_Allreduce_enqueue(..., scomm);
  // Performant code probably should now cache scomm somewhere, likely needs reference counting to ensure
  // it is properly managed. This adds more complexity
}

We already have a comm here, and we are about to do a collective communication. Naively, there is no reason why we need what is essentially a copy of comm in scomm. Allowing users to convert in place is more ergonomic:

PetscErrorCode VecNorm_CUDA(Vec v, ..., PetscScalar *norm)
{
  MPI_Comm     comm;
  MPIX_Stream mpi_stream;

  PetscObjectGetComm(v, &comm);
  PetscGetMPIXStream(cuda_stream, &mpi_stream);
  MPIX_Comm_set_stream(comm, mpi_stream);
  MPIX_Allreduce_enqueue(..., comm);
  // no need to deal with extra comm object!
}

MPIX_Comm_set_stream() should of course be idempotent (if not, there should be a way of testing whether the current stream is the same).
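
For illustration, a hypothetical companion query (neither function exists today; this is only to show the desired usage):

MPIX_Stream current;

MPIX_Comm_get_stream(comm, &current);        /* hypothetical query */
if (current != mpi_stream)                   /* only (re)set when the stream differs */
  MPIX_Comm_set_stream(comm, mpi_stream);    /* the function proposed above */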

It is the connection-establishment part that prevents dynamic allocation

Can you explain this limitation more directly? I am not at all familiar with networking.

Jacobfaib (Author) commented:

Also, is there an invalid MPIX_Stream equivalent a la MPI_COMM_NULL? MPIX_STREAM_NULL exists but appears to be a valid object. We would like to check whether an MPIX_Stream has been previously created. Currently we use an extra variable to hold this state:

struct PetscMPIXStream 
{
  bool        init{};
  MPIX_Stream stream{};
};

void foo(PetscMPIXStream &strm) 
{
  if (!strm.init) {
    MPIX_Stream_create(..., &strm.stream);
    strm.init = true;
  }
  // use strm.stream...
}

Ideally we would want something like

void foo(MPIX_Stream &strm)
{
  if (strm == MPIX_STREAM_INVALID) { // for example
    MPIX_Stream_create(..., &strm);
  }
  // use strm...
}

hzhou (Contributor) commented May 12, 2023

  • Yes. The MPIX_Stream must outlive the stream communicator. Actually, we reference count the MPIX_Stream, so users may free the MPIX_Stream before freeing the communicator; internally the stream still outlives the communicator.

  • MPIX_Comm_set_stream will be as expensive as MPIX_Stream_comm_create. The reason is that a communicator is a collective object and the stream is tied to a network endpoint. For the stream communicator to work, it also needs to know the remote target endpoints.

PETSc currently caches an internal duplicated comm, and I think it will need to cache a separate stream communicator. After all, the regular comm and the stream comm are used very differently, and it is cleaner not to mix them up.

  • Each new dynamically added endpoint needs to make connections to all existing remote endpoints. I guess it is not impossible, but very difficult to do without making the existing code significantly more complicated.

  • We can add MPIX_STREAM_INVALID

Jacobfaib (Author) commented May 12, 2023

MPIX_Comm_set_stream will be as expensive as MPIX_Stream_comm_create... After all, the regular comm and stream comm are used very differently and it is cleaner not to mix them up.

Hmmm, this implies you cannot use a stream comm in place of a regular comm? I.e.

MPIX_Stream_comm_create(..., &scomm);
MPI_Allreduce(..., scomm);

PETSc currently caches an internal duplicated comm, and I think it will need to cache a separate stream communicator.

Indeed. The code snippet linked in the description stores all created objects in a

std::unordered_map<cudaStreamId_t, std::unordered_map<MPI_Comm, MPI_Comm>> stream_comm_cache;

MPI_Comm scomm = stream_comm_cache[cuda_stream.get_id()][src_comm];

Each new dynamically added endpoint needs to make connections to all existing remote endpoints. I guess it is not impossible, but very difficult to do without making the existing code significantly more complicated.

What should the value of MPIR_CVAR_CH4_RESERVE_VCIS be then? The number of distinct cudaStream_t created? Or the number of cudaStream_t - MPI_Comm combinations? Either of these is difficult to predict. The CUDA (and HIP) stream model is explicitly fire-and-forget, and PETSc adopts this model as well: users are free to create and destroy many streams, so you cannot know upfront how many streams will be created.

Suppose however that we have decided on a value for it. How should external code (i.e. PETSc) go about setting this variable? It leaks the abstraction to ask users to set MPIR_CVAR_CH4_RESERVE_VCIS=whatever themselves before running their code. Something like

setenv("MPIR_CVAR_CH4_RESERVE_VCIS", "whatever");
MPI_Init(&argc, &argv);

is also unsavory: it means that PETSc must be the one to initialize MPI. We currently allow users to initialize it themselves prior to PetscInitialize(); how can this be squared?
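
The best we can do on our side is something along these lines (a sketch, guarded on MPI_Initialized(); the value is just a placeholder), and even that only helps when PETSc happens to be the one initializing MPI:

int initialized;

MPI_Initialized(&initialized);
if (!initialized) {
  /* has no effect if the user already called MPI_Init() themselves */
  setenv("MPIR_CVAR_CH4_RESERVE_VCIS", "whatever", 1);
  MPI_Init(&argc, &argv);
}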

hzhou (Contributor) commented May 12, 2023

Hmmm, this implies you cannot use a stream comm in place of a regular comm? I.e.

That is correct. Actually, there are regular stream comms with regular streams (not CUDA streams); those can be used in place of a regular comm. The stream comms that are attached to CUDA streams have different semantics, i.e. all operations need to be asynchronously enqueued onto the CUDA stream, so they are not interchangeable with regular comms.
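
For illustration (buffer names are placeholders):

/* regular comm: blocking, host-side semantics */
MPI_Allreduce(MPI_IN_PLACE, host_buf, 1, MPI_DOUBLE, MPI_SUM, comm);

/* CUDA-stream-backed stream comm: the operation is enqueued onto the attached
   CUDA stream and completes asynchronously with respect to the host */
MPIX_Allreduce_enqueue(MPI_IN_PLACE, device_buf, 1, MPI_DOUBLE, MPI_SUM, scomm);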

What should the value of MPIR_CVAR_CH4_RESERVE_VCIS be then?

You only need 1 reserved VCI for all GPU-stream-backed MPIX_Streams.

unsavory

We are well aware of the inconvenience, and I think it is possible to make it a bit more implicit or dynamic. First we are focusing on usability and functionality; then we will address the convenience parts, especially those that require more effort.

Jacobfaib (Author) commented:

You only need 1 reserved VCI for all GPU-stream-backed MPIX_Streams.

Ah, that is much easier to handle then!

We are well aware of the inconvenience, and I think it is possible to make it a bit more implicit or dynamic. First we are focusing on usability and functionality; then we will address the convenience parts, especially those that require more effort.

Of course. On this note, we would be more than happy to collaborate closely to help iron out any kinks. I am sure a good stress test of MPIX_Stream and friends would be useful to you (not to mention helpful for convincing the MPI Forum to adopt it ;)).

Jacobfaib (Author) commented May 13, 2023

I am getting

Assertion failed in file src/mpi/stream/stream_enqueue.c at line 467: enqueue_req && enqueue_req->kind == MPIR_REQUEST_KIND__ENQUEUE

What does this error indicate? I am doing (roughly):

MPIX_Irecv_enqueue(...);
...
MPIX_Waitall_enqueue(...); // error fires here

(@jczhang07 this error is firing from PetscSFBcastEnd() -> PetscSFLinkWaitEnqueue_MPIX_Stream() -> MPIX_Waitall_enqueue(), perhaps you have an idea?)

Note that I do not get this error if I call cudaDeviceSynchronize() between the Irecv and the Waitall (or in SF parlance, -sf_unknown_input_stream).

hzhou (Contributor) commented May 15, 2023

In MPIX_Waitall_enqueue, all requests have to be enqueued requests. The assertion failure is due to one of the requests being a normal (non-enqueued) request.
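
For example (a sketch; rbuf/sbuf/src/dst/tag are placeholders):

MPI_Request reqs[2];

/* fine: both requests come from _enqueue operations on the stream comm */
MPIX_Irecv_enqueue(rbuf, n, MPI_DOUBLE, src, tag, scomm, &reqs[0]);
MPIX_Isend_enqueue(sbuf, n, MPI_DOUBLE, dst, tag, scomm, &reqs[1]);
MPIX_Waitall_enqueue(2, reqs, MPI_STATUSES_IGNORE);

/* not fine: a request produced by a plain MPI_Irecv/MPI_Isend is a normal
   (non-enqueued) request, and passing it to MPIX_Waitall_enqueue trips
   the assertion above */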

Jacobfaib (Author) commented May 15, 2023

@hzhou

TL;DR: Performance degradation due to mutex/condition-variable thrashing after a fixed time period. 2 MPI ranks, the communication pattern is regular (does not change), and everything is async. There is no performance degradation when using regular MPI calls.

Any ideas on what the culprit might be? Unclear whether MPICH or CUDA runtime is to blame.


Consider:

for (int i = 0; i < 100; ++i) {
  VecScaleAsync(x, ...); // effectively cublasDscal(...);
  MPIX_Allreduce_enqueue(MPI_IN_PLACE, device_ptr, 1, MPIU_SCALAR, MPIU_SUM, scomm);
}

For the first few iterations both ranks chug along normally, but eventually hit some trigger point which causes the kernel times to balloon. You can see the thrashing in the screenshot below, since we suddenly get massive kernel times (blue blocks) where previously they were much smaller. Note that VecScaleAsync() performs no additional communication or device synchronization.
[screenshot: profiler timeline showing kernel launch times ballooning after the trigger point]


Furthermore, if we modify the loop to

for (int i = 0; i < 100; ++i) {
  MPIX_Allreduce_enqueue(MPI_IN_PLACE, device_ptr, 1, MPIU_SCALAR, MPIU_SUM, scomm);
}

we also reproduce a noticeable slowdown, this time manifesting in cudaMemcpyAsync() (the thick red blocks, where previously they were slim and hidden among the grey).
[screenshot: profiler timeline showing the cudaMemcpyAsync() slowdown]

Do you have any ideas on what the culprit might be? The two ranks appear to be repeatedly thrashing a shared condition variable. It is unclear whether this is the global CUDA context mutex or some synchronization variable inside MPICH, or why this behavior emerges when it does.

reproducer.zip

hzhou (Contributor) commented May 15, 2023

The default is to use cudaLaunchHostFunc to invoke MPI communication/progress on the host side. I believe cudaLaunchHostFunc is not optimized and likely involves heavy synchronization, but it is relatively robust. The next thing to try is the workQ implementation, which launches a background thread and directly synchronizes with kernels via atomic variables -- see #6062.
Also see c4a41e0 for usage. In particular, the app needs to launch the progress thread manually (so we don't oversubscribe the cores unintentionally) and needs to set these environment variables --

MPIR_CVAR_CH4_RESERVE_VCIS=1
MPIR_CVAR_CH4_ENABLE_STREAM_WORKQ=1 
MPIR_CVAR_GPU_HAS_WAIT_KERNEL=1

This approach involves a wait kernel -- a CUDA kernel that busy-waits on an external atomic variable. We are hitting some deadlock issues, likely because the CUDA runtime is unaware of the dependency and may have extra locks or synchronizations that cause the deadlock. Effort is needed to pin down the issue and work out a mechanism to ensure robustness, as well as to verify the performance.
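
Conceptually, the wait kernel is just a kernel spinning on a flag that the host-side progress thread flips, e.g. (a rough sketch, not MPICH's actual implementation; a real version would use proper atomics and a flag in host-mapped or managed memory):

__global__ void wait_kernel(volatile int *flag)
{
  /* block this CUDA stream until the host-side progress engine signals
     that the MPI operation has completed */
  while (*flag == 0) { }
}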

Jacobfaib (Author) commented May 15, 2023

Effort is needed to pin down the issue and work out a mechanism to ensure robustness, as well as to verify the performance.

OK, I have done some more digging and consulted with some NVIDIA representatives. Note that none of this is explicitly confirmed -- it is mostly based on what I can glean from the logs -- but I am pretty certain it is the case.

The CUDA runtime seems to have a queue (apparently per stream) to which each kernel launch appends. This queue is of fixed size, and once it fills up, subsequent kernel launches block until a slot opens up. This is precisely the pthread_cond_wait stall; as you know, multithreaded queues usually have a condition variable (a std::condition_variable ready; member, say) that both the producers and consumers use to signal to each other.

If I add a cudaDeviceSynchronize(), say every 20 iterations, you will note that the kernel launch times stay low afterwards (sketched at the end of this comment).
[screenshot: profiler timeline with cudaDeviceSynchronize() every 20 iterations; kernel launch times stay low]
Furthermore if we remove the cudaDeviceSynchronize() and instead rotate streams every 20 iterations:
[screenshot: profiler timeline with stream rotation every 20 iterations]
The difference is even more apparent, hence my belief that this queue limit is per stream.
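
In code, the workaround looks roughly like this (the period of 20 is arbitrary):

for (int i = 0; i < 100; ++i) {
  VecScaleAsync(x, ...); // effectively cublasDscal(...);
  MPIX_Allreduce_enqueue(MPI_IN_PLACE, device_ptr, 1, MPIU_SCALAR, MPIU_SUM, scomm);
  if (i % 20 == 19) cudaDeviceSynchronize(); // drain the per-stream launch queue
}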

Jacobfaib (Author) commented:

In particular, the app needs to launch the progress thread manually (so we don't oversubscribe the cores unintentionally)

OK I've also tried this, but am running into errors. For reference I am calling MPIX_Start_progress_thread(mpi_stream) immediately after each MPIX_Stream is created, and calling MPIX_Stop_progress_thread(mpi_stream) immediately before each MPIX_Stream is destroyed. Is this order wrong?
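
In code, the ordering is roughly (a sketch using the call names above):

MPIX_Stream_create(info, &mpi_stream);
MPIX_Start_progress_thread(mpi_stream);   /* immediately after creation */
/* ... enqueue operations on the corresponding stream comm ... */
MPIX_Stop_progress_thread(mpi_stream);    /* immediately before destruction */
MPIX_Stream_free(&mpi_stream);

This then fails with the UCX assertions below: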

[1684169539.768883] [petsc-gpu-01:1860750:1]           debug.c:1289 UCX  WARN  ucs_debug_disable_signal: signal 8 was not set in ucs
[1684169539.768900] [petsc-gpu-01:1860750:1]           debug.c:1289 UCX  WARN  ucs_debug_disable_signal: signal 4 was not set in ucs
[petsc-gpu-01:1860750:0:1860976]  ucp_worker.c:2781 Assertion `worker->inprogress++ == 0' failed
[petsc-gpu-01:1860750:1:1860977]  ucp_worker.c:2786 Assertion `--worker->inprogress == 0' failed
[petsc-gpu-01:1860749:0:1860975]  ucp_worker.c:2786 Assertion `--worker->inprogress == 0' failed
==== backtrace (tid:1860976) ====
 0  /home/ac.jfaibussowitsch/petsc/main-arch-cuda-opt/lib/libucs.so.0(ucs_debug_print_backtrace+0x39) [0x7fbc187c2fd9]
 1  /home/ac.jfaibussowitsch/petsc/main-arch-cuda-opt/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fbc187c5604]
 2  /home/ac.jfaibussowitsch/petsc/main-arch-cuda-opt/lib/libucs.so.0(ucs_fatal_error_message+0xca) [0x7fbc187c255a]
 3  /home/ac.jfaibussowitsch/petsc/main-arch-cuda-opt/lib/libucs.so.0(ucs_fatal_error_format+0x122) [0x7fbc187c2682]
 4  /home/ac.jfaibussowitsch/petsc/main-arch-cuda-opt/lib/libucp.so.0(ucp_worker_progress+0x98) [0x7fbc26f11348]
 5  /scratch/jfaibussowitsch/petsc/main-arch-cuda-opt/lib/libmpi.so.0(+0x94e242) [0x7fbc3600f242]
 6  /scratch/jfaibussowitsch/petsc/main-arch-cuda-opt/lib/libmpi.so.0(+0x94e88f) [0x7fbc3600f88f]
 7  /scratch/jfaibussowitsch/petsc/main-arch-cuda-opt/lib/libmpi.so.0(+0x9fa1fb) [0x7fbc360bb1fb]
 8  /opt/nvidia/nsight-systems/2022.1.3/target-linux-x64/libToolsInjection64.so(+0x4d635d) [0x7fb9d4b1335d]
 9  /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7fbc3524cb43]
10  /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7fbc352dea00]

pmodels locked and limited conversation to collaborators Jun 13, 2023
hzhou converted this issue into discussion #6559 Jun 13, 2023
