[NVIDIA GPU SPMD] Add runtime support to run windowed einsum in multiple streams #8707
Conversation
Could you add a bit more context/benchmarks to the commit description on which workflows this makes faster?
With CUDA graphs this should happen automatically without any need to add user annotations to operations; I'd suggest trying that first to see if you get any perf improvement from graphs. In general it's OK to add an execution stream to Thunk (but without AsyncExecutors) and instead add a new thunk for synchronization.
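For illustration, a minimal sketch of that suggestion; the names (ExecutionStreamId, SyncOnStreamsThunk) match what the PR eventually adds, but the shapes and signatures below are assumptions, not the actual XLA API:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

#include "absl/status/status.h"

namespace xla::gpu {

// Assumption: each thunk carries the id of the compute stream it should run
// on; id 0 denotes the default stream.
using ExecutionStreamId = int64_t;

// Assumed shape of the new synchronization thunk: instead of wrapping work in
// an AsyncExecutor, a thunk in the sequence makes `stream_id` wait until all
// streams in `wait_on` have reached this point (e.g. via recorded events).
class SyncOnStreamsThunk {
 public:
  SyncOnStreamsThunk(ExecutionStreamId stream_id,
                     std::vector<ExecutionStreamId> wait_on)
      : stream_id_(stream_id), wait_on_(std::move(wait_on)) {}

  absl::Status ExecuteOnStream(/* const Thunk::ExecuteParams& params */) {
    // Record an event on every stream in wait_on_, then have stream_id_
    // wait on those events before any later thunk runs on it.
    return absl::OkStatus();
  }

 private:
  ExecutionStreamId stream_id_;
  std::vector<ExecutionStreamId> wait_on_;
};

}  // namespace xla::gpu
```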
Yes, I have experimented with this before, but it doesn't give as much perf benefit as explicitly running kernels in separate streams. I'm trying to understand the new thunk mechanism: are you proposing to use JoinExecutionThreadsThunk to wrap the execution of the thunk sequence within the while thunk, or is it just a synchronization scheme in the graph? Maybe a pseudo-HLO graph would also be helpful.
Basically the proposal is: in … emit …

The tricky part is how to tell buffer assignment not to reuse buffers: BufferAssignment will assign the same buffer for the scratch allocation, and it is unsafe to run these gemms in parallel. (VLOG level of 100 for …)
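As a concrete illustration of the hazard (hypothetical allocation numbers, shapes elided):

```
// BufferAssignment sees the two cuBLAS workspaces as non-overlapping in the
// sequential schedule, so it reuses one allocation for both:
%gemm0 = (...) custom-call(...), custom_call_target="__cublas$gemm"  // workspace -> allocation 5
%gemm1 = (...) custom-call(...), custom_call_target="__cublas$gemm"  // workspace -> allocation 5, reused
// Safe back-to-back on one stream; a data race if the two gemms are
// dispatched concurrently on different streams.
```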
Ah OK, so we basically make the async executor into a thunk to be more explicit. I think I can give this a try. For buffer assignment, could we tell it to reuse a buffer only for kernels with the same exe_id? As for the concurrent region, I dumped the graph and it looks like each of the gemms has its own graph instance; it looks like the concurrent region won't work with scratch allocations.
The proper XLA solution at the HLO level would be to put async-start and async-done around the gemm custom call (dot instruction). At the Thunk level, ExecutionThreadId looks like a reasonable abstraction. With async wrappers, buffer assignment will work out of the box.
The correct XLA way of doing this would be something like this:
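As a sketch reconstructed from the conventions in the async_ops.md doc linked below (shapes, names, and the custom-call target are illustrative, not taken from this PR):

```
%gemm_computation (p0: f32[1024,512], p1: f32[512,1024]) -> f32[1024,1024] {
  %p0 = f32[1024,512] parameter(0)
  %p1 = f32[512,1024] parameter(1)
  ROOT %gemm = f32[1024,1024] custom-call(%p0, %p1),
      custom_call_target="__cublas$gemm"
}

%gemm-start = ((f32[1024,512], f32[512,1024]), f32[1024,1024], s32[])
    async-start(%lhs, %rhs), async_execution_thread="stream_1",
    calls=%gemm_computation
%gemm-done = f32[1024,1024] async-done(%gemm-start)
```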
This IR has syntactic sugar in XLA, and will be printed like:
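Again as a sketch, continuing the same illustrative gemm:

```
%gemm-start = ((f32[1024,512], f32[512,1024]), f32[1024,1024], s32[])
    custom-call-start(%lhs, %rhs), custom_call_target="__cublas$gemm"
%gemm-done = f32[1024,1024] custom-call-done(%gemm-start)
```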
With this representation you can assign execution threads to individual gemms, and at the IR emitter stage we'll have a lowering for the AsyncStart of a CustomCall operation like:
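As pseudocode, using the thunk names and pairing from the PR description below (the notation itself is illustrative):

```
SyncOnStreamsThunk   // stream_id waits until the main compute stream gets here
GemmThunk            // the wrapped gemm, executed on stream_id
```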
and AsyncDone will be lowered to:
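which in the same notation is just the reverse synchronization:

```
SyncOnStreamsThunk   // the main compute stream waits until stream_id is done
```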
This is the way XLA is going with async instructions, and it will have out-of-the-box support from buffer assignment. See more details here: https://github.com/openxla/xla/blob/main/docs/async_ops.md Stream-id attributes in backend configs are hacks that will require even more hacks to make buffer assignment work properly. With the correct async representation it will be lowered to the correct thunk sequence, and it will also have correct lowering to commands (CUDA graphs) with a proper DAG.
I have created a separate discussion on the openxla forum here to describe the motivation and design.
I have included your suggestion in the discussion here. I think the AsyncExecutor is deprecated, so I removed that alternative. I have the basic functionality working for the HLO graph part and am trying to get the async thunks working to run it.
I didn't add notes for all namespaces, but with C++17 we prefer `namespace xla::gpu` for new code.
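For reference, the two spellings:

```cpp
// C++17 nested namespace definition, preferred for new code:
namespace xla::gpu {
// ...
}  // namespace xla::gpu

// Pre-C++17 equivalent:
namespace xla {
namespace gpu {
// ...
}  // namespace gpu
}  // namespace xla
```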
xla/service/gpu/thunk.cc
```cpp
    return params.stream;
  }
  auto iter = params.additional_compute_streams.find(stream_id);
  CHECK(iter != params.additional_compute_streams.end());
```
New CHECKs are strongly discouraged in XLA; absl::StatusOr<se::Stream*> is the preferred result type here.
I changed it to return an InvalidArgument error instead of a CHECK.
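A sketch of that change, shaped after the snippet above (the function name, the ExecuteParams fields, and kDefaultExecutionStreamId are assumptions):

```cpp
absl::StatusOr<se::Stream*> Thunk::GetStreamForExecution(
    ExecutionStreamId stream_id, const ExecuteParams& params) {
  // Assumed: the default stream id maps to the main compute stream that
  // every thunk already receives via params.
  if (stream_id == kDefaultExecutionStreamId) {
    return params.stream;
  }
  auto iter = params.additional_compute_streams.find(stream_id);
  if (iter == params.additional_compute_streams.end()) {
    return absl::InvalidArgumentError("Invalid execution stream id.");
  }
  return iter->second;
}
```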
There are a few more build-time errors: …

Hopefully the latest commit will fix them.
added emitter logic for SyncOnStreamsThunk
[NVIDIA GPU SPMD] Add runtime support to run windowed einsum in multiple streams

Imported from GitHub PR openxla/xla#8707

This PR contains the runtime changes needed to run windowed einsum in multiple CUDA streams. It follows openxla/xla#7854, which adds stream attributes to the HLO graph. We take the stream attributes and dispatch the corresponding kernels to separate CUDA streams by wrapping each kernel in an AsyncStart/AsyncDone pair with a non-default stream id. For AsyncStart, the emitter will emit a SyncOnStreamsThunk and then the kernel's thunk; for AsyncDone, it will just emit a SyncOnStreamsThunk. Detailed discussion [here](openxla/xla#8865).

Copybara import of the project:

-- 22320215797434b0f507aeb93872c174cd4499c5 by TJ <[email protected]>:

Added a thunk for stream synchronization; added emitter logic for SyncOnStreamsThunk

-- 70068f9fc81343a2e0c2b48d05ad98df1c9d70ab by TJ <[email protected]>:

attempting to fix build errors

Merging this change closes #8707

PiperOrigin-RevId: 604215385