[aDAG] support buffered input (ray-project#47272)
Based on https://docs.google.com/document/d/1Ka_HFwPBNIY1u3kuroHOSZMEQ8AgwpYciZ4n08HJ0Xc/edit

When there are many in-flight requests (pipelining inputs to the DAG), two problems occur:

1. Input submitter timeout. InputSubmitter.write() waits until downstream tasks have read the buffer. Because the timeout clock starts as soon as InputSubmitter.write() is called, later requests are likely to time out when many requests are in flight.
2. Pipeline bubble. The output fetcher doesn’t read the channel until CompiledDagRef.get is called. This means the upstream task (actor 2) stays blocked until the driver calls .get, even though it could be executing tasks.

This PR solves both problems by providing multiple buffers per shm channel. Note that buffering is not yet supported for NCCL (we can add it when we overlap compute/comm).
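
The timeout problem can be illustrated with a toy model. This is a minimal sketch, not Ray code: a standard-library queue.Queue stands in for the shm channel, and the helper name submit_all and its parameters are hypothetical. With one slot and no reader, every write after the first blocks until its timeout expires; with several slots, the same writes complete immediately.

```python
import queue


def submit_all(num_requests, num_buffers, timeout_s=0.05):
    """Write num_requests items with no reader; return how many writes timed out.

    queue.Queue is only a stand-in for a shared-memory channel (assumption).
    """
    channel = queue.Queue(maxsize=num_buffers)
    timed_out = 0
    for i in range(num_requests):
        try:
            # Like the write described above, the timeout clock starts at the call.
            channel.put(i, timeout=timeout_s)
        except queue.Full:
            timed_out += 1
    return timed_out


# One slot: the first write succeeds, the later ones wait for a reader and time out.
single_slot_timeouts = submit_all(num_requests=3, num_buffers=1)
# Several slots: all three writes fit without waiting.
buffered_timeouts = submit_all(num_requests=3, num_buffers=10)
```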

Main changes

- Introduce BufferedSharedMemoryChannel, which allows creating multiple buffers (10 by default). Reads and writes are done in a round-robin manner.
- When there are more in-flight requests than buffers, the DAG can still hit a timeout error. To make debugging easy and the behavior straightforward, we introduce the max_buffered_inputs_ argument. If more than max_buffered_inputs_ requests are submitted to the DAG without ray.get, an exception is raised immediately.
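
The round-robin idea can be sketched in a few lines. This is an illustrative toy, not the BufferedSharedMemoryChannel implementation: the class name RoundRobinBufferedChannel, its methods, and the choice to raise RuntimeError when all slots are full are assumptions made for the example (the exception loosely mirrors the immediate failure described for max_buffered_inputs_).

```python
class RoundRobinBufferedChannel:
    """Toy in-process channel with a fixed number of slots (hypothetical)."""

    def __init__(self, num_buffers=10):
        self._slots = [None] * num_buffers
        self._full = [False] * num_buffers
        self._next_write = 0  # slot index the next write goes to
        self._next_read = 0   # slot index the next read comes from

    def write(self, value):
        idx = self._next_write
        if self._full[idx]:
            # Every slot holds an unread value: fail fast instead of blocking.
            raise RuntimeError("too many buffered inputs; read a result first")
        self._slots[idx] = value
        self._full[idx] = True
        self._next_write = (idx + 1) % len(self._slots)

    def read(self):
        idx = self._next_read
        if not self._full[idx]:
            raise RuntimeError("no buffered value to read")
        value = self._slots[idx]
        self._slots[idx] = None
        self._full[idx] = False
        self._next_read = (idx + 1) % len(self._slots)
        return value


ch = RoundRobinBufferedChannel(num_buffers=2)
ch.write("a")
ch.write("b")
try:
    ch.write("c")  # both slots are still unread
    overflowed = False
except RuntimeError:
    overflowed = True

first = ch.read()  # frees slot 0
ch.write("c")      # now fits, reusing slot 0
rest = [ch.read(), ch.read()]
```

Values come back in the order they were written because the read and write cursors advance through the slots in the same order.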

Signed-off-by: ujjawal-khare <[email protected]>
rkooo567 authored and ujjawal-khare committed Oct 15, 2024
1 parent e983f0c commit 3e507d1
Showing 1 changed file with 0 additions and 0 deletions.
Empty file added serve/serve_40797.log
Empty file.
