Commit

Merge branch 'dag-2' into dag
rkooo567 committed Nov 23, 2023
2 parents a87e0f5 + 86ec2ac commit d3e10fa
Showing 3 changed files with 24 additions and 7 deletions.
28 changes: 22 additions & 6 deletions TODO.md
@@ -13,19 +13,35 @@
[x] VLLM + DAG
- Implement OutputNode
- Fix DAG API
-[ ] Serialization edge cases -> Sang
-- Implement pinning for plasma client objects (so that we don't need the hack to pin in Python memory, which only works for zero-copy objects)
-- Implement data size header to account for different-size objects
-[ ] Connect actors and driver -> Stephanie
+[x] Connect actors and driver -> Stephanie
- Receiver CoreWorker actor selects between normal task queue and signals from shared-memory objects
- Implement a DependencyWaiter that uses the alternative plasma client to get objects
[x] Serialization edge cases -> Sang
- [x] Implement data size header to account for different-size objects


[ ] Performance optimization -> Eric
-- Specialize Get/Put path
-- Specialize Execution
+- [ ] Implement all optimization
+- Specialize Get/Put path -> not working
+- Specialize Execution -> not working
[ ] Input should accept multi args -> Stephanie
[ ] Make it work with multi-reader for tp > 1 -> Stephanie
- Add assertion when the data size is too big compared to max buffer size
[ ] Verify vllm works with existing setup
[ ] Exception / failure handling with VLLM and accelerated DAG
[ ] exception -> Stephanie
[ ] failure handling (maybe use ray.get on the task output) -> Sang


Next week
[ ] Handle different metadata size
[ ] Make it work with Mac or do not build
[ ] Harden shared-memory based Seal
- [ ] edge case: object aborted/client dies while waiting for seal
- [ ] fix plasma client ref counting
- [ ] disable ref counting for real
- Implement pinning for plasma client objects (so that we don't need the hack to pin in Python memory, which only works for zero-copy objects)
[ ] ray.release on a list causes segfault

Limitation:
- Not working well with regular actor tasks
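
The "data size header" and "max buffer size" items in the list above can be illustrated with a small sketch. This is not the code from the commit: `HEADER`, `write_payload`, and `read_payload` are hypothetical names, the 8-byte header layout is an assumption, and a plain `bytearray` stands in for the shared-memory buffer.

```python
import struct

# Assumed layout (not taken from the commit): an 8-byte little-endian
# length, followed by the payload bytes.
HEADER = struct.Struct("<Q")


def write_payload(buf: bytearray, payload: bytes) -> None:
    """Write a length-prefixed payload into a fixed-size, reusable buffer."""
    needed = HEADER.size + len(payload)
    # Mirrors the TODO item: fail loudly when the data cannot fit.
    assert needed <= len(buf), (
        f"payload needs {needed} bytes but the buffer holds {len(buf)}"
    )
    HEADER.pack_into(buf, 0, len(payload))
    buf[HEADER.size:needed] = payload


def read_payload(buf: bytearray) -> bytes:
    """Read back only the bytes that the header marks as valid."""
    (size,) = HEADER.unpack_from(buf, 0)
    return bytes(buf[HEADER.size:HEADER.size + size])


buf = bytearray(64)  # stand-in for a shared-memory buffer
write_payload(buf, b"a longer first message")
write_payload(buf, b"short")  # reuse the same buffer for a smaller object
assert read_payload(buf) == b"short"
```

Because the reader trusts the header rather than the buffer length, stale bytes left over from a larger earlier write are ignored.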
2 changes: 1 addition & 1 deletion python/ray/dag/compiled_dag_node.py
@@ -5,7 +5,7 @@
 import ray


-MAX_BUFFER_SIZE = 100
+MAX_BUFFER_SIZE = 1 * 1e9

 def allocate_shared_output_buffer(buffer_size_bytes: int = MAX_BUFFER_SIZE):
     ref = ray.put(b"0" * buffer_size_bytes, max_readers=1)
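
One thing worth checking in this hunk: `1 * 1e9` is a Python `float`, and multiplying a `bytes` object by a float raises `TypeError`, so the default argument would likely fail at `b"0" * buffer_size_bytes`. A minimal sketch of an integer-valued constant follows. `make_placeholder` is a hypothetical helper that leaves out the `ray.put` call, and the demo uses small sizes to avoid allocating 1 GB.

```python
MAX_BUFFER_SIZE = 10**9  # int; 1 * 1e9 evaluates to the float 1000000000.0


def make_placeholder(buffer_size_bytes: int = MAX_BUFFER_SIZE) -> bytes:
    """Build the placeholder bytes for the buffer (hypothetical helper, no Ray)."""
    return b"0" * buffer_size_bytes


assert isinstance(1 * 1e9, float)
assert len(make_placeholder(1024)) == 1024

try:
    b"0" * 1e3  # same shape as the float default, just smaller
except TypeError as exc:
    print(f"float size rejected: {exc}")
```

Writing the constant as `10**9` or `int(1e9)` keeps the annotated `int` type honest and avoids the failure.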
1 change: 1 addition & 0 deletions python/ray/dag/dag_node.py
@@ -131,6 +131,7 @@ def execute(
            - resolved values representing user input at runtime
        """
        if compiled:
            print(args)
            assert len(args) == 1, "Compiled DAGs support exactly one InputNode arg"
            input_ref, input_max_readers, output_ref = self.compile()
            ray.worker.global_worker.put_object(