Merge branch 'dag-2' into dag

stephanie-wang · Nov 23, 2023 · d3e10fa
2 parents a87e0f5 + 86ec2ac

Showing 3 changed files with 24 additions and 7 deletions.
diff --git a/TODO.md b/TODO.md
@@ -13,19 +13,35 @@
 [x] VLLM + DAG
     - Implement OutputNode
     - Fix DAG API
-[ ] Serialization edge cases -> Sang
-    - Implement pinning for plasma client objects (so that we don't need the hack to pin in Python memory, which only works for zero-copy objects)
-    - Implement data size header to account for different-size objects
-[ ] Connect actors and driver -> Stephanie
+[x] Connect actors and driver -> Stephanie
     - Receiver CoreWorker actor selects between normal task queue and signals from shared-memory objects
         - Implement a DependencyWaiter that uses the alternative plasma client to get objects
+[x] Serialization edge cases -> Sang
+    - [x] Implement data size header to account for different-size objects
+
+
 [ ] Performance optimization -> Eric
-    - Specialize Get/Put path
-    - Specialize Execution
+    - [ ] Implement all optimizations
+    - Specialize Get/Put path -> not working
+    - Specialize Execution -> not working
+[ ] Input should accept multi args -> Stephanie
+[ ] Make it work with multi-reader for tp > 1 -> Stephanie
+    - Add assertion when the data size is too big compared to max buffer size
+[ ] Verify vllm works with existing setup
+[ ] Exception / failure handling with VLLM and accelerated DAG
+    [ ] exception -> Stephanie
+    [ ] failure handling (maybe use ray.get on the task output) -> Sang
+
+
+Next week
+[ ] Handle different metadata size
+[ ] Make it work with Mac or do not build
 [ ] Harden shared-memory based Seal
     - [ ] edge case: object aborted/client dies while waiting for seal
     - [ ] fix plasma client ref counting
     - [ ] disable ref counting for real
+    - Implement pinning for plasma client objects (so that we don't need the hack to pin in Python memory, which only works for zero-copy objects)
+[ ] ray.release on a list causes segfault
 
 Limitation:
 - Not working well with regular actor tasks
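
For reference, the "data size header" and "assertion when the data size is too big" items in the TODO above could look roughly like the sketch below. This is only an illustration under assumptions: the 8-byte little-endian header layout and the helper names `write_with_size_header` / `read_with_size_header` are hypothetical and are not taken from this branch, and a plain `bytearray` stands in for the shared-memory buffer.

```python
import struct

# Hypothetical layout: an 8-byte little-endian unsigned length, then the payload.
HEADER_FMT = "<Q"
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def write_with_size_header(buf: bytearray, payload: bytes) -> None:
    """Write `payload` into a fixed-size buffer, prefixed by its length."""
    capacity = len(buf) - HEADER_SIZE
    # Fail loudly instead of silently truncating an oversized object.
    assert len(payload) <= capacity, (
        f"payload of {len(payload)} bytes exceeds buffer capacity "
        f"of {capacity} bytes"
    )
    struct.pack_into(HEADER_FMT, buf, 0, len(payload))
    # Same-length slice assignment keeps the buffer size unchanged.
    buf[HEADER_SIZE:HEADER_SIZE + len(payload)] = payload


def read_with_size_header(buf: bytearray) -> bytes:
    """Read back only the bytes that the header says are valid."""
    (size,) = struct.unpack_from(HEADER_FMT, buf, 0)
    return bytes(buf[HEADER_SIZE:HEADER_SIZE + size])
```

With a header like this, a reader can reuse one fixed-size buffer for objects of different sizes, because stale bytes left over from a longer, earlier write are never returned.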

diff --git a/python/ray/dag/compiled_dag_node.py b/python/ray/dag/compiled_dag_node.py
@@ -5,7 +5,7 @@
 import ray
 
 
-MAX_BUFFER_SIZE = 100
+MAX_BUFFER_SIZE = int(1e9)  # bytes; must be an int for b"0" * buffer_size_bytes
 
 def allocate_shared_output_buffer(buffer_size_bytes: int = MAX_BUFFER_SIZE):
     ref = ray.put(b"0" * buffer_size_bytes, max_readers=1)
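
A note on the buffer size: Python only allows repeating a `bytes` object by an integer count, so a float such as `1e9` has to be converted before it reaches `b"0" * buffer_size_bytes`. A minimal sketch of a defensive conversion, using a hypothetical helper `make_placeholder` that is not part of this diff:

```python
def make_placeholder(buffer_size_bytes) -> bytes:
    """Build a zero-filled placeholder payload of the requested size.

    Hypothetical helper for illustration only. Accepts whole-number
    floats such as 1e3 and rejects sizes that are not whole numbers.
    """
    size = int(buffer_size_bytes)
    if size != buffer_size_bytes:
        raise ValueError(f"buffer size must be a whole number, got {buffer_size_bytes!r}")
    if size < 0:
        raise ValueError("buffer size must be non-negative")
    # bytes repetition requires an int; b"0" * 1e3 would raise TypeError.
    return b"0" * size
```

Converting once at the constant's definition is simpler. The helper only matters if callers may pass in their own sizes.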

diff --git a/python/ray/dag/dag_node.py b/python/ray/dag/dag_node.py
@@ -131,6 +131,7 @@ def execute(
                 - resolved values representing user input at runtime
         """
         if compiled:
+            print(args)
             assert len(args) == 1, "Compiled DAGs support exactly one InputNode arg"
             input_ref, input_max_readers, output_ref = self.compile()
             ray.worker.global_worker.put_object(