
Memory scheduling deadlock #11

Merged: 31 commits from memory-scheduling-deadlock into memory-scheduling on May 24, 2022

Conversation

jaewan (Collaborator) commented Mar 3, 2022

Why are these changes needed?

Implementation of deadlock detection.
This compares the number of leased_workers against the number of spinning_tasks (tasks that are blocked from triggering a spill). When the two counts match, every leased worker is blocked, so we treat it as a deadlock and trigger a spill.
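For illustration, here is a minimal sketch of the counting logic described above, assuming both counters are tracked in one place; the names (DeadlockDetector, OnTaskSpinning, trigger_spill_) are made up for this sketch and are not the PR's actual identifiers.

```
#include <cstddef>
#include <functional>
#include <utility>

class DeadlockDetector {
 public:
  explicit DeadlockDetector(std::function<void()> trigger_spill)
      : trigger_spill_(std::move(trigger_spill)) {}

  void OnWorkerLeased() { ++leased_workers_; }
  void OnWorkerReturned() { --leased_workers_; }

  // Called whenever a task blocks waiting for object store memory.
  void OnTaskSpinning() {
    ++spinning_tasks_;
    // If every leased worker is spinning, no task can make progress
    // without spilling, so treat it as a deadlock and spill.
    if (leased_workers_ > 0 && spinning_tasks_ == leased_workers_) {
      trigger_spill_();
    }
  }
  void OnTaskUnblocked() { --spinning_tasks_; }

 private:
  std::size_t leased_workers_ = 0;
  std::size_t spinning_tasks_ = 0;
  std::function<void()> trigger_spill_;
};
```

The invariant is that a spill is forced only when every leased worker is blocked at the same time, i.e. no running task can make progress or release memory on its own.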

stephanie-wang (Owner) left a comment


Two things:

  1. Ideally, we shouldn't leak the number of spilled workers to the plasma request queue. Instead, can we do this logic in the task manager? The reason is that we already have a way for the task manager to notify the request queue when to block/allow spill, so we can build off of this codepath instead of creating a new one.
  2. Can you add a test in Python that demonstrates the new behavior? We want one that previously did not work when blocking spills was enabled and now does. Also, we should compare the runtime to the default Ray implementation that always spills.

stephanie-wang self-assigned this Mar 25, 2022
jaewan (Collaborator, Author) commented Mar 30, 2022

> Two things:
>
>   1. Ideally, we shouldn't leak the number of spilled workers to the plasma request queue. Instead, can we do this logic in the task manager? The reason is that we already have a way for the task manager to notify the request queue when to block/allow spill, so we can build off of this codepath instead of creating a new one.
>   2. Can you add a test in Python that demonstrates the new behavior? We want one that previously did not work when blocking spills was enabled and now does. Also, we should compare the runtime to the default Ray implementation that always spills.
  1. I changed the implementation so that the spill is triggered from the task_manager. This is done by setting should_spill_ in the object store via io_post.
    !!! Can you check this, Stephanie? Is this design OK, or should I go back to the previous design?
    1.1 I also changed it so that should_spill_ is set from the task_manager only when it should be true: (a) when #spinning_workers == #leased_workers, or (b) when evict_tasks() is called but does not evict anything. In the previous design, should_spill_ was always set after calling evict_tasks() and block_spill(), so the object store never needed to reset it. The advantage of that design is that, because the calls are asynchronous, there is a time lag between the task_manager setting should_spill_ and the object store acting on it, so more spill() calls can be triggered in the meantime. But I thought that was wasting communication cost, so I changed the implementation: when should_spill_ is set, the object_manager triggers spill once and then clears the flag itself, so the task_manager does not need to reset it. (A sketch of this flow follows this list.)

  2. We have a separate PR for the test scripts. Do you want me to add the test for this PR here or to the test-script PR?
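As referenced in point 1.1, here is a minimal sketch of that flag-based flow, assuming the task manager can post closures onto the object store's event loop; RequestSpill, ProcessRequests, SpillOnce, and io_post_ are hypothetical names, not the PR's actual functions.

```
#include <functional>
#include <utility>

class ObjectStoreQueue {
 public:
  // The executor stands in for posting work onto the object store's
  // io_context so the flag is only ever touched from that thread.
  explicit ObjectStoreQueue(std::function<void(std::function<void()>)> io_post)
      : io_post_(std::move(io_post)) {}

  // Called by the task manager when it detects a deadlock
  // (#spinning_workers == #leased_workers) or when evict_tasks()
  // evicted nothing.
  void RequestSpill() {
    io_post_([this]() { should_spill_ = true; });
  }

  // Called from the object store's own event loop.
  void ProcessRequests() {
    if (should_spill_) {
      SpillOnce();
      // Clear the flag here so the task manager never has to send a
      // second message just to reset it.
      should_spill_ = false;
    }
    // ... serve queued object-creation requests ...
  }

 private:
  void SpillOnce() { /* trigger a single spill */ }

  bool should_spill_ = false;
  std::function<void(std::function<void()>)> io_post_;
};
```

In this shape the flag is written and cleared only on the object store's own thread, which is why no extra round trip from the task manager is needed to reset it.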

```
on_object_creation_blocked_callback_(lowest_pri, enable_blocktasks, enable_evicttasks);
if (enable_blocktasks_spill) {
  RAY_LOG(DEBUG) << "[JAE_DEBUG] task " << task_id << " spins";
  for (auto it = queue_.begin(); it != queue_.end(); it++) {
```
stephanie-wang (Owner) commented on this snippet:
Can't we just use queue.size()?
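For illustration only: if the loop shown above merely counts the queued requests, the same value is available directly from the container, assuming queue_ is a standard container such as std::list.

```
#include <cstddef>
#include <iostream>
#include <list>

int main() {
  std::list<int> queue_ = {1, 2, 3};

  // Manual count, as in the reviewed loop.
  std::size_t count = 0;
  for (auto it = queue_.begin(); it != queue_.end(); it++) {
    ++count;
  }

  // Equivalent, and simpler.
  std::cout << count << " == " << queue_.size() << std::endl;
  return 0;
}
```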

stephanie-wang (Owner) commented Apr 7, 2022

The implementation looks much better!

> We have a separate PR for the test scripts. Do you want me to add the test for this PR here or to the test-script PR?

Can you add just one script that previously deadlocked and now works with this PR?

stephanie-wang merged commit 9bade78 into memory-scheduling on May 24, 2022
stephanie-wang deleted the memory-scheduling-deadlock branch on May 24, 2022 at 22:22
stephanie-wang pushed a commit that referenced this pull request on Aug 5, 2022:
We encountered a SIGSEGV when running the Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0  0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
   from /lib64/libstdc++.so.6
#1  0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2  0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3  0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4  0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
   from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5  0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6  0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
    at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).

It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.
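For context, here is a minimal sketch of this kind of cross-translation-unit initialization-order hazard and of why writing to `std::cerr` sidesteps it; the file and symbol names are invented for the sketch, not Ray's actual ones.

```
// ---- logger.cc (assumed) ----------------------------------------------
// A namespace-scope static with a non-trivial constructor, analogous to
// RayLog::logger_name_ in logging.cc:
//
//   std::string g_logger_name = "ray_log_sink";

// ---- delay_manager.cc (assumed) ---------------------------------------
#include <iostream>
#include <string>

extern std::string g_logger_name;  // defined in the other translation unit

struct DelayManager {
  DelayManager() {
    // UNSAFE: the initialization order of statics across translation
    // units is unspecified, so g_logger_name may not be constructed yet
    // when this runs (this is what RAY_LOG hits via RayLog::logger_name_).
    // std::cout << "logger: " << g_logger_name << "\n";

    // SAFE: std::cerr is set up by <iostream>'s ios_base::Init mechanism
    // before statics defined below the include in this translation unit,
    // so using it has no cross-TU ordering dependency.
    std::cerr << "RAY_testing_asio_delay_us is set" << std::endl;
  }
};

// Constructed during static initialization, e.g. when the .so is dlopen'ed.
static DelayManager g_delay_manager;
```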

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and removing the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as a `std::unique_ptr<DelayManager>` and initializing it in `get_delay_us`. This works, but it requires a lock to be thread-safe.
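For reference, a minimal sketch of the general "construct on first use" idiom behind approach 1, with the static moved into a dedicated accessor so that both `init()` and `get_delay_us` can reach it (a variation on the variant described above, where the static was local to `get_delay_us`); all names here are illustrative.

```
#include <cstdint>

class DelayManager {
 public:
  void Init() { delay_us_ = 0; /* e.g. re-read RAY_testing_asio_delay_us */ }
  int64_t GetDelayUs() const { return delay_us_; }

 private:
  int64_t delay_us_ = 0;
};

// Accessor replacing a namespace-scope `static DelayManager _delay_manager`.
DelayManager &GetDelayManager() {
  static DelayManager instance;  // constructed on first use, never too early
  return instance;
}

int64_t get_delay_us() { return GetDelayManager().GetDelayUs(); }
void init() { GetDelayManager().Init(); }
```

Since C++11 the initialization of the function-local static is itself thread-safe, which is what makes this idiom attractive compared with a bare global guarded by a lock.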