
Memory scheduling deadlock #11

Merged: 31 commits from memory-scheduling-deadlock into memory-scheduling on May 24, 2022

Conversation

jaewan (Collaborator) commented Mar 3, 2022

Why are these changes needed?

Implementation of deadlock detection.
This compares the number of leased_workers against the number of spinning_tasks (tasks that are blocked from triggering a spill). When the two counts match, every leased worker is blocked, so we treat it as a deadlock and trigger a spill.
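For illustration, here is a minimal sketch of the counting logic described above, assuming both counters are tracked in one place; the names (DeadlockDetector, OnTaskSpinning, trigger_spill_) are made up for this sketch and are not the PR's actual identifiers.

```
#include <cstddef>
#include <functional>
#include <utility>

class DeadlockDetector {
 public:
  explicit DeadlockDetector(std::function<void()> trigger_spill)
      : trigger_spill_(std::move(trigger_spill)) {}

  void OnWorkerLeased() { ++leased_workers_; }
  void OnWorkerReturned() { --leased_workers_; }

  // Called whenever a task blocks waiting for object store memory.
  void OnTaskSpinning() {
    ++spinning_tasks_;
    // If every leased worker is spinning, no task can make progress
    // without spilling, so treat it as a deadlock and spill.
    if (leased_workers_ > 0 && spinning_tasks_ == leased_workers_) {
      trigger_spill_();
    }
  }
  void OnTaskUnblocked() { --spinning_tasks_; }

 private:
  std::size_t leased_workers_ = 0;
  std::size_t spinning_tasks_ = 0;
  std::function<void()> trigger_spill_;
};
```

The invariant is that a spill is forced only when every leased worker is blocked at the same time, i.e. no running task can make progress or release memory on its own.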

stephanie-wang (Owner) left a comment


Two things:

  1. Ideally, we shouldn't leak the number of spilled workers to the plasma request queue. Instead, can we do this logic in the task manager? The reason is that we already have a way for the task manager to notify the request queue when to block/allow spill, so we can build off of this codepath instead of creating a new one.
  2. Can you add a test in Python that demonstrates the new behavior? We want one that previously did not work when blocking spills was enabled and now does. Also, we should compare the runtime to the default Ray implementation that always spills.

stephanie-wang self-assigned this Mar 25, 2022
jaewan (Collaborator, Author) commented Mar 30, 2022

> Two things:
>
>   1. Ideally, we shouldn't leak the number of spilled workers to the plasma request queue. Instead, can we do this logic in the task manager? The reason is that we already have a way for the task manager to notify the request queue when to block/allow spill, so we can build off of this codepath instead of creating a new one.
>   2. Can you add a test in Python that demonstrates the new behavior? We want one that previously did not work when blocking spills was enabled and now does. Also, we should compare the runtime to the default Ray implementation that always spills.
  1. I changed the implementation so that the spill is triggered from the task_manager. This is done by setting should_spill_ in the object store via io_post.
    !!! Can you check this, Stephanie? Is this design OK, or should I go back to the previous design?
    1.1 I also changed it so that should_spill_ is set from the task_manager only when it should be true: (a) when #spinning_workers == #leased_workers, or (b) when evict_tasks() is called but does not evict anything. In the previous design, should_spill_ was always set after calling evict_tasks() and block_spill(), so the object store never needed to reset it. The advantage of that design is that, because the calls are asynchronous, there is a time lag between the task_manager setting should_spill_ and the object store acting on it, so more spill() calls can be triggered in the meantime. But I thought that was wasting communication cost, so I changed the implementation: when should_spill_ is set, the object_manager triggers spill once and then clears the flag itself, so the task_manager does not need to reset it. (A sketch of this flow follows this list.)

  2. We have a separate PR for the test scripts. Do you want me to add the test for this PR here or to the test-script PR?
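As referenced in point 1.1, here is a minimal sketch of that flag-based flow, assuming the task manager can post closures onto the object store's event loop; RequestSpill, ProcessRequests, SpillOnce, and io_post_ are hypothetical names, not the PR's actual functions.

```
#include <functional>
#include <utility>

class ObjectStoreQueue {
 public:
  // The executor stands in for posting work onto the object store's
  // io_context so the flag is only ever touched from that thread.
  explicit ObjectStoreQueue(std::function<void(std::function<void()>)> io_post)
      : io_post_(std::move(io_post)) {}

  // Called by the task manager when it detects a deadlock
  // (#spinning_workers == #leased_workers) or when evict_tasks()
  // evicted nothing.
  void RequestSpill() {
    io_post_([this]() { should_spill_ = true; });
  }

  // Called from the object store's own event loop.
  void ProcessRequests() {
    if (should_spill_) {
      SpillOnce();
      // Clear the flag here so the task manager never has to send a
      // second message just to reset it.
      should_spill_ = false;
    }
    // ... serve queued object-creation requests ...
  }

 private:
  void SpillOnce() { /* trigger a single spill */ }

  bool should_spill_ = false;
  std::function<void(std::function<void()>)> io_post_;
};
```

In this shape the flag is written and cleared only on the object store's own thread, which is why no extra round trip from the task manager is needed to reset it.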

```
on_object_creation_blocked_callback_(lowest_pri, enable_blocktasks, enable_evicttasks);
if (enable_blocktasks_spill) {
  RAY_LOG(DEBUG) << "[JAE_DEBUG] task " << task_id << " spins";
  for (auto it = queue_.begin(); it != queue_.end(); it++) {
```
stephanie-wang (Owner) commented on this snippet:
Can't we just use queue.size()?
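For illustration only: if the loop shown above merely counts the queued requests, the same value is available directly from the container, assuming queue_ is a standard container such as std::list.

```
#include <cstddef>
#include <iostream>
#include <list>

int main() {
  std::list<int> queue_ = {1, 2, 3};

  // Manual count, as in the reviewed loop.
  std::size_t count = 0;
  for (auto it = queue_.begin(); it != queue_.end(); it++) {
    ++count;
  }

  // Equivalent, and simpler.
  std::cout << count << " == " << queue_.size() << std::endl;
  return 0;
}
```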

stephanie-wang (Owner) commented Apr 7, 2022

The implementation looks much better!

> We have a separate PR for the test scripts. Do you want me to add the test for this PR here or to the test-script PR?

Can you add just one script that previously deadlocked and now works with this PR?

stephanie-wang merged commit 9bade78 into memory-scheduling on May 24, 2022
stephanie-wang deleted the memory-scheduling-deadlock branch on May 24, 2022 at 22:22
stephanie-wang pushed a commit that referenced this pull request on Aug 5, 2022:
We encountered a SIGSEGV when running the Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0  0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
   from /lib64/libstdc++.so.6
#1  0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2  0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3  0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4  0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
   from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5  0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6  0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
    at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).

It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.
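For context, here is a minimal sketch of this kind of cross-translation-unit initialization-order hazard and of why writing to `std::cerr` sidesteps it; the file and symbol names are invented for the sketch, not Ray's actual ones.

```
// ---- logger.cc (assumed) ----------------------------------------------
// A namespace-scope static with a non-trivial constructor, analogous to
// RayLog::logger_name_ in logging.cc:
//
//   std::string g_logger_name = "ray_log_sink";

// ---- delay_manager.cc (assumed) ---------------------------------------
#include <iostream>
#include <string>

extern std::string g_logger_name;  // defined in the other translation unit

struct DelayManager {
  DelayManager() {
    // UNSAFE: the initialization order of statics across translation
    // units is unspecified, so g_logger_name may not be constructed yet
    // when this runs (this is what RAY_LOG hits via RayLog::logger_name_).
    // std::cout << "logger: " << g_logger_name << "\n";

    // SAFE: std::cerr is set up by <iostream>'s ios_base::Init mechanism
    // before statics defined below the include in this translation unit,
    // so using it has no cross-TU ordering dependency.
    std::cerr << "RAY_testing_asio_delay_us is set" << std::endl;
  }
};

// Constructed during static initialization, e.g. when the .so is dlopen'ed.
static DelayManager g_delay_manager;
```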

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and removing the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as a `std::unique_ptr<DelayManager>` and initializing it in `get_delay_us`. This works, but it requires a lock to be thread-safe.
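For reference, a minimal sketch of the general "construct on first use" idiom behind approach 1, with the static moved into a dedicated accessor so that both `init()` and `get_delay_us` can reach it (a variation on the variant described above, where the static was local to `get_delay_us`); all names here are illustrative.

```
#include <cstdint>

class DelayManager {
 public:
  void Init() { delay_us_ = 0; /* e.g. re-read RAY_testing_asio_delay_us */ }
  int64_t GetDelayUs() const { return delay_us_; }

 private:
  int64_t delay_us_ = 0;
};

// Accessor replacing a namespace-scope `static DelayManager _delay_manager`.
DelayManager &GetDelayManager() {
  static DelayManager instance;  // constructed on first use, never too early
  return instance;
}

int64_t get_delay_us() { return GetDelayManager().GetDelayUs(); }
void init() { GetDelayManager().Init(); }
```

Since C++11 the initialization of the function-local static is itself thread-safe, which is what makes this idiom attractive compared with a bare global guarded by a lock.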