Docker autoscaling #15
python/ray/autoscaler/commands.py
Outdated
target: target file on host
override_cluster_name: set the name of the cluster
"""
config = yaml.load(open(config_file).read())
with open(config_file, 'r') as f:
    config = yaml.load(f)
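A minimal sketch of the pattern suggested above, assuming a hypothetical helper named `load_config` that is not part of this PR. The parser is passed in as a callable so the sketch runs with the standard library alone; with PyYAML installed, `yaml.safe_load` could be passed instead.

```python
import json
import os
import tempfile


def load_config(config_file, parse):
    """Parse a config file and close the handle even if parsing raises."""
    # The context manager closes the file deterministically, unlike
    # open(config_file).read(), which leaves closing to the garbage collector.
    with open(config_file, "r") as f:
        return parse(f)


# Demo with JSON so that no third-party package is needed.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cluster.json")
    with open(path, "w") as f:
        json.dump({"cluster_name": "demo"}, f)
    config = load_config(path, json.load)

print(config["cluster_name"])
```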
python/ray/autoscaler/commands.py
Outdated
config["cluster_name"] = override_cluster_name
config = _bootstrap_config(config)
base_path = os.path.basename(target_script)
cname = config["docker"]["container_name"]
container_name = ...
python/ray/autoscaler/commands.py
Outdated
config = _bootstrap_config(config)
base_path = os.path.basename(target_script)
cname = config["docker"]["container_name"]
cmd = "docker cp {} {}:{}".format(target_script, cname, base_path)
You might want to wrap the target with quotes, so that paths with ~ work too:
cmd = "docker cp {} {}:\"{}\"".format(target_script, cname, base_path)
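A minimal sketch of building that command string, assuming a hypothetical helper named `build_docker_cp` and a made-up container name. The double quotes around the destination follow the suggestion above; applying `shlex.quote` to the host path is an addition that is not in the PR.

```python
import os
import shlex


def build_docker_cp(target_script, container_name):
    """Return a shell command string that copies a script into a container."""
    base_path = os.path.basename(target_script)
    # shlex.quote protects spaces and shell metacharacters in the host path.
    # The destination is wrapped in double quotes, as in the review comment.
    return 'docker cp {} {}:"{}"'.format(
        shlex.quote(target_script), container_name, base_path)


cmd = build_docker_cp("/tmp/my script.py", "ray_docker")
print(cmd)
# shlex.split shows how a POSIX shell would tokenize the command.
print(shlex.split(cmd))
```

If the command is run without a shell, passing an argument list to `subprocess` avoids the quoting question altogether.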
Try hiding a symlink between ray_results for Tune and a host-container mount, which allows for transparent rsync between driver and worker nodes.

* Track parent actor of actor
* Update src/ray/raylet/node_manager.cc
  Co-Authored-By: Stephanie Wang <[email protected]>
* Update src/ray/raylet/node_manager.cc
  Co-Authored-By: Stephanie Wang <[email protected]>
* fixing a comment
* Fixing typo in a comment
* capturing task_spec instead of actor_data
* adding const for some local variables
* changing an if else to else
* Linted version
* use updated method to create task from task_data
  Change-Id: I9c1a65134dc23a2d175047e96b86ab9d9cf61971
* fixing linter issues
  Change-Id: I1def06218130b399d2527b999258aecf9abb98dd

…73) (ray-project#5193)
* make parameter space noise consistent with action space noise
* modified according to lint check
* indent

* fast reconstruct dead actors
* add test
* fix typos
* remove debug print
* small fix
* fix typos
* Update test_actor.py

…ay-project#5637)
* add internal pin method
* add pin
* update

* [rllib] better model docs
* fix
* s

* Start testing test_fork
  Maybe the queue actor takes too long to initialize, which is why we are seeing "Many python processes started", since most of the Python tasks are blocked on ray.get
* Add a comment

…timizer for multiagent (ray-project#5683)

* Validate that entropy coeff is not an integer
  Passing an integer value for entropy coeff, such as 0, raises an error somewhere inside the TF policy graph, so this checks to make sure the entropy coeff is a float.
* Cast to float instead
  Also move this check after the negative value check
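
The entropy coeff change described above can be sketched as follows. This is an illustration under assumptions, not the actual RLlib code: the function name `validate_entropy_coeff`, the config key, and the error message are hypothetical.

```python
def validate_entropy_coeff(config):
    """Reject negative values first, then coerce the coefficient to float."""
    # Hypothetical key name; the negative check runs before the cast.
    if config["entropy_coeff"] < 0:
        raise ValueError("entropy_coeff must be >= 0")
    # An integer such as 0 becomes 0.0 instead of being rejected.
    config["entropy_coeff"] = float(config["entropy_coeff"])
    return config


checked = validate_entropy_coeff({"entropy_coeff": 0})
print(type(checked["entropy_coeff"]).__name__)
```

Casting rather than rejecting keeps a config like `entropy_coeff: 0` usable while still handing a float to downstream code.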

We encountered a SIGSEGV when running the Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0  0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6
#1  0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2  0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3  0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4  0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5  0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6  0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>) at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).

It's better not to rely on the initialization order of static variables in different compilation units, because that order is not guaranteed. I propose changing all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and removing the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initializing it in `get_delay_us`. This works, but it requires a lock to be thread-safe.

Test