Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker autoscaling #15

Closed
wants to merge 249 commits into from
Closed

Conversation

richardliaw
Copy link
Owner

Test

target: target file on host
override_cluster_name: set the name of the cluster
"""
config = yaml.load(open(config_file).read())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with open(config_file, 'r') as f:
config = yaml.load(f)

config["cluster_name"] = override_cluster_name
config = _bootstrap_config(config)
base_path = os.path.basename(target_script)
cname = config["docker"]["container_name"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

container_name = ...

config = _bootstrap_config(config)
base_path = os.path.basename(target_script)
cname = config["docker"]["container_name"]
cmd = "docker cp {} {}:{}".format(target_script, cname, base_path)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to wrap the target with quotes, so that paths with ~ work too:

    cmd = "docker cp {} {}:\"{}\"".format(target_script, cname, base_path)

@richardliaw
Copy link
Owner Author

Try hiding a symlink between ray_results for Tune and a host-container mount, which allows for transparent rsync between driver and worker nodes.

vipulharsh and others added 18 commits July 11, 2019 14:52
* Track parent actor of actor

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: Stephanie Wang <[email protected]>

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: Stephanie Wang <[email protected]>

* fixing a comment

* Fixing typo in a comment

* capturing task_spec instead of actor_data

* adding const for some local variables

* changing an if else to else

* Linted version

* use updated method to create task from task_data

Change-Id: I9c1a65134dc23a2d175047e96b86ab9d9cf61971

* fixing linter issues

Change-Id: I1def06218130b399d2527b999258aecf9abb98dd
…73) (ray-project#5193)

*  make parameter space noise consistent with action space noise

*  modified according to lint check

*  indent
* fast reconstruct dead actors

* add test

* fix typos

* remove debug print

* small fix

* fix typos

* Update test_actor.py
kfstorm and others added 27 commits September 6, 2019 17:49
* Start testing test_fork

Maybe queue actor takes too long to initialize, that's why we are
seeing "Many python processes started" since most of the python
tasks are blocked on ray.get

* Add a comment
* Validate that entropy coeff is not an integer

Passing an integer value for entropy coeff such as 0 raises an error somewhere inside the TF policy graph, so this checks to make sure the entropy coeff is a float.

* Cast to float instead

Also move this check after the negative value check
@richardliaw richardliaw closed this Mar 1, 2020
richardliaw pushed a commit that referenced this pull request Mar 25, 2021
richardliaw pushed a commit that referenced this pull request Aug 2, 2022
We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:

```
#0  0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
   from /lib64/libstdc++.so.6
#1  0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2  0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3  0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4  0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
   from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5  0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6  0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9  0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
    at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```

The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).

It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.

The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.

BTW, I've tried different approaches:

1. Using a static local variable in `get_delay_us` and remove the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initialize it in `get_delay_us`. This works but it requires a lock to be thread-safe.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.