forked from ray-project/ray
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing crash during task preemption #6
Closed
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…" (ray-project#18719) This reverts commit 8ce01ea.
* wip * client tests working again * extra prints * start reconnect logic for proxier * local proxy more wip * delay cleanup logic working on proxy * Fix up dataservicer logic * lint + fix proxy data servicer exit logic * hmmm * delay cleanup always in dataservicer * fix last_seen check * cancel channel on error * explicitly request cleanup * cleanup request fixes * fix dataclient proxy * start idempotence logic * change default channel state * add backoff logic * move connection logic back into worker.__init__ * add logic for replay cache case where request was received but response hasn't been fully resolved * new proto entries for data stream caching * start replay_cache logic, increase cleanup delay * hardcode retries * Let data channel attempt reconnects * manually reset queue, remove replay_cache logic * reduce cleanup delay to 5 minutes * fix local tests * Remove async cache logic * retry async requests * simplify backoff logic * Fix ray client proto * Configurable reconnect grace period * Basic logsclient fix? * Configure grace through environment variable * Use stopped event to force faster datapath cleanup * Better connect+reconnect logic * fix reconnect_grace_period default * init fixes for reconnect_grace_period * cleanup * fix _get_client_id_from_context call * add logic for pathological cache cases * less intrusive data channel error message * fix tests * Make stuff less painful to read * add ordered replay cache for dataservicer, replay cache tests * fix ordering import, start_reconnect test * add middleman testing logic * enforce ordering of dataclient requests * retry wheels * grace period through env only, restore test_dataclient_disconnect * minor fixes * force rerun * less intrusive error msgs * address review * replay->response cache * remove unneeded sleep * typing * extra response cache test * fix error msg * remove TODO * add _reconnect_channel * add grace period test * store thread_id and req_id in metadata * Revert "store thread_id and req_id in metadata" This reverts commit 12bc05c. * Revert "Revert "store thread_id and req_id in metadata"" This reverts commit 67874cf. * fix metadata check * remove comment * removed unused cv * cast back to int * refactor Datapath for readability * Revert refactor This reverts commit f789bad. * fix comment * merge fixes * refactor _shutdown * address reviews * log errors in both cases * add comments * address reviews * move reconnect test to medium * Always propogate error to callbacks * readability * formatting * Faster cleanup on uncaught dataservicer errors * delete tmp file * offset commit * rrefactor * propagate data servicer error message * Stricter handling/propagation of errors * remove tmp file * better docs * forward reconnecting metadata * add annotation * fix invalidate + add test * fix docstrings and types * disable retries and caching if reconnect grace period is set to 0 * update comments * address review, increase ack batch size and skip ack's if reconnect isn't enabled * Don't terminate data stream on missing reconnecting metadata
* wip * wip * add horovod example * add example * lint * fix * address comments * updates * lint * update example * address comment * address comment * update * fix * Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py Co-authored-by: matthewdeng <[email protected]> * address comments * add back name mangling * fix tests * Update python/ray/util/sgd/v2/trainer.py * fix * lint * fix * fix docstring * Update python/ray/util/sgd/v2/tests/test_trainer.py Co-authored-by: matthewdeng <[email protected]> * update * fix failing test Co-authored-by: matthewdeng <[email protected]>
…CI (ray-project#18622) * Add Bazel config for building with llvm. Upgrade C++ std to 17. * Fix redis. Try fixing asan and tsan * Fix asan and format * Update comments. Co-authored-by: Chen Shen <[email protected]>
…ct#18677) * Allow Client{Object,Actor}Ref to accept a future. Check number of args and returns synchronously. * rename callback, fix
…ay-project#18180) * recovery failure uses same termination function * More cleanup * More cleanup * ips * wip * wip * wip * Fix tests * tweak
…t#18729)" (ray-project#18754) This reverts commit 57edc0c.
…xperiment (due to grid-search in IMPALA stress tests). (ray-project#18705)
…l delay. (ray-project#18555) Co-authored-by: Jun Gong <[email protected]>
* Fix assertion crash * test, lint * todo * x
Gym appears to have cut a release, 0.21. It isn't clear what changes were made between 0.19/0.20 and 0.21, as there is no change log available for the 0.21 release, so for now we'll pin gym to 0.19 until we can fully understand the breaking changes in gym 0.21. I suspect some things have just been removed from the regular gym installation that rllib has previously relied on. Will address later.
…ups. (ray-project#19129) * Add a regression test for the short term * done * address code review * lint
…project#19076)" (ray-project#19160) This reverts commit 4beba3f.
…t#18167)" (ray-project#19180) This reverts commit 545db13.
* remove unused * lint * lint * lint
Implemented basic DFS priority assignment and task blocking and preemption. Currently segfaults due to object already existing during task retry.
stephanie-wang
pushed a commit
that referenced
this pull request
Aug 5, 2022
We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is: ``` #0 0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) () from /lib64/libstdc++.so.6 #1 0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #2 0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #3 0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #4 0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #5 0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so #6 0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2 #7 0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2 #8 0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2 #9 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6 #10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2 #11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2 #12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6 #13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6 #14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2 #15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2 #16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>) at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369 ``` The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`). It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`. The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though. BTW, I've tried different approaches: 1. Using a static local variable in `get_delay_us` and remove the global variable. This doesn't work because `init()` needs to access the variable as well. 2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initialize it in `get_delay_us`. This works but it requires a lock to be thread-safe.
stephanie-wang
pushed a commit
that referenced
this pull request
Jan 22, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Why are these changes needed?
Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.