[core][compiled graphs] test_dag_exception_chained does not exit cleanly #48806

Open
ruisearch42 opened this issue Nov 19, 2024 · 0 comments
Labels:
- beta: Beta release feature
- bug: Something that is supposed to be working; but isn't
- compiled-graphs
- P1: Issue that should be fixed within a few weeks

What happened + What you expected to happen

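Running test_accelerated_dag.py ends in a segfault. In the log below, test_dag_exception_chained logs "Tearing down compiled DAG" but never "Teardown complete", and the next test, test_dag_exception_multi_output[True], crashes in CoreWorkerMemoryStore::GetImpl():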

python/ray/dag/tests/experimental/test_accelerated_dag.py::test_dag_exception_basic 2024-11-19 06:04:36,880     INFO worker.py:1821 -- Started a local Ray instance.
PASSED2024-11-19 06:04:37,303   INFO compiled_dag_node.py:1935 -- Tearing down compiled DAG
2024-11-19 06:04:37,303 INFO compiled_dag_node.py:1940 -- Cancelling compiled worker on actor: Actor(Actor, 58b862c7b9083222dd3b241001000000)
2024-11-19 06:04:37,305 INFO compiled_dag_node.py:1960 -- Waiting for worker tasks to exit
2024-11-19 06:04:37,305 INFO compiled_dag_node.py:1962 -- Teardown complete

python/ray/dag/tests/experimental/test_accelerated_dag.py::test_dag_exception_chained 2024-11-19 06:04:39,604   INFO worker.py:1821 -- Started a local Ray instance.
PASSED2024-11-19 06:04:40,030   INFO compiled_dag_node.py:1935 -- Tearing down compiled DAG

python/ray/dag/tests/experimental/test_accelerated_dag.py::test_dag_exception_multi_output[True] 2024-11-19 06:04:42,312        INFO worker.py:1821 -- Started a local Ray instance.
*** SIGSEGV received at time=1731996283 on cpu 24 ***
PC: @     0x7fd8513d6349  (unknown)  ray::core::CoreWorkerMemoryStore::GetImpl()
    @     0x7fd85483f420       2960  (unknown)
    @     0x7fd8513d6d60         64  ray::core::CoreWorkerMemoryStore::Get()
    @     0x7fd8513d7ac9        208  ray::core::CoreWorkerMemoryStore::Get()
    @     0x7fd8512d62fa       1888  ray::core::CoreWorker::GetObjects()
    @     0x7fd8512e180c        176  ray::core::CoreWorker::Get()
    @     0x7fd8511cb29f        240  __pyx_pw_3ray_7_raylet_10CoreWorker_41get_objects()
    @     0x5593606aaf37  (unknown)  method_vectorcall_VARARGS_KEYWORDS.cold
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440: *** SIGSEGV received at time=1731996283 on cpu 24 ***
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440: PC: @     0x7fd8513d6349  (unknown)  ray::core::CoreWorkerMemoryStore::GetImpl()
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440:     @     0x7fd85483f420       2960  (unknown)
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440:     @     0x7fd8513d6d60         64  ray::core::CoreWorkerMemoryStore::Get()
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440:     @     0x7fd8513d7ac9        208  ray::core::CoreWorkerMemoryStore::Get()
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440:     @     0x7fd8512d62fa       1888  ray::core::CoreWorker::GetObjects()
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440:     @     0x7fd8512e180c        176  ray::core::CoreWorker::Get()
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440:     @     0x7fd8511cb29f        240  __pyx_pw_3ray_7_raylet_10CoreWorker_41get_objects()
[2024-11-19 06:04:43,044 E 1175783 1205302] logging.cc:440:     @     0x5593606aaf37  (unknown)  method_vectorcall_VARARGS_KEYWORDS.cold
[2]    1175783 segmentation fault (core dumped)  pytest -vvs 

#48795 is a workaround, but we need a proper fix.

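While debugging, I found that teardown() is called from two different places, both times via __del__ (compiled_dag_node.py, line 2453): first while cloudpickle is serializing the actor class for `a = Actor.remote(0)`, and then during pytest's teardown phase, inside pytest_rerunfailures: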

python/ray/dag/tests/experimental/test_accelerated_dag.py::test_dag_exception_chained 2024-11-19 17:27:39,104   INFO worker.py:1821 -- Started a local Ray instance.
2024-11-19 17:27:39,378 INFO serialization.py:72 -- pickle_dumps() on <class 'test_accelerated_dag._modify_class.<locals>.Class'>
2024-11-19 17:27:39,382 INFO compiled_dag_node.py:2440 -- teardown() called
Stack (most recent call last):
  File "/opt/conda/envs/vllm-env/bin/pytest", line 8, in <module>
    sys.exit(console_main())
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/config/__init__.py", line 201, in console_main
    code = main()
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175, in main
    ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/main.py", line 330, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/main.py", line 283, in wrap_session
    session.exitstatus = doit(config, session) or 0
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/main.py", line 337, in _main
    config.hook.pytest_runtestloop(session=session)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/main.py", line 362, in pytest_runtestloop
    item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 113, in pytest_runtest_protocol
    runtestprotocol(item, nextitem=nextitem)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 132, in runtestprotocol
    reports.append(call_and_report(item, "call", log))
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 241, in call_and_report
    call = CallInfo.from_call(
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 341, in from_call
    result: TResult | None = func()
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 242, in <lambda>
    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 174, in pytest_runtest_call
    item.runtest()
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/python.py", line 1627, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/python.py", line 159, in pytest_pyfunc_call
    result = testfunction(**testargs)
  File "/home/ubuntu/ray/python/ray/dag/tests/experimental/test_accelerated_dag.py", line 1058, in test_dag_exception_chained
    a = Actor.remote(0)
  File "/home/ubuntu/ray/python/ray/actor.py", line 733, in remote
    return self._remote(args=args, kwargs=kwargs, **self._default_options)
  File "/home/ubuntu/ray/python/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/ray/python/ray/util/tracing/tracing_helper.py", line 388, in _invocation_actor_class_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/home/ubuntu/ray/python/ray/actor.py", line 1057, in _remote
    worker.function_actor_manager.export_actor_class(
  File "/home/ubuntu/ray/python/ray/_private/function_manager.py", line 487, in export_actor_class
    serialized_actor_class = pickle_dumps(
  File "/home/ubuntu/ray/python/ray/_private/serialization.py", line 73, in pickle_dumps
    return pickle.dumps(obj)
  File "/home/ubuntu/ray/python/ray/cloudpickle/cloudpickle.py", line 1479, in dumps
    cp.dump(obj)
  File "/home/ubuntu/ray/python/ray/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
  File "/home/ubuntu/ray/python/ray/cloudpickle/cloudpickle.py", line 1330, in reducer_override
    return self._function_reduce(obj)
  File "/home/ubuntu/ray/python/ray/cloudpickle/cloudpickle.py", line 1208, in _function_reduce
    return self._dynamic_function_reduce(obj)
  File "/home/ubuntu/ray/python/ray/cloudpickle/cloudpickle.py", line 1193, in _dynamic_function_reduce
    state = _function_getstate(func)
  File "/home/ubuntu/ray/python/ray/cloudpickle/cloudpickle.py", line 720, in _function_getstate
    func.__code__, itertools.chain(f_globals.values(), closure_values)
  File "/home/ubuntu/ray/python/ray/dag/compiled_dag_node.py", line 2453, in __del__
    self.teardown()
  File "/home/ubuntu/ray/python/ray/dag/compiled_dag_node.py", line 2440, in teardown
    logger.info("teardown() called", stack_info=True)
2024-11-19 17:27:39,390 INFO serialization.py:72 -- pickle_dumps() on <class 'ray.dag.compiled_dag_node._modify_class.<locals>.Class'>
PASSED2024-11-19 17:27:39,535   INFO compiled_dag_node.py:2440 -- teardown() called
Stack (most recent call last):
  File "/opt/conda/envs/vllm-env/bin/pytest", line 8, in <module>
    sys.exit(console_main())
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/config/__init__.py", line 201, in console_main
    code = main()
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175, in main
    ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/main.py", line 330, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/main.py", line 283, in wrap_session
    session.exitstatus = doit(config, session) or 0
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/main.py", line 337, in _main
    config.hook.pytest_runtestloop(session=session)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/main.py", line 362, in pytest_runtestloop
    item.config.hook.pytest_runtest_protocol(item=item, nextitem=nextitem)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 113, in pytest_runtest_protocol
    runtestprotocol(item, nextitem=nextitem)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 137, in runtestprotocol
    reports.append(call_and_report(item, "teardown", log, nextitem=nextitem))
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 241, in call_and_report
    call = CallInfo.from_call(
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 341, in from_call
    result: TResult | None = func()
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/runner.py", line 242, in <lambda>
    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pytest_rerunfailures.py", line 472, in pytest_runtest_teardown
    reruns = get_reruns_count(item)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pytest_rerunfailures.py", line 108, in get_reruns_count
    rerun_marker = _get_marker(item)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/pytest_rerunfailures.py", line 104, in _get_marker
    return item.get_closest_marker("flaky")
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/nodes.py", line 372, in get_closest_marker
    return next(self.iter_markers(name=name), default)
  File "/opt/conda/envs/vllm-env/lib/python3.10/site-packages/_pytest/nodes.py", line 344, in iter_markers
    return (x[1] for x in self.iter_markers_with_node(name=name))
  File "/home/ubuntu/ray/python/ray/dag/compiled_dag_node.py", line 2453, in __del__
    self.teardown()
  File "/home/ubuntu/ray/python/ray/dag/compiled_dag_node.py", line 2440, in teardown
    logger.info("teardown() called", stack_info=True)
2024-11-19 17:27:39,535 INFO compiled_dag_node.py:1935 -- Tearing down compiled DAG
2024-11-19 17:27:39,536 INFO worker.py:1885 -- ray.shutdown() called
2024-11-19 17:27:40,763 INFO worker.py:1924 -- ray.shutdown() finished

These should be good leads to the root cause.
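One possible reading of the two stacks, offered as an assumption rather than a confirmed diagnosis: in both, the `__del__` frame sits directly under unrelated code (`_function_getstate` in cloudpickle, `iter_markers` in pytest), which is what finalization by Python's cyclic garbage collector looks like. The sketch below reproduces that pattern in plain Python. `FakeCompiledDAG`, `run_test`, and `simulate` are hypothetical names for illustration and are not Ray code; the reference cycle is assumed, not taken from the Ray source.

```python
import gc

calls = []  # records which objects had teardown() run


class FakeCompiledDAG:
    """Hypothetical stand-in for an object whose __del__ calls teardown()."""

    def __init__(self, name):
        self.name = name
        # Assumed reference cycle: the refcount never drops to zero, so
        # __del__ only runs when the cyclic garbage collector gets to it.
        self._self_ref = self
        self._torn_down = False

    def teardown(self):
        # Idempotent guard so a second call is a no-op.
        if self._torn_down:
            return
        self._torn_down = True
        calls.append(self.name)

    def __del__(self):
        self.teardown()


def run_test():
    # The "test" returns without calling teardown() explicitly.
    FakeCompiledDAG("dag-from-previous-test")


def simulate():
    gc.disable()  # make the collection point deterministic for the demo
    try:
        run_test()
        before = list(calls)  # nothing torn down yet: the cycle keeps it alive
        # Stands in for an automatic collection that could be triggered by
        # any allocation in unrelated code (pickling, pytest hooks, ...).
        gc.collect()
        after = list(calls)
    finally:
        gc.enable()
    return before, after


before, after = simulate()
```

If this is what is happening, teardown() runs at whatever point the collector happens to fire, possibly in the middle of a later test, so calling teardown() explicitly at the end of each test would make the timing deterministic.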

Versions / Dependencies

2.39 / head

Reproduction script

Run test_accelerated_dag.py after reverting #48795

Issue Severity

None

ruisearch42 added the bug, triage, and compiled-graphs labels on Nov 19, 2024.
rkooo567 added the P1 and beta labels and removed the triage label on Nov 19, 2024.