[Core] Fix segfault when cancel and generator is used with a high load #40083
Conversation
Right now, future.result() returns and raises an exception when future.cancel() is called, although the coroutine itself is actually not cancelled. I verified this with a local example.
Is this behavior documented by Python?
It is not documented, but there is a relevant issue: python/cpython#105836
@@ -0,0 +1,33 @@
import requests
remove?
This looks extremely sketchy, and blocking the loop makes me uncomfortable.

We are already using an event to signal the completion of the coroutine inside of run_async_func_or_coro_in_event_loop (line 4298 in 9d4ae4e):

future.add_done_callback(lambda _: event.Notify())

I assume the current issue is that the callback added with add_done_callback is running prior to the coroutine actually completing. If that's the case, then how about we move event.Notify() into a finally block in the async_func just above? I believe that'll accomplish the same thing without requiring the blocking wait() call.
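For illustration, a minimal sketch of the suggested pattern. This is not Ray's actual run_async_func_or_coro_in_event_loop; the function name is hypothetical, and threading.Event / event.set() stand in for the C++-backed event and its Notify() call:

import asyncio
import threading

def run_coro_and_wait_for_completion(coro, loop: asyncio.AbstractEventLoop):
    # `loop` is assumed to be running in another thread.
    event = threading.Event()

    async def async_func():
        try:
            return await coro
        finally:
            # Fires only after the coroutine body (and its cleanup) has
            # finished, unlike add_done_callback, which also fires as soon
            # as the future is merely marked cancelled.
            event.set()

    future = asyncio.run_coroutine_threadsafe(async_func(), loop)
    event.wait()  # by the time this returns, the coroutine is really done
    return future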
@@ -1196,36 +1196,43 @@ async def execute_streaming_generator_async(

    Args:
        context: The context to execute streaming generator.
        coroutine_complete_event: The asyncio.Event to notify the
This and the type hint say asyncio.Event, but a threading.Event is passed in.
Ah, I tried the non-blocking approach and it didn't work, so I changed to the blocking way. Let's just go with #40122 as it makes more sense to me.
@@ -2235,7 +2260,7 @@ cdef void cancel_async_task(
        function_descriptor, name_of_concurrency_group_to_execute)
    future = worker.core_worker.get_queued_future(task_id)
    if future is not None:
-       future.cancel()
+       eventloop.call_soon_threadsafe(future.cancel)
why's this change needed btw?
This is to reduce the impact of the blocking call (so when things are cancelled, the coroutine is already context switched). I think with the new fix, it is not necessary.
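For context, a minimal sketch of what that one-line change does; the names eventloop and future mirror the diff above, the wrapper function is purely illustrative. asyncio futures are not thread-safe, so cancelling from another thread is handed off to the loop that owns the future:

import asyncio

def cancel_queued_future(eventloop: asyncio.AbstractEventLoop,
                         future: asyncio.Future) -> None:
    # Before the diff: future.cancel() ran directly on the calling
    # (non-loop) thread, which is not thread-safe for asyncio futures.
    # After the diff: the cancellation is queued and executed by the loop's
    # own thread, so it only takes effect once the loop has switched away
    # from the currently running coroutine step.
    eventloop.call_soon_threadsafe(future.cancel)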
@rkooo567 I just verified that my suggested change fixes the test case you've added. Without the following diff, I see the segfaults. With the following diff, the test passes consistently (without any of your other changes):
Made a PR to run CI against it: #40122
@edoakes this fix makes sense
There's a better/safer approach here: #40122
…#40083 (#40122) Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: SangBin Cho <[email protected]>
…ray-project#40083 (ray-project#40122) Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: SangBin Cho <[email protected]>
…#40083 (#40122) (#40126) Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Co-authored-by: SangBin Cho <[email protected]>
…ray-project#40083 (ray-project#40122) Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: SangBin Cho <[email protected]> Signed-off-by: Victor <[email protected]>
Why are these changes needed?
Right now, future.result() returns and raises an exception when future.cancel() is called, although the coroutine itself is actually not cancelled. I verified this with a local example.
This means future.result() can return and the task states are cleaned up (since the task is considered cancelled) while we are still running the actual coroutine and C++ code in a different thread. This causes random SIGABRT and SIGSEGV crashes when lots of cancel and generator tasks are running.
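As a rough standalone illustration of that race (a hypothetical repro sketch, not Ray code): cancelling the concurrent.futures.Future returned by asyncio.run_coroutine_threadsafe makes result() raise immediately, while the coroutine keeps running on the loop thread.

import asyncio
import concurrent.futures
import threading
import time

async def busy_coro():
    # Simulates work with no await point (like running C++ code):
    # a pending cancellation cannot interrupt it.
    time.sleep(0.5)
    print("coroutine is still running after the caller saw CancelledError")
    return "finished"

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

fut = asyncio.run_coroutine_threadsafe(busy_coro(), loop)
time.sleep(0.1)   # let the coroutine start on the loop thread
fut.cancel()      # marks the concurrent future cancelled right away

try:
    fut.result()
except concurrent.futures.CancelledError:
    # Raised immediately, even though busy_coro() is still executing.
    print("result() raised CancelledError")

time.sleep(1.0)   # give busy_coro() time to finish in the background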
For a generator, we use a custom await between generator runs for performance reasons:
ray/python/ray/_raylet.pyx, line 1222 in f11fa84
The same thing can technically happen for a regular async task (if it uses a thread pool). If it doesn't use a thread pool, it doesn't happen, because future.cancel is not scheduled until the coroutine is awaited (whereas a generator always has an await between each run because of the thread pool above).
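A rough sketch of that pattern (hypothetical, not the actual _raylet.pyx code): each step of the synchronous generator runs in a thread pool, so there is an await between every item, and that await is where a scheduled future.cancel() can land while the step's blocking/C++ work is still in flight.

import asyncio

_SENTINEL = object()

async def run_generator_steps(sync_gen, loop: asyncio.AbstractEventLoop):
    # Drive a synchronous generator from a coroutine, running one step per
    # thread-pool task and awaiting between steps.
    while True:
        # The await here is the point at which a pending cancellation
        # (scheduled on the loop) can take effect between items.
        item = await loop.run_in_executor(None, next, sync_gen, _SENTINEL)
        if item is _SENTINEL:
            break
        # ... report `item` to the caller here ...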
I fixed the issue by waiting until the coroutine is finished. Note that this is currently a blocking call, but it should usually be short, because by the time the cancellation error is raised, the coroutine is already cancelled.
Related issue number
Closes #39703
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.