Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve jl_print_task_backtraces() #50968

Closed
wants to merge 1 commit into from
Closed

Conversation

kpamnany
Copy link
Contributor

We use jl_rec_backtrace() which tries to set the task's tid to the current thread before gathering the backtrace. This will fail for tasks that are sticky to another thread as their tid is never reset. However, for jl_print_task_backtraces(), we aren't concerned about thread safety since we assume that all threads are stopped so we add a flag to jl_rec_backtrace() to ignore the task's tid.

With this, jl_print_task_backtraces() should now only miss tasks that are currently executing on threads other than the calling thread.

@kpamnany kpamnany requested a review from vtjnash August 18, 2023 19:29
@kpamnany kpamnany force-pushed the kp/more-task-backtraces branch 2 times, most recently from c812a9e to 857e51c Compare August 18, 2023 20:08
@vtjnash
Copy link
Member

vtjnash commented Aug 18, 2023

However, for jl_print_task_backtraces(), we aren't concerned about thread safety since we assume that all threads are stopped

Why? They are almost never stopped when this code is running. Particularly in GDB, we should assume all threads are running, since that is the default.

@brenhinkeller brenhinkeller added error handling Handling of exceptions by Julia or the user multithreading Base.Threads and related functionality labels Aug 19, 2023
@kpamnany kpamnany force-pushed the kp/more-task-backtraces branch from 857e51c to 525bcdb Compare August 21, 2023 13:54
@kpamnany
Copy link
Contributor Author

Here's my understanding:

When you attach gdb to your target process, all threads are stopped because the default behavior is all-stop mode. At this point, if we set scheduler-locking on before we call jl_print_task_backtraces(1), then the other threads remain stopped during the call, which makes it safe to do what we're doing in this PR.

Have I got this wrong?

@kpamnany kpamnany force-pushed the kp/more-task-backtraces branch from 525bcdb to 470fe83 Compare August 21, 2023 14:15
@vtjnash
Copy link
Member

vtjnash commented Aug 21, 2023

With set scheduler-locking on this code might deadlock and cause gdb to fail, though you can potentially notice in advance that one of the threads is currently holding one of the required locks and decide not to call the function

@vtjnash
Copy link
Member

vtjnash commented Aug 21, 2023

Also, if you can't acquire the tid lock here, that means the contents of memory on that thread is in an inconsistent state, and attempts to unwind it from here may go badly.

@kpamnany
Copy link
Contributor Author

With set scheduler-locking on this code might deadlock and cause gdb to fail, though you can potentially notice in advance that one of the threads is currently holding one of the required locks and decide not to call the function

Which required locks? I see jl_in_stackwalk; any others?

We use this jl_print_task_backtraces() call when our server process freezes and we're about to kill it. The idea is to try to spot any deadlocks that might have caused that freeze. If some task stacks are missing (as they are), then that makes this capability not very useful.

Completely open to any suggestions for improvement.

Also, if you can't acquire the tid lock here, that means the contents of memory on that thread is in an inconsistent state, and attempts to unwind it from here may go badly.

The tid lock (assuming you're talking about this line) cannot be acquired for sticky tasks running on other threads because their tids are never unset. That's the point of this patch.

@NHDaly
Copy link
Member

NHDaly commented Sep 14, 2023

The other major alternative to this PR, I think, would be to preempt all the other threads, exactly like the profiler does.

Then, once they're all paused, we should be free to safely walk the stacks of all Tasks, right?


But i think Kiran's question still stands: what locks would this stackwalking code be grabbing? Why should we need to preempt the threads inside julia's runtime if we have already paused them from GDB?

If there's any locks that this code is grabbing, can we use the new, unsafe ignore_tid flag to also ignore those locks and just not lock at all?

We use `jl_rec_backtrace()` which tries to set the task's tid to the
current thread before gathering the backtrace. This will fail for
tasks that are sticky to another thread as their tid is never reset.
However, for `jl_print_task_backtraces()`, we aren't concerned about
thread safety since we assume that all threads are stopped so we add
a flag to `jl_rec_backtrace()` to ignore the task's tid.

With this, `jl_print_task_backtraces()` should now only miss tasks
that are currently executing on threads other than the calling thread.
@kpamnany kpamnany force-pushed the kp/more-task-backtraces branch from 470fe83 to 4dc3a34 Compare September 18, 2023 17:58
@kpamnany
Copy link
Contributor Author

Okay, I added some comments that make it clear that jl_print_task_backtraces() is a Hail Mary attempt to get some useful information from a Julia process that is about to crash/be killed, which is how we use it. Does that help @vtjnash?

@kpamnany
Copy link
Contributor Author

We discussed this some more -- this capability has already been very useful for us so we'd really like to improve it.

How would you want to implement this capability @vtjnash, such that no tasks are missed?

@kpamnany
Copy link
Contributor Author

Superseded by #51430. Thanks @vtjnash!

@kpamnany kpamnany closed this Sep 23, 2023
@vtjnash vtjnash deleted the kp/more-task-backtraces branch September 23, 2023 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
error handling Handling of exceptions by Julia or the user multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants