Add USDTs for the task runtime #43453

jpsamaroo · 2021-12-17T20:11:49Z

These probes may prove useful for determining task boundaries (when a Julia thread is executing a given task), to be paired with other information, such as GC initiation.

@vchuravy do we want any other USDTs in this PR? I'd like to cover all cases where a task begins/finishes executing.

Todo:

Add task switch probes
Add process_events probes
Validate with BPFnative.jl
Remove PTLS argument from probes
Add get_next_task probes
Implement end-to-end demo dumping trace events to file

tkf · 2021-12-17T23:34:59Z

Maybe some of what @kpamnany did in #36870 makes sense as USDTs too?

Also, it'd be great if we can record what's going on in the multi-queue. I don't know if it's the correct way to do it, but maybe something like this?

             JL_UV_UNLOCK();
diff --git a/src/partr.c b/src/partr.c
index 048a841158..75eb367b09 100644
--- a/src/partr.c
+++ b/src/partr.c
@@ -142,6 +142,7 @@ static inline int multiq_insert(jl_task_t *task, int16_t priority)
     if (jl_atomic_load_relaxed(&heaps[rn].ntasks) >= tasks_per_heap) {
         uv_mutex_unlock(&heaps[rn].lock);
         // multiq insertion failed, increase #tasks per heap
+        JL_PROBE_RT_MULTIQ_INSERT_FULL(ptls->tid);
         return -1;
     }
 
@@ -154,6 +155,7 @@ static inline int multiq_insert(jl_task_t *task, int16_t priority)
         jl_atomic_store_relaxed(&heaps[rn].prio, task->prio);
     uv_mutex_unlock(&heaps[rn].lock);
 
+    JL_PROBE_RT_MULTIQ_INSERT_SUCCESS(ptls->tid);
     return 0;
 }
 
@@ -179,17 +181,22 @@ static inline jl_task_t *multiq_deletemin(void)
         else if (prio1 == prio2 && prio1 == INT16_MAX)
             continue;
         if (uv_mutex_trylock(&heaps[rn1].lock) == 0) {
-            if (prio1 == jl_atomic_load_relaxed(&heaps[rn1].prio))
+            if (prio1 == jl_atomic_load_relaxed(&heaps[rn1].prio)) {
+                JL_PROBE_RT_MULTIQ_DELETEMIN_TRYLOCK_SUCCESS(ptls->tid, i);
                 break;
+            }
             uv_mutex_unlock(&heaps[rn1].lock);
         }
     }
-    if (i == heap_p)
+    if (i == heap_p) {
+        JL_PROBE_RT_MULTIQ_DELETEMIN_EMPTYISH(ptls->tid);
         return NULL;
+    }
 
     task = heaps[rn1].tasks[0];
     if (!jl_set_task_tid(task, ptls->tid)) {
         uv_mutex_unlock(&heaps[rn1].lock);
+        JL_PROBE_RT_MULTIQ_DELETEMIN_TASKLOCK_FAILED(ptls->tid);
         goto retry;
     }
     int32_t ntasks = jl_atomic_load_relaxed(&heaps[rn1].ntasks) - 1;
@@ -204,6 +211,7 @@ static inline jl_task_t *multiq_deletemin(void)
     jl_atomic_store_relaxed(&heaps[rn1].prio, prio1);
     uv_mutex_unlock(&heaps[rn1].lock);
 
+    JL_PROBE_RT_MULTIQ_DELETEMIN_SUCCESS(ptls->tid, task);
     return task;
 }

I also wondered if it makes sense to record jl_process_events and if it can be used for debugging JuliaLang/Downloads.jl#136

diff --git a/src/jl_uv.c b/src/jl_uv.c
index 0f616cdebb..eff4c9a040 100644
--- a/src/jl_uv.c
+++ b/src/jl_uv.c
@@ -205,6 +205,7 @@ JL_DLLEXPORT int jl_process_events(void)
     jl_gc_safepoint_(ct->ptls);
     if (loop && (jl_atomic_load_relaxed(&_threadedregion) || jl_atomic_load_relaxed(&ct->tid) == 0)) {
         if (jl_atomic_load(&jl_uv_n_waiters) == 0 && jl_mutex_trylock(&jl_uv_mutex)) {
+            JL_PROBE_RT_PROCESS_EVENTS(ct);
             loop->stop_flag = 0;
             int r = uv_run(loop, UV_RUN_NOWAIT);
             JL_UV_UNLOCK();

But I've never succeeded in setting up bpftrace and so I don't know how much of it makes sense :)

jpsamaroo · 2021-12-18T03:02:00Z

Yeah, I'll have to give that a try!

Marking this as draft since I need to think longer on how all this will interact with things like user stack introspection; I'll probably split the ctx_switch probe into two.

jpsamaroo · 2021-12-18T15:57:21Z

And @tkf, I'll probably just implement a single JL_PROBE_RT_MULTIQ_STATE, with a Cint state argument which will indicate what multiq event happened (it's somewhat a pain to write and synchronize multiple probe programs). Nope, will take it mostly as-is, but I'm going to pass the full ptls instead of ptls->tid; does this seem excessive @tkf?

src/uprobes.d

tkf · 2021-12-18T23:56:34Z

pass the full ptls instead of ptls->tid

Can you look at ptls->tid when using bpftrace? If so, I think passing ptls is actually better since that's more general.

vchuravy · 2021-12-19T01:02:56Z

Can you look at ptls->tid when using bpftrace? If so, I think passing ptls is actually better since that's more general.

index 27689d5dc6..fa59df9f15 100755
--- a/contrib/bpftrace/rt_all.bt
+++ b/contrib/bpftrace/rt_all.bt
@@ -1,5 +1,11 @@
 #!/usr/bin/env bpftrace
 
+// From julia_threads.h
+struct jl_tls_states_t {
+    int16_t tid;
+    // ...
+};
+
 BEGIN
 {
     printf("Tracing Julia Task events... Hit Ctrl-C to end.\n");
@@ -8,7 +14,8 @@ BEGIN
 usdt:usr/lib/libjulia-internal.so:julia:rt__run__task
 {
     @resume[ustack] = count();
-    printf("Task running: %x (with PTLS %x)\n", arg0, arg1);
+    @ptls = ((struct jl_tls_states_t*)arg1);
+    printf("Task running: %x (on TID %x)\n", arg0, @ptls->tid);
 }
 
 usdt:usr/lib/libjulia-internal.so:julia:rt__pause__task

Yes we can. Sadly #include usr/include/julia/julia.h since we then collide with the kernel variants of some types :/

jpsamaroo · 2021-12-19T16:10:21Z

If we forget about bpftrace, we should be able to use CBinding.jl to parse Julia's headers, and with jpsamaroo/BPFnative.jl#31, we should be able to access PTLS struct members in Julia.

src/task.c

jpsamaroo · 2021-12-30T17:37:02Z

@tkf how would you briefly describe the deletemin__trylock__* probes?

doc/src/devdocs/probes.md

tkf · 2022-01-04T16:28:08Z

For queue-agnostic trace points, maybe something like this?

diff --git a/src/partr.c b/src/partr.c
index 0ba4f08627..4f1958159f 100644
--- a/src/partr.c
+++ b/src/partr.c
@@ -419,15 +419,21 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid)
 // get the next runnable task from the multiq
 static jl_task_t *get_next_task(jl_value_t *trypoptask, jl_value_t *q)
 {
+    jl_task_t *ct = jl_current_task;
+    JL_PROBE_RT_GET_NEXT_TASK_BEGIN(ct);
     jl_gc_safepoint();
     jl_value_t *args[2] = { trypoptask, q };
     jl_task_t *task = (jl_task_t*)jl_apply(args, 2);
     if (jl_typeis(task, jl_task_type)) {
         int self = jl_atomic_load_relaxed(&jl_current_task->tid);
         jl_set_task_tid(task, self);
+        JL_PROBE_RT_POP_STICKY_SUCCESS(ct, task);
         return task;
     }
-    return multiq_deletemin();
+    JL_PROBE_RT_POP_STICKY_FAILED(ct);
+    task = multiq_deletemin();
+    JL_PROBE_RT_GET_NEXT_TASK_END(ct, task);
+    return task;
 }
 
 static int may_sleep(jl_ptls_t ptls) JL_NOTSAFEPOINT

...or ptls instead of ct, although that's a pointer fetch away. But if we want to minimize the computation then pgcstack is better than ct. I have no idea if it matters though :) jl_apply loads these things anyway

tkf · 2022-01-04T16:37:33Z

Or maybe we also want to put the trace points in the task pop loop in jl_task_get_next? @kpamnany

Also, JL_PROBE_RT_GET_NEXT_TASK_BEGIN vs JL_PROBE_RT_TASK_GET_NEXT_BEGIN etc. are terrible way to distinguish them... Maybe call the first ones (the one inside get_next_task) JL_PROBE_RT_TASK_TRY_GET_NEXT_BEGIN etc.?

diff --git a/src/partr.c b/src/partr.c
index 0ba4f08627..e7e50b336b 100644
--- a/src/partr.c
+++ b/src/partr.c
@@ -445,9 +445,11 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
     jl_task_t *ct = jl_current_task;
     uint64_t start_cycles = 0;
 
+    JL_PROBE_RT_TASK_GET_NEXT_BEGIN(ct);
     while (1) {
         jl_task_t *task = get_next_task(trypoptask, q);
         if (task)
+            JL_PROBE_RT_TASK_GET_NEXT_END(ct, start_cycles);
             return task;
 
         // quick, race-y check to see if there seems to be any stuff in there
@@ -472,6 +474,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
             if (task) {
                 if (jl_atomic_load_relaxed(&ptls->sleep_check_state) != not_sleeping)
                     jl_atomic_store(&ptls->sleep_check_state, not_sleeping); // let other threads know they don't need to wake us
+                JL_PROBE_RT_TASK_GET_NEXT_END(ct, start_cycles);
                 return task;
             }
 
@@ -534,6 +537,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
             }
 
             // the other threads will just wait for on signal to resume
+            JL_PROBE_RT_SLEEP_BEGIN(ct);
             JULIA_DEBUG_SLEEPWAKE( ptls->sleep_enter = cycleclock() );
             int8_t gc_state = jl_gc_safe_enter(ptls);
             uv_mutex_lock(&sleep_locks[ptls->tid]);
@@ -544,6 +548,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
             assert(jl_atomic_load_relaxed(&ptls->sleep_check_state) == not_sleeping);
             uv_mutex_unlock(&sleep_locks[ptls->tid]);
             JULIA_DEBUG_SLEEPWAKE( ptls->sleep_leave = cycleclock() );
+            JL_PROBE_RT_SLEEP_END(ct);
             jl_gc_safe_leave(ptls, gc_state); // contains jl_gc_safepoint
             start_cycles = 0;
         }

doc/src/devdocs/probes.md

src/partr.c

IanButterworth · 2022-01-25T03:26:21Z

Note that if you try it out on a recent Gtk.jl version you'll need to have a Gtk window open to enable the eventloop. Also that any keyboard (like a ctrl-c) or mouse input will unblock the 5 second idle timeout, also it doesn't happen during profiling.. It's a tricky one to investigate!

kpamnany · 2022-02-01T21:49:11Z

Agreed, that would be useful to investigate with this! We'll need some way to detect when a task is finished for that; I'll have to figure out where that hook would go.

Here I think.

Edit to add: we want one here anyway so we can track task lifetimes in general.

kpamnany · 2022-02-01T21:57:20Z

(Also curious how to feed this data into https://perfetto.dev/, but that can come later 🙂)

I think we can use the hooks here to insert writes to a JSON as done in #36870.

vchuravy

Can you rebase?

src/task.c

tkf · 2022-02-02T03:05:12Z

I think the events related to task lifetime are

new (JL_PROBE_RT_NEW_TASK)
schedule (jl_enqueue_task + sticky queue?)
start (start_task)
"de-schedule" (JL_PROBE_RT_PAUSE_TASK) (occurs multiple times)
"re-schedule" (JL_PROBE_RT_RUN_TASK) (occurs multiple times)
finish (JL_PROBE_RT_FINISH_TASK)

I think looking at the interval of 3--6 is more important than 1--6 since (shortly before) 3 is where the stack is allocated. That is to say, it let us know how long a task holds onto stack which is a rather expensive resource. So, maybe it makes sense to add yet another trace point in start_task?

(I'm not sure how much 2 is useful though. It's roughly the same as 1 unless the user throws away the created Task or something. It's nice to have, but maybe not very essential. Also, maybe it's tricky to put a cheap trace point in a Julia function?)

Co-authored-by: Takafumi Arakaki <[email protected]>

jpsamaroo · 2022-02-02T20:51:52Z

src/partr.c

                    jl_atomic_store_relaxed(&ptls->sleep_check_state, not_sleeping); // let other threads know they don't need to wake us
+                    JL_PROBE_RT_SLEEP_CHECK_TASK_WAKE(ptls);


@kpamnany should this get its own probe? It was introduced during a rebase, so I just copy-pasta'd the probe from below.

Yes, in general, every state transition should get its own probe so that we can distinguish them. I'm not actually sure what's going on here in particular though so I'm not sure what to call this probe. Maybe SLEEP_CHECK_TASK_YIELD_WAKE? @vtjnash any better suggestion?

Most of these transitions are the same, the code structure just makes them copy the same structure frequently. In particular, all events between a given pair of fences (or locks) are indistinguishable. In practice, that also means JL_PROBE_RT_SLEEP_CHECK_WAKE can be indistinguishable from this event, since there is no fence since here until the next time we re-enter this loop and set sleep_check_state to sleeping.

jpsamaroo · 2022-02-02T20:54:14Z

@tkf JL_PROBE_RT_RUN_TASK is already triggered in start_task: https://github.com/JuliaLang/julia/pull/43453/files#diff-1ac89047bd5ddf49f271321da0d86793efdc534d0edd56548b3ea0dcd2f37c04R917. Do you want a separate probe there? It would make it slightly easier to differentiate them.

tkf · 2022-02-02T20:55:55Z

Yeah, I think it's better to differentiate "start" and "re-schedule" (post-yield)

vtjnash · 2022-02-03T19:51:06Z

This still sounds good to me. Someone could identify in post-analysis the difference between start and resume and pause (they are always a state toggle), but separating them early seems better.

kpamnany · 2022-02-07T22:12:30Z

From @tkf's list:

new (JL_PROBE_RT_NEW_TASK)

This is now at the end of jl_new_task() but if we move it to the beginning then the allocation and setup cost can be attributed to the proper task in analysis.

schedule (jl_enqueue_task + sticky queue?)

This is missing and should be added here (maybe RT_TASKQ_INSERT?)

start (start_task)

This is done here and the rest in the list are there too.

Since adding 2 above gives us insert, it probably makes sense to also have a probe for deletemin here (maybe RT_TASKQ_GET?). I also am now wondering if it might be good to split these two probes into 3 (start/finish/fail) so that we can easily measure these latencies in different load situations. Your thoughts @tkf and @vtjnash?

@jpsamaroo demonstrated using BPFnative.jl with this PR to me today and it'll be very straightforward to get the desired functionality from #36870 implemented on this.

tkf · 2022-02-08T00:29:18Z

More info sounds good to me!

Co-authored-by: Takafumi Arakaki <[email protected]>

tkf added the multithreading Base.Threads and related functionality label Dec 18, 2021

jpsamaroo marked this pull request as draft December 18, 2021 03:00

jpsamaroo force-pushed the jps/task-switch-usdt branch from f729150 to 7d8cba0 Compare December 18, 2021 17:40

vchuravy reviewed Dec 18, 2021

View reviewed changes

src/uprobes.d Outdated Show resolved Hide resolved

jpsamaroo force-pushed the jps/task-switch-usdt branch 2 times, most recently from 1f3c389 to e8e8fbb Compare December 18, 2021 21:07

jpsamaroo force-pushed the jps/task-switch-usdt branch from e8e8fbb to 7ec1f00 Compare December 19, 2021 15:21

vchuravy reviewed Dec 30, 2021

View reviewed changes

src/task.c Outdated Show resolved Hide resolved

jpsamaroo force-pushed the jps/task-switch-usdt branch from 0a38d0c to c5f2ebb Compare December 30, 2021 17:21

tkf reviewed Dec 30, 2021

View reviewed changes

doc/src/devdocs/probes.md Outdated Show resolved Hide resolved

kpamnany mentioned this pull request Jan 4, 2022

Runtime event trace visualization #36870

Closed

3 tasks

kpamnany reviewed Jan 4, 2022

View reviewed changes

doc/src/devdocs/probes.md Outdated Show resolved Hide resolved

jpsamaroo force-pushed the jps/task-switch-usdt branch 2 times, most recently from b80872f to a3d87d1 Compare January 4, 2022 20:22

tkf reviewed Jan 7, 2022

View reviewed changes