Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add USDTs for the task runtime #43453

Merged
merged 11 commits into from
Feb 9, 2022

Conversation

jpsamaroo
Copy link
Member

@jpsamaroo jpsamaroo commented Dec 17, 2021

These probes may prove useful for determining task boundaries (when a Julia thread is executing a given task), to be paired with other information, such as GC initiation.

@vchuravy do we want any other USDTs in this PR? I'd like to cover all cases where a task begins/finishes executing.

Todo:

  • Add task switch probes
  • Add process_events probes
  • Validate with BPFnative.jl
  • Remove PTLS argument from probes
  • Add get_next_task probes
  • Implement end-to-end demo dumping trace events to file

@tkf
Copy link
Member

tkf commented Dec 17, 2021

Maybe some of what @kpamnany did in #36870 makes sense as USDTs too?

Also, it'd be great if we can record what's going on in the multi-queue. I don't know if it's the correct way to do it, but maybe something like this?

             JL_UV_UNLOCK();
diff --git a/src/partr.c b/src/partr.c
index 048a841158..75eb367b09 100644
--- a/src/partr.c
+++ b/src/partr.c
@@ -142,6 +142,7 @@ static inline int multiq_insert(jl_task_t *task, int16_t priority)
     if (jl_atomic_load_relaxed(&heaps[rn].ntasks) >= tasks_per_heap) {
         uv_mutex_unlock(&heaps[rn].lock);
         // multiq insertion failed, increase #tasks per heap
+        JL_PROBE_RT_MULTIQ_INSERT_FULL(ptls->tid);
         return -1;
     }
 
@@ -154,6 +155,7 @@ static inline int multiq_insert(jl_task_t *task, int16_t priority)
         jl_atomic_store_relaxed(&heaps[rn].prio, task->prio);
     uv_mutex_unlock(&heaps[rn].lock);
 
+    JL_PROBE_RT_MULTIQ_INSERT_SUCCESS(ptls->tid);
     return 0;
 }
 
@@ -179,17 +181,22 @@ static inline jl_task_t *multiq_deletemin(void)
         else if (prio1 == prio2 && prio1 == INT16_MAX)
             continue;
         if (uv_mutex_trylock(&heaps[rn1].lock) == 0) {
-            if (prio1 == jl_atomic_load_relaxed(&heaps[rn1].prio))
+            if (prio1 == jl_atomic_load_relaxed(&heaps[rn1].prio)) {
+                JL_PROBE_RT_MULTIQ_DELETEMIN_TRYLOCK_SUCCESS(ptls->tid, i);
                 break;
+            }
             uv_mutex_unlock(&heaps[rn1].lock);
         }
     }
-    if (i == heap_p)
+    if (i == heap_p) {
+        JL_PROBE_RT_MULTIQ_DELETEMIN_EMPTYISH(ptls->tid);
         return NULL;
+    }
 
     task = heaps[rn1].tasks[0];
     if (!jl_set_task_tid(task, ptls->tid)) {
         uv_mutex_unlock(&heaps[rn1].lock);
+        JL_PROBE_RT_MULTIQ_DELETEMIN_TASKLOCK_FAILED(ptls->tid);
         goto retry;
     }
     int32_t ntasks = jl_atomic_load_relaxed(&heaps[rn1].ntasks) - 1;
@@ -204,6 +211,7 @@ static inline jl_task_t *multiq_deletemin(void)
     jl_atomic_store_relaxed(&heaps[rn1].prio, prio1);
     uv_mutex_unlock(&heaps[rn1].lock);
 
+    JL_PROBE_RT_MULTIQ_DELETEMIN_SUCCESS(ptls->tid, task);
     return task;
 }

I also wondered if it makes sense to record jl_process_events and if it can be used for debugging JuliaLang/Downloads.jl#136

diff --git a/src/jl_uv.c b/src/jl_uv.c
index 0f616cdebb..eff4c9a040 100644
--- a/src/jl_uv.c
+++ b/src/jl_uv.c
@@ -205,6 +205,7 @@ JL_DLLEXPORT int jl_process_events(void)
     jl_gc_safepoint_(ct->ptls);
     if (loop && (jl_atomic_load_relaxed(&_threadedregion) || jl_atomic_load_relaxed(&ct->tid) == 0)) {
         if (jl_atomic_load(&jl_uv_n_waiters) == 0 && jl_mutex_trylock(&jl_uv_mutex)) {
+            JL_PROBE_RT_PROCESS_EVENTS(ct);
             loop->stop_flag = 0;
             int r = uv_run(loop, UV_RUN_NOWAIT);
             JL_UV_UNLOCK();

But I've never succeeded in setting up bpftrace and so I don't know how much of it makes sense :)

@tkf tkf added the multithreading Base.Threads and related functionality label Dec 18, 2021
@jpsamaroo jpsamaroo marked this pull request as draft December 18, 2021 03:00
@jpsamaroo
Copy link
Member Author

Yeah, I'll have to give that a try!

Marking this as draft since I need to think longer on how all this will interact with things like user stack introspection; I'll probably split the ctx_switch probe into two.

@jpsamaroo
Copy link
Member Author

jpsamaroo commented Dec 18, 2021

And @tkf, I'll probably just implement a single JL_PROBE_RT_MULTIQ_STATE, with a Cint state argument which will indicate what multiq event happened (it's somewhat a pain to write and synchronize multiple probe programs). Nope, will take it mostly as-is, but I'm going to pass the full ptls instead of ptls->tid; does this seem excessive @tkf?

@jpsamaroo jpsamaroo force-pushed the jps/task-switch-usdt branch from f729150 to 7d8cba0 Compare December 18, 2021 17:40
src/uprobes.d Outdated Show resolved Hide resolved
@jpsamaroo jpsamaroo force-pushed the jps/task-switch-usdt branch 2 times, most recently from 1f3c389 to e8e8fbb Compare December 18, 2021 21:07
@tkf
Copy link
Member

tkf commented Dec 18, 2021

pass the full ptls instead of ptls->tid

Can you look at ptls->tid when using bpftrace? If so, I think passing ptls is actually better since that's more general.

@vchuravy
Copy link
Member

vchuravy commented Dec 19, 2021

Can you look at ptls->tid when using bpftrace? If so, I think passing ptls is actually better since that's more general.

index 27689d5dc6..fa59df9f15 100755
--- a/contrib/bpftrace/rt_all.bt
+++ b/contrib/bpftrace/rt_all.bt
@@ -1,5 +1,11 @@
 #!/usr/bin/env bpftrace
 
+// From julia_threads.h
+struct jl_tls_states_t {
+    int16_t tid;
+    // ...
+};
+
 BEGIN
 {
     printf("Tracing Julia Task events... Hit Ctrl-C to end.\n");
@@ -8,7 +14,8 @@ BEGIN
 usdt:usr/lib/libjulia-internal.so:julia:rt__run__task
 {
     @resume[ustack] = count();
-    printf("Task running: %x (with PTLS %x)\n", arg0, arg1);
+    @ptls = ((struct jl_tls_states_t*)arg1);
+    printf("Task running: %x (on TID %x)\n", arg0, @ptls->tid);
 }
 
 usdt:usr/lib/libjulia-internal.so:julia:rt__pause__task

Yes we can. Sadly #include usr/include/julia/julia.h since we then collide with the kernel variants of some types :/

@jpsamaroo jpsamaroo force-pushed the jps/task-switch-usdt branch from e8e8fbb to 7ec1f00 Compare December 19, 2021 15:21
@jpsamaroo
Copy link
Member Author

If we forget about bpftrace, we should be able to use CBinding.jl to parse Julia's headers, and with jpsamaroo/BPFnative.jl#31, we should be able to access PTLS struct members in Julia.

src/task.c Outdated Show resolved Hide resolved
@jpsamaroo jpsamaroo force-pushed the jps/task-switch-usdt branch from 0a38d0c to c5f2ebb Compare December 30, 2021 17:21
@jpsamaroo
Copy link
Member Author

@tkf how would you briefly describe the deletemin__trylock__* probes?

doc/src/devdocs/probes.md Outdated Show resolved Hide resolved
@kpamnany kpamnany mentioned this pull request Jan 4, 2022
3 tasks
@tkf
Copy link
Member

tkf commented Jan 4, 2022

For queue-agnostic trace points, maybe something like this?

diff --git a/src/partr.c b/src/partr.c
index 0ba4f08627..4f1958159f 100644
--- a/src/partr.c
+++ b/src/partr.c
@@ -419,15 +419,21 @@ JL_DLLEXPORT void jl_wakeup_thread(int16_t tid)
 // get the next runnable task from the multiq
 static jl_task_t *get_next_task(jl_value_t *trypoptask, jl_value_t *q)
 {
+    jl_task_t *ct = jl_current_task;
+    JL_PROBE_RT_GET_NEXT_TASK_BEGIN(ct);
     jl_gc_safepoint();
     jl_value_t *args[2] = { trypoptask, q };
     jl_task_t *task = (jl_task_t*)jl_apply(args, 2);
     if (jl_typeis(task, jl_task_type)) {
         int self = jl_atomic_load_relaxed(&jl_current_task->tid);
         jl_set_task_tid(task, self);
+        JL_PROBE_RT_POP_STICKY_SUCCESS(ct, task);
         return task;
     }
-    return multiq_deletemin();
+    JL_PROBE_RT_POP_STICKY_FAILED(ct);
+    task = multiq_deletemin();
+    JL_PROBE_RT_GET_NEXT_TASK_END(ct, task);
+    return task;
 }
 
 static int may_sleep(jl_ptls_t ptls) JL_NOTSAFEPOINT

...or ptls instead of ct, although that's a pointer fetch away. But if we want to minimize the computation then pgcstack is better than ct. I have no idea if it matters though :) jl_apply loads these things anyway

@tkf
Copy link
Member

tkf commented Jan 4, 2022

Or maybe we also want to put the trace points in the task pop loop in jl_task_get_next? @kpamnany

Also, JL_PROBE_RT_GET_NEXT_TASK_BEGIN vs JL_PROBE_RT_TASK_GET_NEXT_BEGIN etc. are terrible way to distinguish them... Maybe call the first ones (the one inside get_next_task) JL_PROBE_RT_TASK_TRY_GET_NEXT_BEGIN etc.?

diff --git a/src/partr.c b/src/partr.c
index 0ba4f08627..e7e50b336b 100644
--- a/src/partr.c
+++ b/src/partr.c
@@ -445,9 +445,11 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
     jl_task_t *ct = jl_current_task;
     uint64_t start_cycles = 0;
 
+    JL_PROBE_RT_TASK_GET_NEXT_BEGIN(ct);
     while (1) {
         jl_task_t *task = get_next_task(trypoptask, q);
         if (task)
+            JL_PROBE_RT_TASK_GET_NEXT_END(ct, start_cycles);
             return task;
 
         // quick, race-y check to see if there seems to be any stuff in there
@@ -472,6 +474,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
             if (task) {
                 if (jl_atomic_load_relaxed(&ptls->sleep_check_state) != not_sleeping)
                     jl_atomic_store(&ptls->sleep_check_state, not_sleeping); // let other threads know they don't need to wake us
+                JL_PROBE_RT_TASK_GET_NEXT_END(ct, start_cycles);
                 return task;
             }
 
@@ -534,6 +537,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
             }
 
             // the other threads will just wait for on signal to resume
+            JL_PROBE_RT_SLEEP_BEGIN(ct);
             JULIA_DEBUG_SLEEPWAKE( ptls->sleep_enter = cycleclock() );
             int8_t gc_state = jl_gc_safe_enter(ptls);
             uv_mutex_lock(&sleep_locks[ptls->tid]);
@@ -544,6 +548,7 @@ JL_DLLEXPORT jl_task_t *jl_task_get_next(jl_value_t *trypoptask, jl_value_t *q)
             assert(jl_atomic_load_relaxed(&ptls->sleep_check_state) == not_sleeping);
             uv_mutex_unlock(&sleep_locks[ptls->tid]);
             JULIA_DEBUG_SLEEPWAKE( ptls->sleep_leave = cycleclock() );
+            JL_PROBE_RT_SLEEP_END(ct);
             jl_gc_safe_leave(ptls, gc_state); // contains jl_gc_safepoint
             start_cycles = 0;
         }

doc/src/devdocs/probes.md Outdated Show resolved Hide resolved
@jpsamaroo jpsamaroo force-pushed the jps/task-switch-usdt branch 2 times, most recently from b80872f to a3d87d1 Compare January 4, 2022 20:22
src/partr.c Outdated Show resolved Hide resolved
src/partr.c Outdated Show resolved Hide resolved
src/partr.c Outdated Show resolved Hide resolved
src/partr.c Outdated Show resolved Hide resolved
@jpsamaroo jpsamaroo force-pushed the jps/task-switch-usdt branch from 3af9192 to d628f62 Compare January 19, 2022 15:28
src/partr.c Outdated Show resolved Hide resolved
@IanButterworth
Copy link
Member

Note that if you try it out on a recent Gtk.jl version you'll need to have a Gtk window open to enable the eventloop. Also that any keyboard (like a ctrl-c) or mouse input will unblock the 5 second idle timeout, also it doesn't happen during profiling.. It's a tricky one to investigate!

@kpamnany
Copy link
Contributor

kpamnany commented Feb 1, 2022

Agreed, that would be useful to investigate with this! We'll need some way to detect when a task is finished for that; I'll have to figure out where that hook would go.

Here I think.

Edit to add: we want one here anyway so we can track task lifetimes in general.

@kpamnany
Copy link
Contributor

kpamnany commented Feb 1, 2022

(Also curious how to feed this data into https://perfetto.dev/, but that can come later 🙂)

I think we can use the hooks here to insert writes to a JSON as done in #36870.

Copy link
Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rebase?

src/task.c Outdated Show resolved Hide resolved
@tkf
Copy link
Member

tkf commented Feb 2, 2022

I think the events related to task lifetime are

  1. new (JL_PROBE_RT_NEW_TASK)
  2. schedule (jl_enqueue_task + sticky queue?)
  3. start (start_task)
  4. "de-schedule" (JL_PROBE_RT_PAUSE_TASK) (occurs multiple times)
  5. "re-schedule" (JL_PROBE_RT_RUN_TASK) (occurs multiple times)
  6. finish (JL_PROBE_RT_FINISH_TASK)

I think looking at the interval of 3--6 is more important than 1--6 since (shortly before) 3 is where the stack is allocated. That is to say, it let us know how long a task holds onto stack which is a rather expensive resource. So, maybe it makes sense to add yet another trace point in start_task?

(I'm not sure how much 2 is useful though. It's roughly the same as 1 unless the user throws away the created Task or something. It's nice to have, but maybe not very essential. Also, maybe it's tricky to put a cheap trace point in a Julia function?)

@jpsamaroo jpsamaroo force-pushed the jps/task-switch-usdt branch from 1f98662 to 3ca21ab Compare February 2, 2022 20:50
jl_atomic_store_relaxed(&ptls->sleep_check_state, not_sleeping); // let other threads know they don't need to wake us
JL_PROBE_RT_SLEEP_CHECK_TASK_WAKE(ptls);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kpamnany should this get its own probe? It was introduced during a rebase, so I just copy-pasta'd the probe from below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in general, every state transition should get its own probe so that we can distinguish them. I'm not actually sure what's going on here in particular though so I'm not sure what to call this probe. Maybe SLEEP_CHECK_TASK_YIELD_WAKE? @vtjnash any better suggestion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these transitions are the same, the code structure just makes them copy the same structure frequently. In particular, all events between a given pair of fences (or locks) are indistinguishable. In practice, that also means JL_PROBE_RT_SLEEP_CHECK_WAKE can be indistinguishable from this event, since there is no fence since here until the next time we re-enter this loop and set sleep_check_state to sleeping.

@jpsamaroo
Copy link
Member Author

@tkf JL_PROBE_RT_RUN_TASK is already triggered in start_task: https://github.com/JuliaLang/julia/pull/43453/files#diff-1ac89047bd5ddf49f271321da0d86793efdc534d0edd56548b3ea0dcd2f37c04R917. Do you want a separate probe there? It would make it slightly easier to differentiate them.

@tkf
Copy link
Member

tkf commented Feb 2, 2022

Yeah, I think it's better to differentiate "start" and "re-schedule" (post-yield)

@vtjnash
Copy link
Member

vtjnash commented Feb 3, 2022

This still sounds good to me. Someone could identify in post-analysis the difference between start and resume and pause (they are always a state toggle), but separating them early seems better.

@kpamnany
Copy link
Contributor

kpamnany commented Feb 7, 2022

From @tkf's list:

  1. new (JL_PROBE_RT_NEW_TASK)

This is now at the end of jl_new_task() but if we move it to the beginning then the allocation and setup cost can be attributed to the proper task in analysis.

  1. schedule (jl_enqueue_task + sticky queue?)

This is missing and should be added here (maybe RT_TASKQ_INSERT?)

  1. start (start_task)

This is done here and the rest in the list are there too.

Since adding 2 above gives us insert, it probably makes sense to also have a probe for deletemin here (maybe RT_TASKQ_GET?). I also am now wondering if it might be good to split these two probes into 3 (start/finish/fail) so that we can easily measure these latencies in different load situations. Your thoughts @tkf and @vtjnash?

@jpsamaroo demonstrated using BPFnative.jl with this PR to me today and it'll be very straightforward to get the desired functionality from #36870 implemented on this.

@tkf
Copy link
Member

tkf commented Feb 8, 2022

More info sounds good to me!

@vchuravy vchuravy merged commit 7ccc83e into JuliaLang:master Feb 9, 2022
@jpsamaroo jpsamaroo deleted the jps/task-switch-usdt branch February 9, 2022 02:29
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request Feb 17, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants