Collect built-in Accelerator performance counter #681
-
Hey @LeiZhou-97, thank you for sharing this interesting use case! We prefer BPF over Linux perf because Linux perf has more overhead. The good news is that any PMU event Linux perf can collect, BPF can collect as well. In Kepler, the code for collecting performance events currently lives in two places: the BPF code here and the Go code here. Each performance event is identified by a hexadecimal code. We use the "golang.org/x/sys/unix" library, which provides constants for the widely available perf events, as seen here. Since your event is more specific, you'll need to use Linux perf to look up its code and then add a const with that value to our code. We might also need to update the power model to incorporate the new counter. Could you please create a pull request (PR)? That would be really helpful!
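To make the "create a const with this value" step concrete, here is a minimal sketch of how a raw event code is usually assembled on Intel CPUs: the event-select byte sits in bits 0-7 and the unit mask in bits 8-15 of the perf config value, which would then be used with the raw event type. The function name and the two example bytes are hypothetical placeholders, not the real AMX values and not existing Kepler code; the real bytes have to be looked up for the target CPU.

```go
package main

import "fmt"

// rawEventConfig packs an Intel event-select byte and a unit-mask byte into
// the low 16 bits of a raw perf config value: (umask << 8) | eventSelect.
// The name is hypothetical and does not exist in Kepler.
func rawEventConfig(eventSelect, umask uint8) uint64 {
	return uint64(umask)<<8 | uint64(eventSelect)
}

// Placeholder bytes for illustration only (assumption). Replace them with the
// values reported for the event you actually want to collect.
const (
	exampleEventSelect = 0xAB
	exampleUmask       = 0x02
)

func main() {
	cfg := rawEventConfig(exampleEventSelect, exampleUmask)
	fmt.Printf("raw config: 0x%04x\n", cfg) // prints "raw config: 0x02ab"
}
```

The resulting value is the kind of hexadecimal constant that could sit next to the predefined perf event constants.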
-
Thanks @marceloamaral for the quick reply. This topic is actually related to the latest discussion here: we plan to introduce an in-die (built-in) accelerator framework into the current Kepler power source. Before that, we first need to figure out a proper way to identify these new perf events. Let's discuss here :)
-
There is a misunderstanding, @marceloamaral. The difficulty we are facing is not reading the specific event's counter, but computing the counter delta and attributing it to the right application PID. Since AMX execution is asynchronous to the application process, the current delta calculation in the task_switch hook does not apply to this case.
-
Got it @jiere... Does perf count the AMX events per process? If yes, can you check its implementation to see how it does that?
-
I'm currently working on collecting AMX (the next generation after AVX-512) performance counters. Intel SPR CPUs provide the following hardware counter:
EXE.AMX_BUSY: Counts the cycles where the AMX (Advanced Matrix Extensions) unit is busy performing an operation.
From this counter, we can tell how long AMX has been in use.
I first tried to use the existing BPF framework in Kepler to collect amx_busy the same way as cpu_cycles, but found that this doesn't work for AMX.
Cycles and cache misses are synchronous hardware counters: the counts accumulated between two task switches belong to the process that was running during that interval.
With AMX, after the CPU offloads data to AMX, the CPU can be occupied by other processes while AMX is still running, so the counter value is charged to those other processes.
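The misattribution can be illustrated with a small simulation. This is only a sketch under the behaviour described above, not Kepler's BPF code: the type, function name, PIDs, and counter values are all made up. It charges each counter delta to the task being switched out, which is correct for a synchronous counter but wrong when the busy cycles were caused by a task that has already left the CPU.

```go
package main

import "fmt"

// switchEvent is a simulated context switch: the PID leaving the CPU and the
// value of a free-running hardware counter read at that moment.
type switchEvent struct {
	prevPID int
	counter uint64
}

// attributeDeltas charges each counter delta to the task that was on the CPU
// during the interval, i.e. the task being switched out (hypothetical helper).
func attributeDeltas(start uint64, events []switchEvent) map[int]uint64 {
	usage := make(map[int]uint64)
	last := start
	for _, ev := range events {
		usage[ev.prevPID] += ev.counter - last
		last = ev.counter
	}
	return usage
}

func main() {
	// Made-up scenario: PID 100 submits AMX work and is switched out after 10
	// busy cycles. The unit stays busy for 500 more cycles while PID 200 is on
	// the CPU, so those cycles end up on PID 200's account.
	events := []switchEvent{
		{prevPID: 100, counter: 10},
		{prevPID: 200, counter: 510},
	}
	fmt.Println(attributeDeltas(0, events))
}
```

In this scenario PID 200 is charged 500 busy cycles it did not cause, which is the problem with reusing the task-switch delta for an asynchronous counter.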
My idea now is to use Linux perf directly to track the AMX-related counter values generated by a given PID. I know this adds some overhead, but tracing a PMU event from a BPF program also depends on the Linux perf subsystem, so I personally think it is acceptable. A config option could be added to Kepler later so that end users can decide whether to enable this feature; it is a trade-off.
Is this solution acceptable to the community? Better ideas are welcome from everyone. Thanks!