Collect built-in Accelerator performance counter #681

LeiZhou-97 · 2023-05-12T04:25:17Z

I'm currently working on collecting performance counters for AMX (the next generation of AVX-512). Intel SPR (Sapphire Rapids) CPUs provide the following hardware counter:

EXE.AMX_BUSY: Counts the cycles where the AMX (Advanced Matrix Extensions) unit is busy performing an operation.

Based on this counter, we can determine how long AMX has been in use.
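As a rough illustration of turning the counter into "how long AMX was used", the AMX_BUSY count can be compared with the core cycle count over the same window, or divided by a core frequency. This is only a sketch: the function names are hypothetical, and it assumes both counts come from the same CPU over the same interval and (for the time conversion) a constant core frequency, which turbo and power management do not guarantee in practice.

```python
def amx_busy_ratio(amx_busy_cycles, core_cycles):
    """Fraction of core cycles during which the AMX unit was busy.

    Assumes both counts were read on the same CPU over the same window.
    """
    if core_cycles <= 0:
        raise ValueError("core_cycles must be positive")
    if amx_busy_cycles < 0:
        raise ValueError("amx_busy_cycles must be non-negative")
    return amx_busy_cycles / core_cycles


def amx_busy_seconds(amx_busy_cycles, core_hz):
    """Convert busy cycles to seconds, ASSUMING a fixed core frequency in Hz."""
    if core_hz <= 0:
        raise ValueError("core_hz must be positive")
    return amx_busy_cycles / core_hz
```

For example, 250 busy cycles out of 1000 core cycles gives a ratio of 0.25.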

Previously, I tried to use Kepler's existing BPF framework to collect AMX_BUSY the same way it collects cpu_cycles, but this approach doesn't work for AMX.

Cycles and cache misses are synchronous hardware counters: the counter delta between two task switches belongs to the process that was running in between.

For AMX, after the CPU offloads data to the AMX unit, other processes may be scheduled on the CPU while AMX is still running, so the counter increments are charged to those other processes.
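The attribution problem can be shown with a toy timeline. This is a simplified model of the scenario described above, not Kepler's actual BPF code: each slice records which task is on the CPU, how much the counter advanced, and (hypothetically) which task's AMX work caused the increment. That third field is ground truth the hardware does not expose; it is only there to show the mismatch.

```python
from collections import Counter


def attribute_by_running_task(slices):
    """Charge each counter delta to the task that was on the CPU.

    Mirrors delta-at-task_switch accounting: the hook reads the counter at
    every switch and credits the difference to the task that just ran.
    """
    totals = Counter()
    counter = 0     # simulated free-running per-CPU hardware counter
    last_read = 0   # counter value read at the previous task switch
    for running_pid, increment, _issuer in slices:
        counter += increment
        totals[running_pid] += counter - last_read
        last_read = counter
    return dict(totals)


def attribute_by_issuer(slices):
    """Hypothetical 'correct' attribution to the task that issued the AMX work."""
    totals = Counter()
    for _running_pid, increment, issuer in slices:
        totals[issuer] += increment
    return dict(totals)


# (pid on CPU, counter increment during the slice, pid whose AMX work caused it)
timeline = [
    ("A", 100, "A"),  # A runs and starts AMX work
    ("B", 300, "A"),  # B is scheduled while A's AMX work keeps the unit busy
]
```

Here attribute_by_running_task(timeline) returns {'A': 100, 'B': 300}, while attribute_by_issuer(timeline) returns {'A': 400}: the 300 cycles caused by A are charged to B.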

My current idea is to use Linux perf directly to track the AMX-related counter values generated by a given PID. I know this adds some overhead, but a BPF program that traces PMU events also depends on the Linux perf subsystem. So I think this is acceptable, and since there is a trade-off, a config option could later be added to Kepler to let end users decide whether to turn this feature on.
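To make the "use Linux perf directly per PID" idea concrete, here is a minimal sketch that calls perf_event_open(2) through ctypes to open a counting raw PMU event on one task. Kepler itself is Go and would go through golang.org/x/sys/unix; Python is used here only for illustration. Assumptions: x86_64 (syscall number 298), the 64-byte PERF_ATTR_SIZE_VER0 prefix of struct perf_event_attr, and a raw config of 0x10B7 for EXE.AMX_BUSY (event 0xB7, umask 0x10), which should be verified against Intel's event list on the target host. Also note that, as with any per-task perf event, counts only accrue while the target task is scheduled on a CPU.

```python
import ctypes
import errno
import os
import platform
import struct

PERF_TYPE_RAW = 4                  # enum perf_type_id in linux/perf_event.h
PERF_ATTR_SIZE_VER0 = 64           # size of the original perf_event_attr layout
NR_PERF_EVENT_OPEN_X86_64 = 298
ATTR_FLAG_EXCLUDE_KERNEL = 1 << 5  # bit position of perf_event_attr.exclude_kernel

# ASSUMPTION: raw config for EXE.AMX_BUSY (event 0xB7, umask 0x10).
# Verify against Intel's perfmon data or `perf list -v` on the SPR host.
EXE_AMX_BUSY_CONFIG = 0x10B7


class PerfEventAttrV0(ctypes.Structure):
    """First 64 bytes of struct perf_event_attr (PERF_ATTR_SIZE_VER0)."""
    _fields_ = [
        ("type", ctypes.c_uint32),
        ("size", ctypes.c_uint32),
        ("config", ctypes.c_uint64),
        ("sample_period", ctypes.c_uint64),
        ("sample_type", ctypes.c_uint64),
        ("read_format", ctypes.c_uint64),
        ("flags", ctypes.c_uint64),  # packed bitfield: disabled, inherit, ...
        ("wakeup_events", ctypes.c_uint32),
        ("bp_type", ctypes.c_uint32),
        ("config1", ctypes.c_uint64),
    ]


def open_task_counter(pid, config=EXE_AMX_BUSY_CONFIG):
    """Open a counting raw PMU event for one task on any CPU; returns an fd.

    The 'disabled' flag is left clear, so counting starts at open time.
    """
    if platform.machine() != "x86_64":
        raise OSError(errno.ENOSYS, "this sketch only supports x86_64")
    attr = PerfEventAttrV0(type=PERF_TYPE_RAW, size=PERF_ATTR_SIZE_VER0,
                           config=config, flags=ATTR_FLAG_EXCLUDE_KERNEL)
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    fd = libc.syscall(
        ctypes.c_long(NR_PERF_EVENT_OPEN_X86_64),
        ctypes.byref(attr),
        ctypes.c_long(pid),  # target task
        ctypes.c_long(-1),   # cpu = -1: count the task on whichever CPU it runs
        ctypes.c_long(-1),   # group_fd = -1: standalone event
        ctypes.c_ulong(0),   # no PERF_FLAG_* flags
    )
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def read_counter(fd):
    """With read_format == 0, read() returns one native-endian u64 count."""
    return struct.unpack("=Q", os.read(fd, 8))[0]
```

Reading the counter before and after an interval gives the delta for that PID; close the descriptor with os.close(fd) when done. Opening events on other users' processes typically needs CAP_PERFMON (or a permissive perf_event_paranoid setting).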

Is this solution acceptable to the community? Better ideas are very welcome. Thanks!

marceloamaral (Maintainer) · 2023-05-12T05:49:34Z

Hey @LeiZhou-97, thank you for sharing this interesting use case!

We prefer BPF over Linux perf because Linux perf has more overhead.

The great thing is that any PMU event that Linux perf can collect, BPF can collect as well.

In Kepler, we currently have code in two places for collecting performance events: the BPF code here and the golang code here.

Each performance event is associated with a hexadecimal code. We use the "golang.org/x/sys/unix" library, which provides a list of widely available perf events, as seen here. However, since your case is more specific, you'll need to use Linux perf to get the event code and then create a const in our code with that value.
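For Intel core PMU events, the hexadecimal code is packed from the event number and unit mask; on Intel the layout is advertised in /sys/bus/event_source/devices/cpu/format (event = config:0-7, umask = config:8-15). A small sketch of that packing follows; the 0xB7/0x10 values used for EXE.AMX_BUSY below are an assumption to verify with perf or Intel's event list, and the resulting number is what a Go const would hold.

```python
def intel_raw_config(event, umask):
    """Pack an Intel core PMU event number and unit mask into perf's config.

    Only event (bits 0-7) and umask (bits 8-15) are set; cmask, inv and edge
    are left at zero.
    """
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event and umask must each fit in 8 bits")
    return event | (umask << 8)


def perf_raw_spec(config):
    """Format a config in perf's raw event syntax, e.g. for `perf stat -e`."""
    return f"r{config:x}"


# ASSUMED encoding of EXE.AMX_BUSY on SPR: event 0xB7, umask 0x10.
EXE_AMX_BUSY = intel_raw_config(0xB7, 0x10)
```

With these assumed values, EXE_AMX_BUSY is 0x10B7, and perf_raw_spec(EXE_AMX_BUSY) gives "r10b7", which can be passed to perf stat -e to cross-check the code against the named event.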

Additionally, we might need to update the power model to incorporate this new counter.

Could you please create a pull request (PR)? That would be really helpful!

6 replies

LeiZhou-97 (Author) · May 15, 2023

Let me add some more comments.

AMX doesn't depend on any specific kernel function (kfunc); users can use AMX entirely from user space.

Sample code: https://github.com/intel/AMX-TMUL-Code-Samples/blob/main/src/test-amxtile.c

marceloamaral (Maintainer) · May 15, 2023

When an application begins using the AMX accelerator, it requests access permission through a system call: syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA). The kernel function that handles this call is possibly the one here.
We could then use this kernel function to determine when a process starts using the accelerator.

When the application finishes using AMX, it releases the accelerator with _tile_release(). I'm not entirely certain, but this function likely calls the kernel function tile_release.
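For reference, the permission calls mentioned above can be exercised with a short sketch. The constant values (SYS_arch_prctl = 158 on x86_64, ARCH_GET_XCOMP_PERM = 0x1022, ARCH_REQ_XCOMP_PERM = 0x1023, XFEATURE_XTILEDATA = 18) are the ones used in Intel's test-amxtile.c sample linked in this thread; the Python wrapper names are hypothetical. Both calls act on the calling process only, and being permitted says nothing about whether AMX is actually being used.

```python
import ctypes
import platform

SYS_ARCH_PRCTL = 158          # x86_64 syscall number for arch_prctl
ARCH_GET_XCOMP_PERM = 0x1022
ARCH_REQ_XCOMP_PERM = 0x1023
XFEATURE_XTILEDATA = 18

_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def request_amx_permission():
    """Ask the kernel to let this process use AMX tile data (Linux >= 5.16)."""
    if platform.machine() != "x86_64":
        return False
    rc = _libc.syscall(ctypes.c_long(SYS_ARCH_PRCTL),
                       ctypes.c_long(ARCH_REQ_XCOMP_PERM),
                       ctypes.c_ulong(XFEATURE_XTILEDATA))
    return rc == 0


def amx_permitted():
    """True if XTILEDATA is in this process's permitted xfeature bitmap."""
    if platform.machine() != "x86_64":
        return False
    bitmask = ctypes.c_uint64(0)
    rc = _libc.syscall(ctypes.c_long(SYS_ARCH_PRCTL),
                       ctypes.c_long(ARCH_GET_XCOMP_PERM),
                       ctypes.byref(bitmask))
    return rc == 0 and bool(bitmask.value & (1 << XFEATURE_XTILEDATA))
```

On kernels or CPUs without AMX support, both functions simply return False.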

LeiZhou-97 (Author) · May 16, 2023

Yep, you are right. arch_prctl can show whether the current process has permission to use AMX.

But it cannot determine if AMX is being used. As I said before, the actual use of AMX does not depend on the kernel.

We can use arch_prctl as the trigger for enabling AMX tracing on the current process, and use Linux perf for hardware counter collection.

marceloamaral (Maintainer) · May 16, 2023

Can multiple processes utilize AMX simultaneously?

If one process obtains access permission through arch_prctl, can another process also acquire the permission, or does the previous process need to release it first?

LeiZhou-97 (Author) · May 16, 2023

Can multiple processes utilize AMX simultaneously?
[Yes, AMX is an in-die accelerator.]

If one process obtains access permission through arch_prctl, can another process also acquire the permission
[Yes]

or does the previous process need to release it first?
[No need. I also traced tile_release() with bpftrace, and it is not called while AMX is running. According to its comment, it is invoked from the cpuidle driver so that AMX state does not prevent the CPU from entering low-power idle.]

On second thought, I take back what I said earlier about using arch_prctl as the config for enabling AMX tracing. I think it's better to give end users a separate configuration option, because some of them may not need to trace AMX events while using AMX, and tracing AMX events all the time has overhead.
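A separate, off-by-default switch could look roughly like the sketch below. The environment variable name ENABLE_AMX_COUNTERS and both helper functions are hypothetical, not existing Kepler settings; open_counter stands for any per-PID opener, such as a perf_event_open wrapper.

```python
import os

# HYPOTHETICAL opt-in switch; not an existing Kepler setting.
AMX_TRACING_ENV = "ENABLE_AMX_COUNTERS"
_TRUTHY = {"1", "true", "yes", "on"}


def amx_tracing_enabled(env=None):
    """AMX counter collection stays off unless the user explicitly enables it."""
    env = os.environ if env is None else env
    return env.get(AMX_TRACING_ENV, "").strip().lower() in _TRUTHY


def maybe_open_amx_counters(pids, open_counter, env=None):
    """Open per-PID AMX counters only when tracing is enabled.

    open_counter is any callable that takes a PID and returns a handle.
    """
    if not amx_tracing_enabled(env):
        return {}
    return {pid: open_counter(pid) for pid in pids}
```

Keeping the default off means users who never enable the switch pay none of the per-PID perf overhead.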

jiere (Collaborator) · 2023-05-12T15:24:31Z

Thanks @marceloamaral for the quick reply. This topic is related to the latest discussion here: we plan to introduce an in-die (built-in) accelerator framework into Kepler's current power sources. Before that, we need to figure out a proper way to identify these new perf events. Let's discuss here :)

jiere (Collaborator) · 2023-05-15T05:39:32Z

There is a misunderstanding, @marceloamaral. The difficulty we face is not retrieving the specific event's counter, but establishing the relationship between a counter delta and the PID of the application that caused it. Since AMX execution is asynchronous to the application process, the current delta calculation in the task_switch hook does not apply to this case.

marceloamaral (Maintainer) · 2023-05-15T05:52:36Z

Got it @jiere...

Does perf count AMX events per process? If so, can you check its implementation to see how it does it?
Is there any kernel function that is called when a process starts or stops using AMX?
