[Issue]: Zero TCC_HIT_sum all the time #150

Open
RookieT0T opened this issue Dec 20, 2024 · 4 comments

RookieT0T commented Dec 20, 2024

Problem Description

While using rocprofv2 to collect performance counters such as TCC_HIT_sum and TCC_MISS_sum on Vega 20, I found that TCC_HIT_sum is always 0, while TCC_MISS_sum shows non-zero values, so I assume the miss counter works. It would be much appreciated if you could investigate why the hit counters are always 0 (including the per-bank hit counters of all 16 cache banks) and double-check whether the TCC miss values are correct. This problem occurs both with ROCm 6.2.2-116 and in the 6.3.0 Docker image.

Example output.csv returned from the profiler:
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,FlatVMemInsts,TCC_EA_RDREQ_sum,TCC_EA_RDREQ_32B_sum,TCC_HIT_sum,TCC_MISS_sum,TCC_MISS[12],TCC_MISS[13],TCC_MISS[14],TCC_MISS[15],TCC_HIT[0],TCC_HIT[1],TCC_HIT[2],TCC_HIT[3],TCC_HIT[4],TCC_HIT[5],TCC_HIT[6],TCC_HIT[7],TCC_HIT[8],TCC_HIT[9],TCC_HIT[10],TCC_HIT[11],TCC_HIT[12],TCC_HIT[13],TCC_HIT[14],TCC_HIT[15],TA_FLAT_WRITE_WAVEFRONTS_sum,TA_FLAT_READ_WAVEFRONTS_sum,TCC_EA_RDREQ[0],TCC_EA_RDREQ[1],TCC_EA_RDREQ[2],TCC_EA_RDREQ[3],TCC_EA_RDREQ[4],TCC_EA_RDREQ[5],TCC_EA_RDREQ[6],TCC_EA_RDREQ[7],TCC_EA_RDREQ[8],TCC_EA_RDREQ[9],TCC_EA_RDREQ[10],TCC_EA_RDREQ[11],TCC_EA_RDREQ[12],TCC_EA_RDREQ[13],TCC_EA_RDREQ[14],TCC_EA_RDREQ[15],TCC_MISS[0],TCC_MISS[1],TCC_MISS[2],TCC_MISS[3],TCC_MISS[4],TCC_MISS[5],TCC_MISS[6],TCC_MISS[7],TCC_MISS[8],TCC_MISS[9],TCC_MISS[10],TCC_MISS[11]
0,"kernel(int*) [clone .kd]",1,0,1,14761,14761,1,1,0,0,40,0,48,64,0x0,0x79eecbe84540,60.0000000000,68.0000000000,0.0000000000,0.0000000000,102.0000000000,4,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0000000000,60.0000000000,0,0,0,0,0,0,6,60,1,0,0,0,0,0,0,0,4,0,4,6,0,10,4,61,5,0,7,4
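One way to double-check the reported sums is to add up the per-bank columns in output.csv and compare them with the matching _sum column. The C++ sketch below does this for a single header/row pair. The helper names split_csv_line and check_counter are hypothetical and not part of rocprofv2, and the parser assumes fields never contain escaped quotes. For the sample row above it finds 16 TCC_HIT banks summing to 0, and 16 TCC_MISS banks summing to 113 against a reported TCC_MISS_sum of 102.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split one CSV line into fields, honoring double-quoted fields
// (kernel names such as "kernel(int*) [clone .kd]" are quoted).
// Assumption: no escaped quotes inside a field.
std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (char c : line) {
        if (c == '"') {
            in_quotes = !in_quotes;
        } else if (c == ',' && !in_quotes) {
            fields.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);
    return fields;
}

struct CounterCheck {
    double reported_sum = 0.0;  // value of the "<name>_sum" column
    double per_bank_sum = 0.0;  // sum of all "<name>[i]" columns
    int banks = 0;              // number of "<name>[i]" columns found
};

// Compare "<name>_sum" with the sum of the "<name>[i]" columns,
// e.g. name = "TCC_HIT" or "TCC_MISS".
CounterCheck check_counter(const std::string& header,
                           const std::string& row,
                           const std::string& name) {
    const std::vector<std::string> cols = split_csv_line(header);
    const std::vector<std::string> vals = split_csv_line(row);
    assert(cols.size() == vals.size());

    CounterCheck result;
    const std::string sum_column = name + "_sum";
    const std::string bank_prefix = name + "[";
    for (std::size_t i = 0; i < cols.size(); ++i) {
        if (cols[i] == sum_column) {
            result.reported_sum = std::stod(vals[i]);
        } else if (cols[i].compare(0, bank_prefix.size(), bank_prefix) == 0) {
            result.per_bank_sum += std::stod(vals[i]);
            ++result.banks;
        }
    }
    return result;
}
```

A mismatch between the two numbers does not by itself indicate a profiler bug; this thread does not say whether the per-bank and summed counters were gathered in the same pass.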

Operating System

Ubuntu 24.04.1 LTS

CPU

AMD Ryzen 9 3900X 12-Core Processor

GPU

gfx906 (AMD Vega 7nm also referred to as AMD Vega 20)

ROCm Version

ROCm 6.3.0

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

@ppanchad-amd

Hi @RookieT0T. Internal ticket has been created to assist with your issue. Thanks!

@zichguan-amd

Hi @RookieT0T, can you share the workload that you are trying to profile? It's normal to have 0 L2 hit rate if your workload doesn't reuse any cached data.
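To illustrate the point about reuse, here is a toy C++ model, offered only as a sketch: it treats the cache as an unbounded set of resident lines and counts a hit whenever an address falls in a line that was touched before. It is not a model of the real TCC (no banks, capacity, or eviction), the 64-byte line size is an assumption, and the names HitMiss, replay, and hit_rate are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct HitMiss {
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;
};

// Replay a sequence of byte addresses through an unbounded toy cache.
// The first touch of a line is a miss; every later touch is a hit.
// Assumption: line_bytes = 64 is illustrative, not a confirmed TCC parameter.
HitMiss replay(const std::vector<std::uint64_t>& byte_addresses,
               std::uint64_t line_bytes = 64) {
    assert(line_bytes > 0);
    std::unordered_set<std::uint64_t> resident_lines;
    HitMiss counts;
    for (std::uint64_t address : byte_addresses) {
        const std::uint64_t line = address / line_bytes;
        // insert().second is true only when the line was not present yet.
        if (resident_lines.insert(line).second) {
            ++counts.misses;
        } else {
            ++counts.hits;
        }
    }
    return counts;
}

// Hit rate = hits / (hits + misses), defined as 0 when nothing was accessed.
double hit_rate(const HitMiss& counts) {
    const std::uint64_t total = counts.hits + counts.misses;
    if (total == 0) {
        return 0.0;
    }
    return static_cast<double>(counts.hits) / static_cast<double>(total);
}
```

In this model, a stream that touches each line once (for example addresses 0, 64, 128, 192) yields a hit rate of 0, while repeated loads from the same line yield hits after the first access.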

@RookieT0T RookieT0T reopened this Dec 31, 2024

RookieT0T commented Dec 31, 2024

> Hi @RookieT0T, can you share the workload that you are trying to profile? It's normal to have 0 L2 hit rate if your workload doesn't reuse any cached data.

My workload consists of a series of flat_load_dwordx2 instructions inside an asm volatile block in the kernel function. The addresses used by those loads should produce some cache hits. In addition, the "glc" flag is appended to each load instruction so that the access bypasses the L1 cache (TCP) and goes directly to the L2 cache (TCC).

Example of kernel function with only one load instruction:

__global__ void kernel(int *arr) {
    uint64_t a = 0;

    asm volatile(
        // Wait for outstanding memory operations, then invalidate the vector L1.
        "s_waitcnt vmcnt(0) & lgkmcnt(0)\n\t"
        "buffer_wbinvl1\n\t"
        // 64-bit flat load with the glc bit set.
        "flat_load_dwordx2 %[out0], %[in1] glc\n\t"
        "s_waitcnt vmcnt(0) & lgkmcnt(0)\n\t"
        "s_nop 0\n\t"
        : [out0] "=v"(a)
        : [in1] "v"((uint64_t *)&arr[0])
        : "memory");
}

@RookieT0T

> Hi @RookieT0T, can you share the workload that you are trying to profile? It's normal to have 0 L2 hit rate if your workload doesn't reuse any cached data.

I am also wondering: when the "glc" flag is added, does the TCC_HIT_sum reported by the profiler include instruction cache hits in addition to the data cache hits incurred by the program?
