
[FEA] SM80_CP_ASYNC_ support L2 cache prefetch hints #1174

Closed
reed-lau opened this issue Nov 3, 2023 · 6 comments
Labels
feature request New feature or request

Comments

@reed-lau
Contributor

reed-lau commented Nov 3, 2023

Is your feature request related to a problem? Please describe.
On the Ampere architecture (SM80+), the cp.async instruction supports a .level::prefetch_size qualifier as an L2 prefetch hint.
I found it improved performance in my case.
I would like the SM80_CP_ASYNC_* structures and their Traits in cute to support this feature.
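
For context, the qualifier mentioned above appends an L2 prefetch size to the cp.async PTX string. The following is a hypothetical device-side sketch (function names and operand setup are my own, not CUTLASS's actual strings) contrasting the same 16-byte copy without and with the hint:

```cuda
#include <cstdint>

// Hypothetical sketch for SM80+: smem_ptr is a 32-bit shared-memory
// address (e.g. obtained via __cvta_generic_to_shared), gmem_ptr a
// global-memory pointer.
__device__ void cp_async_16B(uint32_t smem_ptr, void const* gmem_ptr) {
  // Plain 16B async copy, no L2 prefetch hint.
  asm volatile("cp.async.cg.shared.global [%0], [%1], 16;\n"
               :: "r"(smem_ptr), "l"(gmem_ptr));
}

__device__ void cp_async_16B_l2_128B(uint32_t smem_ptr, void const* gmem_ptr) {
  // Same copy with .level::prefetch_size = 128B: the hardware also
  // prefetches 128 bytes around the source line into L2.
  asm volatile("cp.async.cg.shared.global.L2::128B [%0], [%1], 16;\n"
               :: "r"(smem_ptr), "l"(gmem_ptr));
}
```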

Describe the solution you'd like

My solution is to add an integer template parameter named L2PrefetchSize to specify the prefetch size.
The implementation uses if constexpr to dispatch to different assembly at compile time (valid values: 0/64/128/256).
L2PrefetchSize defaults to 0, meaning no prefetch is issued.

template <class TS, class TD = TS, int L2PrefetchSize = 0>
struct SM80_CP_ASYNC_CACHEALWAYS {
  CUTE_HOST_DEVICE static void
  copy(TS const& src, TD& dst) {
    if constexpr (L2PrefetchSize == 0) {
      asm volatile("cp.async... ");
    } else if constexpr (L2PrefetchSize == 64) {
      asm volatile("cp.async...L2::64B ...");
    } else if constexpr (L2PrefetchSize == 128 || L2PrefetchSize == 256) {
      // analogous .L2::128B / .L2::256B variants
    } else {
      // dependent condition so the assert only fires when this branch is instantiated
      static_assert(L2PrefetchSize == 0, "unsupported prefetch size for cp.async");
    }
  }
};

If you approve of this approach, I can submit a PR for it.

@thakkarV
Collaborator

thakkarV commented Nov 3, 2023

@ccecka what do you think?

@hwu36
Collaborator

hwu36 commented Nov 3, 2023

@reed-lau, could you make a PR that just changes the PTX to use 128B prefetch all the time? That is the same behavior as 2.x.

@reed-lau
Contributor Author

reed-lau commented Nov 6, 2023

@hwu36 In some cases enabling L2 prefetching may hurt performance. What do you think about leaving the option to the end user?

@hwu36
Collaborator

hwu36 commented Nov 6, 2023

What cases?

@reed-lau
Contributor Author

reed-lau commented Nov 6, 2023

What cases?

I remember that when I was optimizing sparse convolutions for a lidar network, enabling L2 prefetch could hurt performance, but I'm not so sure anymore.
I will do a PR first (#1177), and if I run into that case again, I will test it and comment here.

@reed-lau
Contributor Author

When cp.async is used, 128B prefetch is always enabled. #1177
