Use prescribed thread-block configurations #1969
Conversation
Here is a compressed summary of the results. Global specs:

- fill! (benchmark table collapsed in this export)
- copyto! (benchmark table collapsed in this export)
- stencils (benchmark table collapsed in this export)

So, the notable improvements here will be on the
(force-pushed from 78058a0 to 4417516)
Closes #1854.
(force-pushed from 226bec3 to d34f647)
Can we also add the usual benchmarks for these kernels, with a sync statement at the end? It might be easier to use the benchmarking tools for gathering timing information.
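A hedged sketch of what such a benchmark could look like with BenchmarkTools.jl and CUDA.jl (the array here is hypothetical, not from this PR); the `CUDA.@sync` ensures the GPU work actually finishes before the timer stops:

```julia
using CUDA, BenchmarkTools

A = CUDA.zeros(Float32, 1024, 1024)  # hypothetical test array

# Without the sync, the host-side timer would mostly measure kernel launch
# overhead, since CUDA kernels execute asynchronously with respect to the host.
@btime CUDA.@sync fill!($A, 1f0)
```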
(force-pushed from d34f647 to 7ed62c9)
Overall, looks good to me!
Based on comparing #1963 against our main branch, this PR removes conversions from linear to cartesian indices, and instead uses partitioned thread-block configurations to improve the performance of some kernels. xref: JuliaGPU/KernelAbstractions.jl#470.
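For context, a minimal Julia sketch (illustrative only, not this PR's actual kernel code) of the per-thread linear-to-cartesian conversion that kernels on main had to perform, and the shape of the partitioned alternative; the dimensions and thread/block shapes below are hypothetical:

```julia
# Sketch: linear-index kernels delinearize on every thread.
dims = (16, 16, 4)
CI = CartesianIndices(dims)
I = CI[42]                 # costs an integer div/rem chain per dimension
@assert Tuple(I) == (10, 3, 1)

# Sketch of the partitioned alternative: choose (threads, blocks) so the
# hardware indices map straight onto the array axes, with no conversion:
threads = (16, 16)         # a 2D block covers one (i, j) tile
blocks  = (1, 1, 4)        # hypothetical partition over the remaining axis
# inside the kernel: i = threadIdx().x; j = threadIdx().y; k = blockIdx().z
```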
Also, these launch configurations all start from CUDA's occupancy API, in order to get a safer bound on how many threads to use. I've seen errors on the main branch caused by launching kernels with too many threads (in some of the builds from this PR).
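CUDA.jl exposes the occupancy API through `launch_configuration`; a hedged sketch of the pattern (the kernel and array below are hypothetical stand-ins):

```julia
using CUDA

function mykernel!(A)  # hypothetical kernel
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    i <= length(A) && (A[i] = 1f0)
    return nothing
end

A = CUDA.zeros(Float32, 2^20)
kernel = @cuda launch=false mykernel!(A)

# Query the occupancy API for a safe upper bound on threads per block,
# rather than hard-coding a value that may exceed this kernel's limit.
config  = launch_configuration(kernel.fun)
threads = min(length(A), config.threads)
blocks  = cld(length(A), threads)
kernel(A; threads, blocks)
```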
I'll try making the comparison easier, but for now I'm going to just paste the results:
Main (benchmark tables collapsed in this export):
- fill!
- copyto!
- stencils

This PR (benchmark tables collapsed in this export):
- fill!
- copyto!
- stencils