
Use less DataLayouts internals in DSS #2049

Merged: 1 commit merged into main on Oct 20, 2024
Conversation

charleskawczynski (Member) commented Oct 17, 2024

This is a first step towards #2048.

@charleskawczynski charleskawczynski force-pushed the ck/dss_less_internals branch 2 times, most recently from c57b739 to 8d923f6 Compare October 18, 2024 20:57
@charleskawczynski charleskawczynski changed the title Ck/dss less internals Use less DataLayouts internals in DSS Oct 18, 2024
@charleskawczynski charleskawczynski force-pushed the ck/dss_less_internals branch 3 times, most recently from 540e9a7 to cd82e7b Compare October 18, 2024 21:02
@charleskawczynski charleskawczynski marked this pull request as ready for review October 18, 2024 21:03
sriharshakandala (Member) commented Oct 18, 2024

Generally speaking, max_threads = 256 works well. If resource constraints (shared-memory size, register usage, etc.) permit, CUDA automatically schedules multiple thread blocks on the same streaming multiprocessor, so no additional intervention is required.
Using threads_via_occupancy does not necessarily guarantee optimality! It's up to the user to design the best thread-block configuration for their application, and I believe we should tune it by hand, especially for performance-critical kernels!
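To illustrate the two launch strategies being compared, here is a minimal sketch using CUDA.jl (not code from this PR; `kernel_copy!` and the array sizes are made up for the example). It contrasts a fixed block size of 256 threads with letting CUDA.jl's occupancy API, `launch_configuration`, pick the block size:

```julia
using CUDA

# A trivial placeholder kernel: one thread per element.
function kernel_copy!(dest, src)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(dest)
        @inbounds dest[i] = src[i]
    end
    return nothing
end

dest = CUDA.zeros(Float32, 10_000)
src = CUDA.ones(Float32, 10_000)
nitems = length(dest)

# Option 1: fixed block size, as suggested above (max_threads = 256).
threads = 256
blocks = cld(nitems, threads)
@cuda threads = threads blocks = blocks kernel_copy!(dest, src)

# Option 2: query the occupancy API for a block size that maximizes
# occupancy for this specific compiled kernel.
kernel = @cuda launch = false kernel_copy!(dest, src)
config = CUDA.launch_configuration(kernel.fun)
threads = min(nitems, config.threads)
blocks = cld(nitems, threads)
kernel(dest, src; threads, blocks)
```

The occupancy API accounts for the kernel's actual register and shared-memory usage on the current device, whereas a fixed 256 is a reasonable default that may still leave performance on the table for unusually heavy or light kernels.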

@charleskawczynski charleskawczynski force-pushed the ck/dss_less_internals branch 3 times, most recently from fd90d44 to 42f2474 Compare October 19, 2024 01:34
charleskawczynski (Member, Author) commented

> Generally speaking, max_threads = 256 works well. If resource constraints (shared-memory size, register usage, etc.) permit, CUDA automatically schedules multiple thread blocks on the same streaming multiprocessor, so no additional intervention is required. Using threads_via_occupancy does not necessarily guarantee optimality! It's up to the user to design the best thread-block configuration for their application, and I believe we should tune it by hand, especially for performance-critical kernels!

I'm fine with reverting that for now because it's not really part of the refactor. I am curious how this differs from the occupancy API, though.

@charleskawczynski charleskawczynski force-pushed the ck/dss_less_internals branch 3 times, most recently from 90db469 to b446499 Compare October 19, 2024 17:17
@charleskawczynski charleskawczynski merged commit 0635ff3 into main Oct 20, 2024
17 checks passed
@charleskawczynski charleskawczynski deleted the ck/dss_less_internals branch October 20, 2024 01:43