Make AccCpuThreads parallelize over blocks #1135
This is similar to an earlier proposal. However, it is not yet clear to me how this can improve things. Yes, we could simply execute multiple blocks in parallel. However, this restricts the block size to a single thread. This is necessary because threads within a block can be explicitly synchronized/interrupted, which is not possible by default with std::thread. What we would need is a mix of cooperative multitasking (e.g. fibers) and preemptive OS threads.
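To illustrate why explicit synchronization forces one OS thread per alpaka thread, here is a minimal sketch (not alpaka's actual implementation; `BlockBarrier` and `sumAfterBarrier` are hypothetical names) of a `__syncthreads()`-style barrier built on `std::thread`. Each alpaka thread needs its own suspendable OS thread for the barrier to work:

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: a syncBlockThreads-style barrier only works if every
// alpaka thread of the block owns an OS thread that can really be suspended.
class BlockBarrier {
    std::mutex m;
    std::condition_variable cv;
    std::size_t count, waiting = 0, generation = 0;
public:
    explicit BlockBarrier(std::size_t n) : count(n) {}
    void wait() {
        std::unique_lock<std::mutex> lk(m);
        auto gen = generation;
        if (++waiting == count) { waiting = 0; ++generation; cv.notify_all(); }
        else cv.wait(lk, [&]{ return gen != generation; });
    }
};

int sumAfterBarrier(std::size_t nThreads) {
    BlockBarrier barrier(nThreads);
    std::vector<int> partial(nThreads);
    std::vector<int> results(nThreads);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < nThreads; ++i)
        threads.emplace_back([&, i] {
            partial[i] = int(i) + 1;   // phase 1: each thread writes its slot
            barrier.wait();            // the __syncthreads() equivalent
            int s = 0;                 // phase 2: safely read all slots
            for (int v : partial) s += v;
            results[i] = s;
        });
    for (auto& t : threads) t.join();
    return results[0];
}
```

The cost this sketch makes visible is the point of the issue: every barrier crossing involves OS-level blocking and wake-ups, which is far heavier than a GPU's hardware barrier.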
I already proposed this in #22.
Yes, having only one thread per block, like AccOmp2Blocks, is the idea. Alpaka kernels are probably mostly written with GPU (i.e. the CUDA backend) in mind, which will often lead to one of the following two properties:
The two cases above are currently slow because:
With the type of GPU codes I described above, one would ideally map blocks to OS threads and threads to SIMD lanes, because this would be closest to the GPU architectures. I know that this is not viable, because SIMD lacks some important bits of SIMT (i.e. __syncthreads() and scattered memory load/store). The best we can do is to map OS threads to blocks, ignore the thread layer, and leave SIMD to the element layer.
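The proposed mapping can be sketched in a few lines; `runGrid` is a hypothetical name, not alpaka API. One OS thread is spawned per block, each block has a single alpaka thread, and the element layer is the innermost contiguous loop left to the compiler's vectorizer:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch of the proposed mapping: one OS thread per block,
// a single alpaka thread per block, and the element layer as the innermost
// loop that the auto-vectorizer can turn into SIMD instructions.
void runGrid(std::size_t numBlocks, std::size_t elemsPerBlock,
             std::vector<float>& data) {
    std::vector<std::thread> workers;
    for (std::size_t b = 0; b < numBlocks; ++b)
        workers.emplace_back([&, b] {
            std::size_t const first = b * elemsPerBlock;
            for (std::size_t e = 0; e < elemsPerBlock; ++e)  // element layer
                data[first + e] *= 2.0f;
        });
    for (auto& w : workers) w.join();
}
```

Because no synchronization is needed between the single-threaded blocks, the OS threads never have to block on each other, which is exactly what AccOmp2Blocks exploits.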
Btw. OS threads sharing data in a block are probably just as likely to slow each other down due to false sharing (each thread's writes invalidating the others' cache lines) as they are to profit from sharing cache lines.
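The usual mitigation, shown in this sketch (`PaddedCounter` and `countPadded` are hypothetical names; the 64-byte line size is an assumption, not a guarantee), is to pad per-thread data out to a cache line so neighbouring threads stop invalidating each other's lines:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical sketch: per-thread counters padded to an assumed 64-byte
// cache line, so OS threads of one block do not false-share a line.
struct alignas(64) PaddedCounter { long value = 0; };

long countPadded(std::size_t nThreads, long itersPerThread) {
    std::vector<PaddedCounter> counters(nThreads);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < nThreads; ++i)
        threads.emplace_back([&, i] {
            for (long k = 0; k < itersPerThread; ++k)
                ++counters[i].value;   // each thread stays on its own line
        });
    for (auto& t : threads) t.join();
    long total = 0;
    for (auto const& c : counters) total += c.value;
    return total;
}
```

With an unpadded `long counters[nThreads]` the loop is correct but can run several times slower, since every increment ping-pongs the shared cache line between cores.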
Yes, this is probably one of the main issues with the current version of AccCpuThreads.
Are you suggesting to implement preemptive multitasking at thread level in alpaka by hand? Maybe using coroutines?
Boost.Fiber is a library with an interface similar to std::thread, but based on Boost.Context, which implements cooperative multitasking by hand.
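To make the cooperative idea concrete without pulling in Boost, here is a toy round-robin scheduler (all names hypothetical; a real fiber library switches stacks instead): each "fiber" is a resumable step function that yields by returning `true` and finishes by returning `false`, so many of them interleave on one OS thread:

```cpp
#include <deque>
#include <functional>
#include <vector>

// Minimal cooperative-multitasking sketch (not Boost.Fiber itself): a
// round-robin scheduler interleaves resumable step functions on a single
// OS thread, the way a user-space fiber library would.
void runCooperatively(std::deque<std::function<bool()>> fibers) {
    while (!fibers.empty()) {
        auto task = std::move(fibers.front());
        fibers.pop_front();
        if (task()) fibers.push_back(std::move(task));  // yielded: requeue
    }
}

std::vector<int> interleaveTwoTasks() {
    std::vector<int> order;
    auto makeTask = [&order](int id, int steps) {
        return [&order, id, steps, done = 0]() mutable {
            order.push_back(id);        // one "time slice" of work
            return ++done < steps;      // true = yield, false = finished
        };
    };
    runCooperatively({makeTask(1, 2), makeTask(2, 2)});
    return order;  // the two tasks alternate: 1, 2, 1, 2
}
```

A block-level barrier maps naturally onto this model: a fiber that reaches the barrier simply yields until all fibers of the block have arrived, with no OS-level blocking involved.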
The Intel oneAPI compiler for CPUs solves the issue in the following way:
So when the compiler encounters a work-group, it will launch its auto-vectorizer. I wonder if we can do something similar in alpaka, maybe by making better use of the element layer.
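What "making better use of the element layer" could mean in practice is sketched below (`saxpyBlock` is a hypothetical name): the per-block work is a plain contiguous loop with no cross-iteration dependencies, exactly the shape an auto-vectorizer turns into SIMD, similar to how the Intel CPU compiler vectorizes across the work-items of a work-group:

```cpp
#include <cstddef>

// Hypothetical sketch of leaning on the element layer: the per-block work
// is a simple contiguous loop that the compiler's auto-vectorizer can turn
// into SIMD instructions, with no thread layer involved at all.
void saxpyBlock(float a, float const* x, float* y, std::size_t numElements) {
    for (std::size_t e = 0; e < numElements; ++e)  // element layer, SIMD-friendly
        y[e] = a * x[e] + y[e];
}
```

Combined with one OS thread per block, this gives the blocks-to-threads, elements-to-SIMD mapping discussed above, while the thread layer stays at size one.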
AccCpuThreads is currently a bad showcase of C++11 threads, as it uses the sub-optimal strategy of spawning CPU threads at thread level instead of block level, just like the equally useless AccOmp2Threads.
To escape Amdahl's law for as long as possible, you must parallelize at as high a level as possible. AccCpuThreads should do this.
I am aware that AccCpuThreads is, and will remain, only a demonstration, so there will not be many resources to spend on improving it. I am just putting the issue here for reference, based on a discussion we had.