Compute q, k, v in parallel using OpenMP parallel sections #75
I was also wondering whether these 3 matmuls could be parallelized, as they are independent. The downside is the additional parameter needed to run it. Can't the Maybe this:
#ifdef _OPENMP
omp_set_max_active_levels(2);
#endif
But then the source becomes bloated? I don't know where the red line is. Maybe this project could have a separate branch for a highly optimized version, where the code is not so small. |
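To make the trade-off in this comment concrete, here is a minimal sketch of the idea under discussion. It assumes a row-parallel `matmul` in the style of run.c; the helper names `enable_nested_parallelism` and `qkv_parallel` are hypothetical and are not taken from the PR diff. Without `-fopenmp` the pragmas are ignored and the code runs serially.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* W (d,n) @ x (n,) -> xout (d,). Rows are split across threads under OpenMP. */
static void matmul(float *xout, const float *x, const float *w, int n, int d) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}

/* Hypothetical helper: allow two levels of active parallel regions from
 * code, as an alternative to setting OMP_MAX_ACTIVE_LEVELS=2 in the shell. */
static void enable_nested_parallelism(void) {
#ifdef _OPENMP
    omp_set_max_active_levels(2);
#endif
}

/* Hypothetical helper: the three projections are independent, so each one
 * gets its own section; the parallel for inside matmul is then nested. */
static void qkv_parallel(float *q, float *k, float *v, const float *x,
                         const float *wq, const float *wk, const float *wv,
                         int dim) {
    #pragma omp parallel sections
    {
        #pragma omp section
        matmul(q, x, wq, dim, dim);
        #pragma omp section
        matmul(k, x, wk, dim, dim);
        #pragma omp section
        matmul(v, x, wv, dim, dim);
    }
}
```

Calling `enable_nested_parallelism()` once at startup would replace the extra environment variable, at the cost of the three added lines that the comment worries about.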
My inclination is to not bloat the source code, instead rely on the environment flags for the run config, as the original post shows. They can be documented in the Readme. Trying this out... |
Weirdly this is slower on my machine... :\ |
Interesting :] I suspect your machine runs into parallelization overhead: it spends more time creating and scheduling threads than doing useful work. This actually happens quite often, and the behavior depends heavily on the OpenMP configuration and the CPU. May I ask what machine you ran the code on? One can try varying the maximum number of threads via OMP_NUM_THREADS to see if this is the case. In this example, I used 6 because my CPU has 6 physical cores. I also use OMP_PLACES=cores to prevent OpenMP from using virtual hyperthreads. I'll try to run the code with varying OpenMP parameters tomorrow and report back here. |
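When comparing runs across machines, it can help to log what the OpenMP runtime actually picked up from the environment. The following is a small sketch using standard OpenMP runtime calls; the `thread_config` struct and `read_thread_config` function are hypothetical names, and the serial fallback values are an assumption for builds without `-fopenmp`.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical container for the settings discussed in this thread. */
typedef struct {
    int max_threads;        /* upper bound for a new team, from OMP_NUM_THREADS */
    int num_procs;          /* logical processors visible to the runtime */
    int max_active_levels;  /* from OMP_MAX_ACTIVE_LEVELS */
} thread_config;

static thread_config read_thread_config(void) {
    /* Assumed fallback for a serial build: one thread, no active levels. */
    thread_config c = {1, 1, 0};
#ifdef _OPENMP
    c.max_threads = omp_get_max_threads();
    c.num_procs = omp_get_num_procs();
    c.max_active_levels = omp_get_max_active_levels();
#endif
    return c;
}

static void print_thread_config(void) {
    thread_config c = read_thread_config();
    fprintf(stderr, "max_threads=%d num_procs=%d max_active_levels=%d\n",
            c.max_threads, c.num_procs, c.max_active_levels);
}
```

If `max_threads` is larger than the number of physical cores, that is one hint that the run may be paying for oversubscription.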
So, to test the usefulness of parallelizing the q, k, v computation, I ran the base version and the parallel version introduced in this PR with different maximum thread counts, set via the OMP_NUM_THREADS variable. I benchmarked both the 44m and 110m models to see how the effect changes when moving to heavier models. Below are the tables of achieved throughput for each configuration on the 44m and 110m models, in that order. The code was run with these parameters:
OMP_NUM_THREADS=n OMP_PLACES=cores OMP_MAX_ACTIVE_LEVELS=2 ./run
As can be seen, with the correct maximum number of threads, the parallel version outperforms the current version. However, it is interesting to see that when OpenMP is allowed to spawn more threads, the advantage vanishes. This is because the system spends a lot of time spawning and scheduling threads, which causes a slowdown rather than a speedup. So, a correct setup with
model44m
model110m
|
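For anyone reproducing numbers like these, the arithmetic can be sketched as below. It assumes throughput is tokens generated divided by elapsed wall-clock time in milliseconds; both helper names are hypothetical.

```c
/* Hypothetical helper: tokens per second from a token count and two
 * wall-clock timestamps in milliseconds. Returns 0 for a non-positive
 * elapsed time instead of dividing by zero. */
static double tokens_per_second(int tokens, long start_ms, long end_ms) {
    long elapsed_ms = end_ms - start_ms;
    if (elapsed_ms <= 0) {
        return 0.0;
    }
    return (double)tokens / ((double)elapsed_ms / 1000.0);
}

/* Hypothetical helper: ratio of the new throughput to the baseline. */
static double speedup(double before_tok_s, double after_tok_s) {
    if (before_tok_s <= 0.0) {
        return 0.0;
    }
    return after_tok_s / before_tok_s;
}
```

With the figures quoted in the PR description (85 tok/s before, 101 tok/s after), `speedup(85.0, 101.0)` comes out at roughly 1.19.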
Also, I noticed a TODO in the README for OpenMP documentation. What kind of information about OpenMP would you like to see presented in the README? I might try to fill it in. |
OpenMP parallel sections let us compute q, k, v on different threads, thereby gaining a speedup. Benchmark on the 44M model using a 6-core Intel Core i7-8750H. Before: 85 tok/s. After: 101 tok/s.
runq - Experiment to verify speed up matmuls with OpenMP parallel sections Ref: karpathy#75
OpenMP parallel sections construct enables computation of q, k, v in parallel on different threads. This can result in a performance gain.
Benchmark on the 44M model on a 6-core Intel Core i7-8750H (MacBook Pro 2018).
Before 85tok/s
After 101tok/s
Code was compiled with
clang -Ofast -fopenmp -march=native -ffast-math
Code was run with the following parameters.
parallel for in matmul is nested into parallel sections.
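The effect of that nesting can be checked with a small probe. The sketch below is an illustration rather than code from the PR: the function name `inner_team_size` is hypothetical, and the team size of 2 at each level is an arbitrary choice. It opens an outer parallel region and reports how many threads an inner parallel region receives under a given max-active-levels setting.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical probe: returns the team size of a parallel region nested
 * inside another one. Note that it changes the global max-active-levels
 * setting as a side effect. In a build without OpenMP it returns 1. */
static int inner_team_size(int max_active_levels) {
    int size = 1;
#ifdef _OPENMP
    omp_set_max_active_levels(max_active_levels);
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            #pragma omp single
            {
                /* Only the inner team spawned by outer thread 0 writes,
                 * so there is a single writer for `size`. */
                if (omp_get_ancestor_thread_num(1) == 0) {
                    size = omp_get_num_threads();
                }
            }
        }
    }
#else
    (void)max_active_levels;
#endif
    return size;
}
```

With one active level the inner region is serialized, so the per-row parallelism inside matmul would be lost once q, k, v run in sections; with two active levels the inner region can get its own team, subject to the runtime's thread limits.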