
Compute q, k, v in parallel using OpenMP parallel sections #75

Open · wants to merge 1 commit into master
Conversation

Ea0011 commented Jul 25, 2023

The OpenMP parallel sections construct enables the q, k, v projections to be computed in parallel on different threads, which can result in a performance gain.
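
A rough, self-contained sketch of the idea (not the exact diff in this PR; the matmul below only loosely mirrors run.c's matmul helper): each of the three independent projections gets its own section, while matmul keeps its inner parallel for.

// toy q/k/v projection: three independent matvecs computed in parallel sections,
// each of which contains its own OpenMP parallel for (two nesting levels)
#include <stdio.h>
#include <stdlib.h>

// same shape as run.c's matmul: W (d,n) @ x (n,) -> xout (d,)
void matmul(float* xout, float* x, float* w, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) { val += w[i * n + j] * x[j]; }
        xout[i] = val;
    }
}

int main(void) {
    int dim = 512;
    float *x  = calloc(dim, sizeof(float));
    float *wq = calloc((size_t)dim * dim, sizeof(float));
    float *wk = calloc((size_t)dim * dim, sizeof(float));
    float *wv = calloc((size_t)dim * dim, sizeof(float));
    float *q  = calloc(dim, sizeof(float));
    float *k  = calloc(dim, sizeof(float));
    float *v  = calloc(dim, sizeof(float));
    for (int i = 0; i < dim; i++) { x[i] = 1.0f; }

    // the q, k, v projections do not depend on each other,
    // so each one runs in its own section (on its own thread)
    #pragma omp parallel sections
    {
        #pragma omp section
        matmul(q, x, wq, dim, dim);
        #pragma omp section
        matmul(k, x, wk, dim, dim);
        #pragma omp section
        matmul(v, x, wv, dim, dim);
    }

    printf("q[0]=%f k[0]=%f v[0]=%f\n", q[0], k[0], v[0]);
    free(x); free(wq); free(wk); free(wv); free(q); free(k); free(v);
    return 0;
}

Note that the sections construct by itself caps the useful outer parallelism at 3 threads; the remaining cores only help once the nested parallel for is allowed to spawn its own team (see the environment variables below).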

Benchmark on the 44M model, on a 6-core Intel Core i7-8750H (MacBook Pro 2018):

Before: 85 tok/s
After: 101 tok/s

The code was compiled with clang -Ofast -fopenmp -march=native -ffast-math and run with the following parameters:

OMP_NUM_THREADS=6 OMP_PLACES=cores OMP_MAX_ACTIVE_LEVELS=2 ./run ./out44m/model44m.bin
  • OMP_NUM_THREADS=6 specifies the number of threads OpenMP should use (set to the number of physical cores here).
  • OMP_PLACES=cores tells OpenMP to schedule threads on physical cores. If not specified, OpenMP may place threads on hyperthreads, which reduces the performance gain.
  • OMP_MAX_ACTIVE_LEVELS=2 allows second-level (nested) OpenMP parallel regions to also spawn threads. This is crucial because the parallel for inside matmul is now nested inside the parallel sections; see the small check program after this list.
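
To double-check that the nested level is actually enabled on a given machine, a tiny standalone program along these lines (my own illustration, not part of the PR; call it check_nesting.c) prints the team size seen inside a nested region:

// prints the team size of a parallel region nested inside parallel sections;
// with OMP_MAX_ACTIVE_LEVELS=1 the inner team stays at 1 thread,
// with OMP_MAX_ACTIVE_LEVELS=2 it grows to OMP_NUM_THREADS
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            #pragma omp parallel
            {
                #pragma omp single
                printf("nesting level %d, inner team size %d\n",
                       omp_get_level(), omp_get_num_threads());
            }
        }
    }
    return 0;
}

Built with clang -fopenmp check_nesting.c -o check_nesting and run as OMP_NUM_THREADS=6 OMP_MAX_ACTIVE_LEVELS=2 ./check_nesting, it should report an inner team size of 6; with OMP_MAX_ACTIVE_LEVELS=1 it reports 1.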

kroggen (Contributor) commented Jul 25, 2023

I was also wondering whether these 3 matmuls could be parallelized, since they are independent.

The downside is the additional parameter needed to run it.

Can't the OMP_MAX_ACTIVE_LEVELS be defined inside the run.c file?

Maybe this:

#ifdef _OPENMP
  // needs #include <omp.h>
  omp_set_max_active_levels(2);
#endif
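
(If I read the OpenMP spec right, omp_set_max_active_levels() is declared in omp.h, and a call made before the first parallel region takes precedence over the OMP_MAX_ACTIVE_LEVELS environment variable, so the extra flag would no longer be needed.)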

But then the source becomes bloated? I don't know where the red line is.

Maybe this project could have a separate branch for a highly optimized version, at the cost of not-so-small code.

karpathy (Owner) commented
My inclination is to not bloat the source code and instead rely on the environment flags for the run config, as the original post shows. They can be documented in the README. Trying this out...

karpathy (Owner) commented
Weirdly this is slower on my machine... :\

Ea0011 (Author) commented Jul 26, 2023

@karpathy

Interesting :] I suspect your machine runs into parallelization overhead: it basically spends more time creating and scheduling threads than doing useful work. This actually happens quite often, and the behavior depends heavily on the OpenMP configuration and the CPU. May I ask what machine you are running the code on?

One can try varying the maximum number of threads via OMP_NUM_THREADS to see if this is the case. In this example, I used 6 because my CPU has 6 physical cores. I also use OMP_PLACES=cores to prevent OpenMP from using virtual hyperthreads. I'll try running the code with varying OpenMP parameters tomorrow and report back here.

Ea0011 (Author) commented Jul 26, 2023

So, to test the usefulness of parallelizing the q, k, v computation, I ran the base version and the parallel version introduced in this PR with different maximum thread counts set via the OMP_NUM_THREADS variable. I benchmarked both the 44M and 110M models to see how the effect changes when moving to heavier models. Below are tables of the achieved throughput for each configuration on the 44M and 110M models, in that order. The code was run with these params:

OMP_NUM_THREADS=n OMP_PLACES=cores OMP_MAX_ACTIVE_LEVELS=2 ./run

As can be seen, with the right maximum thread count, the parallel version outperforms the current version. Interestingly, when OpenMP is allowed to spawn more threads, the advantage vanishes: the system spends so much time spawning and scheduling threads that the result is a slowdown rather than a speedup. So a correct OMP_NUM_THREADS setting is necessary for maximum throughput.

It is worth noting that with nested parallel regions, as in this PR, OMP_NUM_THREADS defines the number of threads for each region, so one should be careful not to spawn too many threads when nesting (for example, OMP_NUM_THREADS=8 lets each of the three sections spawn an inner team of 8 threads, far more than the 6 physical cores available). In this case, 4 max threads with parallel q, k, v computation yields the fastest results. For both models, parallel computation gives about a 17% speedup over the best-performing base setup (OMP_NUM_THREADS=8 for both models). Generally, a low maximum thread count favors parallel q, k, v computation. PR #45 also demonstrates that the configuration needed for maximum performance is machine and CPU specific.

model44m

OMP_NUM_THREADS    Base        Parallel
 1                 65 tok/s    65 tok/s
 2                 79 tok/s    98 tok/s
 4                 84 tok/s   111 tok/s
 6                 86 tok/s    96 tok/s
 8                 95 tok/s    80 tok/s
10                 84 tok/s    71 tok/s
12                 79 tok/s    63 tok/s

model110m

OMP_NUM_THREADS    Base        Parallel
 1                 30 tok/s    30 tok/s
 2                 32 tok/s    42 tok/s
 4                 34 tok/s    46 tok/s
 6                 37 tok/s    42 tok/s
 8                 39 tok/s    39 tok/s
10                 35 tok/s    36 tok/s
12                 35 tok/s    33 tok/s

Ea0011 (Author) commented Jul 26, 2023

Also, I noticed a TODO in the README for OpenMP documentation. What kind of information about OpenMP would you like to be presented in the README? I might try to fill it in.

OpenMP parallel sections lets us compute q, k, v with different threads, thereby gaining a speedup.

Benchmark on the 44M model using a 6-core Intel Core i7-8750H:
Before: 85 tok/s
After: 101 tok/s
trholding added a commit to trholding/llama2.c that referenced this pull request Jul 20, 2024
runq - Experiment to verify speed up matmuls with OpenMP parallel sections

Ref: karpathy#75