
Compute q, k, v in parallel using OpenMP parallel sections #75

Open · wants to merge 1 commit into master
Conversation

Ea0011 commented Jul 25, 2023

The OpenMP parallel sections construct enables the q, k, v projections to be computed in parallel on different threads, which can result in a performance gain.
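
A rough, self-contained sketch of the idea (not the exact diff in this PR; the matmul below only loosely mirrors run.c's matmul helper): each of the three independent projections gets its own section, while matmul keeps its inner parallel for.

// toy q/k/v projection: three independent matvecs computed in parallel sections,
// each of which contains its own OpenMP parallel for (two nesting levels)
#include <stdio.h>
#include <stdlib.h>

// same shape as run.c's matmul: W (d,n) @ x (n,) -> xout (d,)
void matmul(float* xout, float* x, float* w, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) { val += w[i * n + j] * x[j]; }
        xout[i] = val;
    }
}

int main(void) {
    int dim = 512;
    float *x  = calloc(dim, sizeof(float));
    float *wq = calloc((size_t)dim * dim, sizeof(float));
    float *wk = calloc((size_t)dim * dim, sizeof(float));
    float *wv = calloc((size_t)dim * dim, sizeof(float));
    float *q  = calloc(dim, sizeof(float));
    float *k  = calloc(dim, sizeof(float));
    float *v  = calloc(dim, sizeof(float));
    for (int i = 0; i < dim; i++) { x[i] = 1.0f; }

    // the q, k, v projections do not depend on each other,
    // so each one runs in its own section (on its own thread)
    #pragma omp parallel sections
    {
        #pragma omp section
        matmul(q, x, wq, dim, dim);
        #pragma omp section
        matmul(k, x, wk, dim, dim);
        #pragma omp section
        matmul(v, x, wv, dim, dim);
    }

    printf("q[0]=%f k[0]=%f v[0]=%f\n", q[0], k[0], v[0]);
    free(x); free(wq); free(wk); free(wv); free(q); free(k); free(v);
    return 0;
}

Note that the sections construct by itself caps the useful outer parallelism at 3 threads; the remaining cores only help once the nested parallel for is allowed to spawn its own team (see the environment variables below).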

Benchmark on the 44M model, on a 6-core Intel Core i7-8750H (MacBook Pro 2018):

Before: 85 tok/s
After: 101 tok/s

The code was compiled with clang -Ofast -fopenmp -march=native -ffast-math and run with the following parameters:

OMP_NUM_THREADS=6 OMP_PLACES=cores OMP_MAX_ACTIVE_LEVELS=2 ./run ./out44m/model44m.bin
  • OMP_NUM_THREADS=6 specifies the number of threads OpenMP should use (set to the number of physical cores here).
  • OMP_PLACES=cores tells OpenMP to schedule threads on physical cores. If not specified, OpenMP may place threads on hyperthreads, which reduces the performance gain.
  • OMP_MAX_ACTIVE_LEVELS=2 allows second-level (nested) OpenMP parallel regions to also spawn threads. This is crucial because the parallel for inside matmul is now nested inside the parallel sections; see the small check program after this list.
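
To double-check that the nested level is actually enabled on a given machine, a tiny standalone program along these lines (my own illustration, not part of the PR; call it check_nesting.c) prints the team size seen inside a nested region:

// prints the team size of a parallel region nested inside parallel sections;
// with OMP_MAX_ACTIVE_LEVELS=1 the inner team stays at 1 thread,
// with OMP_MAX_ACTIVE_LEVELS=2 it grows to OMP_NUM_THREADS
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            #pragma omp parallel
            {
                #pragma omp single
                printf("nesting level %d, inner team size %d\n",
                       omp_get_level(), omp_get_num_threads());
            }
        }
    }
    return 0;
}

Built with clang -fopenmp check_nesting.c -o check_nesting and run as OMP_NUM_THREADS=6 OMP_MAX_ACTIVE_LEVELS=2 ./check_nesting, it should report an inner team size of 6; with OMP_MAX_ACTIVE_LEVELS=1 it reports 1.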

kroggen (Contributor) commented Jul 25, 2023

I was also wondering whether these 3 matmuls could be parallelized, since they are independent.

The downside is the additional parameter needed to run it.

Can't the OMP_MAX_ACTIVE_LEVELS be defined inside the run.c file?

Maybe this:

#ifdef _OPENMP
  // needs #include <omp.h>
  omp_set_max_active_levels(2);
#endif
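
(If I read the OpenMP spec right, omp_set_max_active_levels() is declared in omp.h, and a call made before the first parallel region takes precedence over the OMP_MAX_ACTIVE_LEVELS environment variable, so the extra flag would no longer be needed.)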

But then the source becomes bloated? I don't know where the red line is.

Maybe this project could have a separate branch for a highly optimized version, at the cost of not-so-small code.

karpathy (Owner) commented
My inclination is to not bloat the source code and instead rely on the environment flags for the run config, as the original post shows. They can be documented in the README. Trying this out...

karpathy (Owner) commented
Weirdly this is slower on my machine... :\

Ea0011 (Author) commented Jul 26, 2023

@karpathy

Interesting :] I suspect your machine runs into parallelization overhead: it basically spends more time creating and scheduling threads than doing useful work. This actually happens quite often, and the behavior depends heavily on the OpenMP configuration and the CPU. May I ask what machine you are running the code on?

One can try varying the maximum number of threads via OMP_NUM_THREADS to see if this is the case. In this example, I used 6 because my CPU has 6 physical cores. I also use OMP_PLACES=cores to prevent OpenMP from using virtual hyperthreads. I'll try running the code with varying OpenMP parameters tomorrow and report back here.

Ea0011 (Author) commented Jul 26, 2023

So, to test the usefulness of parallelizing the q, k, v computation, I ran the base version and the parallel version introduced in this PR with different maximum thread counts set via the OMP_NUM_THREADS variable. I benchmarked both the 44M and 110M models to see how the effect changes when moving to heavier models. Below are tables of the achieved throughput for each configuration on the 44M and 110M models, in that order. The code was run with these params:

OMP_NUM_THREADS=n OMP_PLACES=cores OMP_MAX_ACTIVE_LEVELS=2 ./run

As can be seen, with the right maximum thread count, the parallel version outperforms the current version. Interestingly, when OpenMP is allowed to spawn more threads, the advantage vanishes: the system spends so much time spawning and scheduling threads that the result is a slowdown rather than a speedup. So a correct OMP_NUM_THREADS setting is necessary for maximum throughput.

It is worth noting that with nested parallel regions, as in this PR, OMP_NUM_THREADS defines the number of threads for each region, so one should be careful not to spawn too many threads when nesting (for example, OMP_NUM_THREADS=8 lets each of the three sections spawn an inner team of 8 threads, far more than the 6 physical cores available). In this case, 4 max threads with parallel q, k, v computation yields the fastest results. For both models, parallel computation gives about a 17% speedup over the best-performing base setup (OMP_NUM_THREADS=8 for both models). Generally, a low maximum thread count favors parallel q, k, v computation. PR #45 also demonstrates that the configuration needed for maximum performance is machine and CPU specific.

model44m

OMP_NUM_THREADS    Base        Parallel
 1                 65 tok/s    65 tok/s
 2                 79 tok/s    98 tok/s
 4                 84 tok/s   111 tok/s
 6                 86 tok/s    96 tok/s
 8                 95 tok/s    80 tok/s
10                 84 tok/s    71 tok/s
12                 79 tok/s    63 tok/s

model110m

OMP_NUM_THREADS    Base        Parallel
 1                 30 tok/s    30 tok/s
 2                 32 tok/s    42 tok/s
 4                 34 tok/s    46 tok/s
 6                 37 tok/s    42 tok/s
 8                 39 tok/s    39 tok/s
10                 35 tok/s    36 tok/s
12                 35 tok/s    33 tok/s

Ea0011 (Author) commented Jul 26, 2023

Also, I noticed a TODO in the README for OpenMP documentation. What kind of information about OpenMP would you like to be presented in the README? I might try to fill it in.

OpenMP parallel sections lets us compute q, k, v with different threads, thereby gaining a speedup.

Benchmark on the 44M model using a 6-core Intel Core i7-8750H:
Before: 85 tok/s
After: 101 tok/s
trholding added a commit to trholding/llama2.c that referenced this pull request Jul 20, 2024
runq - Experiment to verify speed up matmuls with OpenMP parallel sections

Ref: karpathy#75