[BUG]: Performance degradation after updating to Mojo v 0.4.0 #1030

tairov · 2023-10-09T23:37:57Z

Bug description

After updating to Mojo 0.4.0, we adapted the llama2.mojo code to the latest changes, primarily those related to the global Runtime. The master branch of llama2.mojo is now compatible with Mojo 0.4.0.
However, with this upgrade we noticed a performance degradation on the smaller baby llama models such as stories15M.bin.

v0.3.1: achieved tok/s ranged from 420 to 450
v0.4.0: achieved tok/s ranged from 390 to 410

Mojo v. 0.3.1

$ mojo --version
mojo 0.3.1 (a3eed7c8)

$ git clone git@github.com:tairov/llama2.mojo.git
$ cd llama2.mojo
# check out the latest commit compatible with v. 0.3.1
$ git checkout e9c0705c84a578a62e30b2104ca483073c9d092c
$ mojo build llama2.mojo
$ ./llama2 stories15M.bin -t 0.0 -s 100
num hardware threads:  3
SIMD vector width:  16
checkpoint size:  60816028 [ 57 MB ]
n layers:  6
vocab size:  32000
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red ball in the sky. It was the sun! She thought it was so pretty.
Lily wanted to play with the ball, but it was too high up in the sky. She tried to jump and reach it, but she couldn't. Then, she had an idea. She would use a stick to knock the ball down.
Lily found a stick and tried to hit the ball. But the stick was too short. She tried again and again, but she couldn't reach it. She felt sad.
Suddenly, a kind man came by and saw Lily. He asked her what was wrong. Lily told him about the ball. The man smiled and said, "I have a useful idea!" He took out a long stick and used it to knock the ball down. Lily was so happy! She thanked the man and they played together in the sunshine.
achieved tok/s:  440.0

Mojo v. 0.4.0

$ git clone git@github.com:tairov/llama2.mojo.git
$ cd llama2.mojo
$ mojo build llama2.mojo

$ mojo --version
mojo 0.4.0 (9e33b013)
$ ./llama2 stories15M.bin -j 3 -t 0.0 -s 100
num hardware threads: 3  SIMD width: 16
checkpoint size:  60816028 [ 57 MB ] | n layers: 6 | vocab size: 32000
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red ball in the sky. It was the sun! She thought it was so pretty.
Lily wanted to play with the ball, but it was too high up in the sky. She tried to jump and reach it, but she couldn't. Then, she had an idea. She would use a stick to knock the ball down.
Lily found a stick and tried to hit the ball. But the stick was too short. She tried again and again, but she couldn't reach it. She felt sad.
Suddenly, a kind man came by and saw Lily. He asked her what was wrong. Lily told him about the ball. The man smiled and said, "I have a useful idea!" He took out a long stick and used it to knock the ball down. Lily was so happy! She thanked the man and they played together in the sunshine.
achieved tok/s:  398.55072463768113
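
For reference, the size of the regression between the two runs shown above can be worked out directly from the reported tok/s values. This is a minimal sketch in Python; the function name `relative_change` is only illustrative and is not part of llama2.mojo.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# tok/s values copied from the two runs above
tok_s_031 = 440.0
tok_s_040 = 398.55072463768113

change = relative_change(tok_s_031, tok_s_040)
print(f"v0.3.1 -> v0.4.0: {change:.1f}%")
```

For this single pair of runs that is a drop of roughly 9%, which is in line with the 420 to 450 versus 390 to 410 ranges quoted earlier.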

Steps to reproduce

Include relevant code snippet or link to code that did not work as expected.

See the description above.

If applicable, add screenshots to help explain the problem.
If using the Playground, name the pre-existing notebook that failed and the steps that led to failure.
Include anything else that might help us debug the issue.

System information

- What OS did you install Mojo on?
Ubuntu 20.04
- Provide version information for Mojo by pasting the output of `mojo -v`
mojo -v
mojo 0.3.1 (a3eed7c8)

mojo -v
mojo 0.4.0 (9e33b013)


- Provide Modular CLI version by pasting the output of `modular -v`

modular -v
modular 0.1.4 (6b54d308)


jackos · 2023-10-10T02:56:09Z

See if this PR helps: tairov/llama2.mojo#46

On my macOS M2 Max it more than doubled performance; it might give similar results on Linux.

tairov · 2023-10-10T19:33:24Z

Thanks @jackos, I merged your PR, though I decided to keep the default number of workers unchanged.
As you can see in my description, I use the recently introduced -j param when executing ./llama2. This param lets users set the desired number of workers/cores, with a default value of num_cores().
So I executed both tests with cores = 3 (on my test machine that means 3 out of 6 cores).
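
The behaviour described here (a worker-count flag that falls back to the number of cores) can be sketched as follows. This is an illustration in Python, not the llama2.mojo implementation: the parser, the `workers` attribute, and the program name are all hypothetical, and `os.cpu_count()` only stands in for Mojo's `num_cores()`.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="llama2-sketch")
    # Hypothetical stand-in for the -j flag: defaults to every core the OS reports.
    parser.add_argument(
        "-j",
        dest="workers",
        type=int,
        default=os.cpu_count() or 1,
        help="number of workers to use",
    )
    return parser


# Pin the run to 3 workers, as in the benchmarks above.
args = build_parser().parse_args(["-j", "3"])
print(args.workers)
```

Passing the flag explicitly on both versions keeps the worker count out of the comparison, so any remaining difference comes from the toolchain or the code changes.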

jackos · 2023-10-10T20:30:46Z

No worries, thanks @tairov. I'm investigating what's happening here; thanks for raising this.

prabhuramachandran · 2023-10-15T17:53:23Z

I can also report performance degradation with the new changes in all my toy examples. Unfortunately, I lost my earlier 0.3.1 install in the update and cannot get those numbers back (#1032 would be nice). Even before the update, the vanilla parallelize function was slower than the case where one explicitly passed a Runtime instance. Setting the number of workers in the parallelize function (with v0.4.0) does not seem to increase the number of threads being used, according to htop and the system monitor.

Mogball · 2023-11-03T00:08:54Z

Did we ever get to the bottom of this?

jackos · 2023-11-03T03:22:45Z

@Mogball Building and testing older versions on my M2 Max shows improved performance with the global runtime:

0.3.1
Throughput of a 128x128 matrix multiplication in Python:
0.00595777921927949 GFLOP/s
Throughput of a 512x512 matrix multiplication in Mojo using a naive algorithm:
6.2878590498800753 GFLOP/s <> 1055 x speedup over Python
Throughput of a 512x512 matrix multiplication in Mojo using vectorization:
22.415820339052122 GFLOP/s <> 3762 x speedup over Python
Throughput of a 512x512 matrix multiplication in Mojo using the stdlib `vectorize`:
22.311811429918414 GFLOP/s <> 3744 x speedup over Python
Throughput of a 512x512 {vectorized + parallelized} matrix multiplication in Mojo:
47.302493854924229 GFLOP/s <> 7939 x speedup over Python
Throughput of a 512x512 {tiled + vectorized + parallelized} matrix multiplication in Mojo:
46.219825445061026 GFLOP/s <> 7757 x speedup over Python
Throughput of a 512x512 {tiled + unrolled + vectorized + parallelized} matrix multiplication in Mojo:
49.211583140549266 GFLOP/s <> 8260 x speedup over Python

0.4.0
Throughput of a 128x128 matrix multiplication in Python:
0.005589797536271804 GFLOP/s
Throughput of a 512x512 matrix multiplication in Mojo using a naive algorithm:
5.8369915752720472 GFLOP/s <> 1044 x speedup over Python
Throughput of a 512x512 matrix multiplication in Mojo using vectorization:
20.857565991801668 GFLOP/s <> 3731 x speedup over Python
Throughput of a 512x512 matrix multiplication in Mojo using the stdlib `vectorize`:
22.003744896834181 GFLOP/s <> 3936 x speedup over Python
Throughput of a 512x512 {vectorized + parallelized} matrix multiplication in Mojo:
53.450444701130209 GFLOP/s <> 9562 x speedup over Python
Throughput of a 512x512 {tiled + vectorized + parallelized} matrix multiplication in Mojo:
58.424512941257383 GFLOP/s <> 10451 x speedup over Python
Throughput of a 512x512 {tiled + unrolled + vectorized + parallelized} matrix multiplication in Mojo:
62.657210194410105 GFLOP/s <> 11209 x speedup over Python

0.5.0
Python:         0.006 GFLOPS
Numpy:        117.664 GFLOPS
Naive:          6.313 GFLOPS   1074.06x Python  0.05x Numpy
Vectorized:    21.996 GFLOPS   3741.99x Python  0.19x Numpy
Parallelized:  52.245 GFLOPS   8887.96x Python  0.44x Numpy
Tiled:         59.179 GFLOPS  10067.71x Python  0.50x Numpy
Unrolled:      63.535 GFLOPS  10808.73x Python  0.54x Numpy
Accumulated:  562.716 GFLOPS  95730.25x Python  4.78x Numpy

There were other changes made to llama2.mojo between versions. I don't see any overhead caused by the global runtime in the isolated matmul test.
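
To make the 0.3.1 versus 0.4.0 comparison easier to read, the per-variant ratio can be computed from the GFLOP/s figures above. This is a rough sketch in Python: the dictionary names are illustrative, the values are rounded to three decimals, and a single run per version cannot rule out noise.

```python
# GFLOP/s copied from the 0.3.1 and 0.4.0 outputs above (rounded to 3 decimals)
gflops_031 = {
    "naive": 6.288,
    "vectorized": 22.416,
    "stdlib vectorize": 22.312,
    "vectorized + parallelized": 47.302,
    "tiled + vectorized + parallelized": 46.220,
    "tiled + unrolled + vectorized + parallelized": 49.212,
}
gflops_040 = {
    "naive": 5.837,
    "vectorized": 20.858,
    "stdlib vectorize": 22.004,
    "vectorized + parallelized": 53.450,
    "tiled + vectorized + parallelized": 58.425,
    "tiled + unrolled + vectorized + parallelized": 62.657,
}

# Ratio > 1.0 means 0.4.0 was faster than 0.3.1 for that variant.
ratios = {name: gflops_040[name] / gflops_031[name] for name in gflops_031}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

In these numbers all three parallelized variants come out above 1.0, while the single-threaded variants are slightly below it.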

jackos · 2023-11-04T17:35:17Z

EDIT: Might be specific to Linux, testing now

jackos · 2024-01-09T00:49:02Z

Resolved in the next release.

tairov added the bug (Something isn't working) and mojo (Issues that are related to mojo) labels Oct 9, 2023

jackos mentioned this issue Oct 10, 2023

[Feature Request] Allow installing different mojo versions from modular SDK #1032

Closed

1 task

ematejska added the mojo-lang (Tag for all issues related to language) label Oct 10, 2023

jackos self-assigned this Oct 10, 2023

Mogball added the mojo-examples (Updates to the examples) label Oct 19, 2023

Sharktheone mentioned this issue Nov 3, 2023

[BUG]: 10x String allocations moving from 0.4.0 to 0.5.0 in recursive loop #1216

Open

jackos closed this as completed Jan 9, 2024

ematejska added the mojo-repo (Tag all issues with this label) label May 6, 2024
