[BUG]: Performance degradation after updating to Mojo v 0.4.0 #1030

Closed
tairov opened this issue Oct 9, 2023 · 8 comments
Labels: bug, mojo, mojo-examples, mojo-lang, mojo-repo

@tairov

tairov commented Oct 9, 2023

Bug description

After updating to Mojo 0.4.0, we adapted the llama2.mojo code to the latest changes, primarily those related to the global Runtime. The master branch of llama2.mojo is now compatible with Mojo 0.4.0.
However, with this upgrade we noticed a performance degradation on smaller baby llama models such as stories15M.bin.

v0.3.1 achieved 420 to 450 tok/s
v0.4.0 achieved 390 to 410 tok/s

Mojo v. 0.3.1

$ mojo --version
mojo 0.3.1 (a3eed7c8)

$ git clone git@github.com:tairov/llama2.mojo.git
$ cd llama2.mojo
# check out the latest commit compatible with v0.3.1
$ git checkout e9c0705c84a578a62e30b2104ca483073c9d092c
$ mojo build llama2.mojo
$ ./llama2 stories15M.bin -t 0.0 -s 100
num hardware threads:  3
SIMD vector width:  16
checkpoint size:  60816028 [ 57 MB ]
n layers:  6
vocab size:  32000
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red ball in the sky. It was the sun! She thought it was so pretty.
Lily wanted to play with the ball, but it was too high up in the sky. She tried to jump and reach it, but she couldn't. Then, she had an idea. She would use a stick to knock the ball down.
Lily found a stick and tried to hit the ball. But the stick was too short. She tried again and again, but she couldn't reach it. She felt sad.
Suddenly, a kind man came by and saw Lily. He asked her what was wrong. Lily told him about the ball. The man smiled and said, "I have a useful idea!" He took out a long stick and used it to knock the ball down. Lily was so happy! She thanked the man and they played together in the sunshine.
achieved tok/s:  440.0

Mojo v. 0.4.0

$ git clone git@github.com:tairov/llama2.mojo.git
$ cd llama2.mojo
$ mojo build llama2.mojo

$ mojo --version
mojo 0.4.0 (9e33b013)
$ ./llama2 stories15M.bin -j 3 -t 0.0 -s 100
num hardware threads: 3  SIMD width: 16
checkpoint size:  60816028 [ 57 MB ] | n layers: 6 | vocab size: 32000
Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she saw a big, red ball in the sky. It was the sun! She thought it was so pretty.
Lily wanted to play with the ball, but it was too high up in the sky. She tried to jump and reach it, but she couldn't. Then, she had an idea. She would use a stick to knock the ball down.
Lily found a stick and tried to hit the ball. But the stick was too short. She tried again and again, but she couldn't reach it. She felt sad.
Suddenly, a kind man came by and saw Lily. He asked her what was wrong. Lily told him about the ball. The man smiled and said, "I have a useful idea!" He took out a long stick and used it to knock the ball down. Lily was so happy! She thanked the man and they played together in the sunshine.
achieved tok/s:  398.55072463768113
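To put the two sample runs above on a common scale, the slowdown can be expressed as a relative change. This is a minimal Python sketch using only the tok/s figures quoted in this report; the helper name `relative_change` is made up for illustration and is not part of llama2.mojo.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


# tok/s from the two sample runs quoted in the report
V031_TOKS = 440.0
V040_TOKS = 398.55072463768113

change = relative_change(V031_TOKS, V040_TOKS)

if __name__ == "__main__":
    # A negative value means 0.4.0 is slower than 0.3.1.
    print(f"single run, 0.3.1 -> 0.4.0: {change:+.1%}")
```

For these two runs the change is a drop of a little over 9%. A single run per version is a small sample, so the quoted ranges (420 to 450 vs. 390 to 410 tok/s) are the better guide to the size of the regression.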

Steps to reproduce

  • Include relevant code snippet or link to code that did not work as expected.
  • see description
  • If applicable, add screenshots to help explain the problem.
  • If using the Playground, name the pre-existing notebook that failed and the steps that led to failure.
  • Include anything else that might help us debug the issue.

System information

- What OS did you install Mojo on?
Ubuntu 20.04
- Provide version information for Mojo by pasting the output of `mojo -v`
mojo -v
mojo 0.3.1 (a3eed7c8)

mojo -v
mojo 0.4.0 (9e33b013)


- Provide Modular CLI version by pasting the output of `modular -v`

modular -v
modular 0.1.4 (6b54d308)
@tairov added the bug and mojo labels Oct 9, 2023
@jackos
Collaborator

jackos commented Oct 10, 2023

See if this PR helps: tairov/llama2.mojo#46

On my M2 Max running macOS it more than doubled performance; you might see similar results on Linux.

@ematejska added the mojo-lang label Oct 10, 2023
@tairov
Author

tairov commented Oct 10, 2023

Thanks @jackos, I merged your PR, though I decided to keep the default number of workers unchanged.
As you can see in my description, I pass the recently introduced -j param when running ./llama2. This param lets users set the desired number of workers/cores; its default value is num_cores().
So I ran both tests with 3 workers (on my test machine that means 3 out of 6 cores).
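The behaviour described for -j (use the given worker count, otherwise fall back to the number of cores) can be sketched as follows. This is a Python analogue for illustration only, not the llama2.mojo implementation: `parse_workers` is a hypothetical helper, and `os.cpu_count()` stands in for Mojo's `num_cores()`.

```python
import argparse
import os


def parse_workers(argv: list[str]) -> int:
    """Return the worker count from `-j`, defaulting to the number of cores."""
    parser = argparse.ArgumentParser()
    # os.cpu_count() may return None, so fall back to a single worker.
    parser.add_argument("-j", dest="workers", type=int,
                        default=os.cpu_count() or 1)
    # Ignore the other options (checkpoint path, -t, -s) in this sketch.
    args, _unknown = parser.parse_known_args(argv)
    return max(1, args.workers)


if __name__ == "__main__":
    print(parse_workers(["-j", "3"]))  # explicit worker count
    print(parse_workers([]))           # falls back to the core count
```

Pinning the worker count on both versions, as done here with 3 workers, keeps the comparison from being skewed by a changed default.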

@jackos
Collaborator

jackos commented Oct 10, 2023

No worries, thanks @tairov. I'm investigating what's happening here; thanks for raising this.

@jackos self-assigned this Oct 10, 2023
@prabhuramachandran
Contributor

I can also report performance degradation with the new changes in all my toy examples. Unfortunately, I lost my earlier 0.3.1 install in the update and cannot get those numbers back (#1032 would be nice). Even before the update, the vanilla parallelize function was slower than the variant where one explicitly passed a Runtime instance. With v0.4.0, setting the number of workers in the parallelize function does not seem to increase the number of threads in use, according to htop and the system monitor.
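The thread count that htop shows can also be read programmatically. The sketch below assumes a Linux system, where `/proc/<pid>/status` contains a `Threads:` field; the function names are hypothetical and nothing here is specific to Mojo.

```python
from pathlib import Path


def thread_count_from_status(status_text: str) -> int:
    """Extract the value of the `Threads:` field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no Threads field found")


def thread_count(pid: int) -> int:
    """Return the current thread count of process `pid` (Linux only)."""
    return thread_count_from_status(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    # Parse a hand-written sample so the sketch also runs off Linux.
    sample = "Name:\tllama2\nState:\tR (running)\nThreads:\t4\n"
    print(thread_count_from_status(sample))
```

Polling `thread_count(pid)` for the benchmark process while changing the worker setting would show whether the setting has any effect on the number of threads.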

@Mogball added the mojo-examples label Oct 19, 2023
@Mogball

Mogball commented Nov 3, 2023

Did we ever get to the bottom of this?

@jackos
Collaborator

jackos commented Nov 3, 2023

@Mogball Building and testing older versions on my M2 Max shows improved performance with the global runtime:

0.3.1
Throughput of a 128x128 matrix multiplication in Python:
0.00595777921927949 GFLOP/s
Throughput of a 512x512 matrix multiplication in Mojo using a naive algorithm:
6.2878590498800753 GFLOP/s <> 1055 x speedup over Python
Throughput of a 512x512 matrix multiplication in Mojo using vectorization:
22.415820339052122 GFLOP/s <> 3762 x speedup over Python
Throughput of a 512x512 matrix multiplication in Mojo using the stdlib `vectorize`:
22.311811429918414 GFLOP/s <> 3744 x speedup over Python
Throughput of a 512x512 {vectorized + parallelized} matrix multiplication in Mojo:
47.302493854924229 GFLOP/s <> 7939 x speedup over Python
Throughput of a 512x512 {tiled + vectorized + parallelized} matrix multiplication in Mojo:
46.219825445061026 GFLOP/s <> 7757 x speedup over Python
Throughput of a 512x512 {tiled + unrolled + vectorized + parallelized} matrix multiplication in Mojo:
49.211583140549266 GFLOP/s <> 8260 x speedup over Python

0.4.0
Throughput of a 128x128 matrix multiplication in Python:
0.005589797536271804 GFLOP/s
Throughput of a 512x512 matrix multiplication in Mojo using a naive algorithm:
5.8369915752720472 GFLOP/s <> 1044 x speedup over Python
Throughput of a 512x512 matrix multiplication in Mojo using vectorization:
20.857565991801668 GFLOP/s <> 3731 x speedup over Python
Throughput of a 512x512 matrix multiplication in Mojo using the stdlib `vectorize`:
22.003744896834181 GFLOP/s <> 3936 x speedup over Python
Throughput of a 512x512 {vectorized + parallelized} matrix multiplication in Mojo:
53.450444701130209 GFLOP/s <> 9562 x speedup over Python
Throughput of a 512x512 {tiled + vectorized + parallelized} matrix multiplication in Mojo:
58.424512941257383 GFLOP/s <> 10451 x speedup over Python
Throughput of a 512x512 {tiled + unrolled + vectorized + parallelized} matrix multiplication in Mojo:
62.657210194410105 GFLOP/s <> 11209 x speedup over Python

0.5.0
Python:         0.006 GFLOPS
Numpy:        117.664 GFLOPS
Naive:          6.313 GFLOPS   1074.06x Python  0.05x Numpy
Vectorized:    21.996 GFLOPS   3741.99x Python  0.19x Numpy
Parallelized:  52.245 GFLOPS   8887.96x Python  0.44x Numpy
Tiled:         59.179 GFLOPS  10067.71x Python  0.50x Numpy
Unrolled:      63.535 GFLOPS  10808.73x Python  0.54x Numpy
Accumulated:  562.716 GFLOPS  95730.25x Python  4.78x Numpy

There were other changes made to llama2.mojo between versions. I don't see any overhead caused by the global runtime in the isolated matmul test.
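The 0.3.1 and 0.4.0 matmul figures above can be compared variant by variant. This is a small Python sketch with the GFLOP/s values copied from the two logs and rounded to three decimals; the short variant labels are shorthand introduced here, not names from the benchmark.

```python
# GFLOP/s from the matmul logs above, rounded to three decimals: (0.3.1, 0.4.0)
RESULTS = {
    "naive": (6.288, 5.837),
    "vectorized": (22.416, 20.858),
    "stdlib vectorize": (22.312, 22.004),
    "vectorized + parallelized": (47.302, 53.450),
    "tiled + vectorized + parallelized": (46.220, 58.425),
    "tiled + unrolled + vectorized + parallelized": (49.212, 62.657),
}


def percent_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` in percent of `before`."""
    return (after - before) / before * 100.0


changes = {name: percent_change(old, new) for name, (old, new) in RESULTS.items()}

if __name__ == "__main__":
    for name, pct in changes.items():
        print(f"{name:45s} {pct:+6.1f}%")
```

In these numbers every parallelized variant is faster on 0.4.0, while the single-threaded variants are slightly slower. That matches the conclusion that the global runtime adds no overhead in this isolated test on macOS, and it says nothing about the Linux llama2.mojo case.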

@jackos
Collaborator

jackos commented Nov 4, 2023

EDIT: Might be specific to Linux, testing now

@jackos
Collaborator

jackos commented Jan 9, 2024

Resolved in next release

@jackos closed this as completed Jan 9, 2024
@ematejska added the mojo-repo label May 6, 2024